How can the email address of the corresponding author of a publication, such as https://doi.org/10.1039/C7CS00709D, be extracted with Pentaho Data Integration?
- Fetch the HTML of the publication via the REST step and store it in a single field.
- Extract the email via the “Regex Evaluation” step, using the regex
with the step options:
- Enable dotall mode
- Enable multiline mode
The first email address appearing in the HTML will be put into the field email.
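For testing the approach outside of Pentaho, the same idea can be sketched in Python. This is only a rough illustration: the pattern below is an assumption (the regex used in the step is not shown here), and the function names `fetch_html` and `extract_first_email` are made up for this example. The pattern is written to match the whole HTML string with one capture group, which is why dotall mode matters.

```python
import re
import urllib.request

# Assumed pattern, not the one from the original step: a lazy prefix,
# one capture group for a simple email shape, then the rest of the text.
EMAIL_PATTERN = re.compile(
    r".*?([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}).*",
    re.DOTALL | re.MULTILINE,  # mirrors the two step options listed above
)


def fetch_html(url: str) -> str:
    """Download a page and return its body as text (hypothetical helper)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_first_email(html: str):
    """Return the first email-like string in html, or None if there is none."""
    match = EMAIL_PATTERN.fullmatch(html)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = (
        "<html>\n"
        '<a href="mailto:jane.doe@example.org">Jane Doe</a>\n'
        "<p>Second contact: john@example.com</p>\n"
        "</html>"
    )
    print(extract_first_email(sample))
```

To try it on a real page, pass the result of `fetch_html("https://doi.org/10.1039/C7CS00709D")` to `extract_first_email`. Whether the publisher's page exposes the address in plain HTML is not guaranteed, so the result may be `None`.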
Alternatively, the online service https://www.convertcsv.com/email-extractor.htm can also extract emails from several websites.