How can I extract the email of the corresponding author of a publication, such as https://doi.org/10.1039/C7CS00709D, with Pentaho Data Integration?


- Get the HTML of the publication via the REST Client step and store it in one field.
- Extract the email via the “Regex evaluation” step using the regex
  .*mailto:([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+).*
  with the step options:
  - Enable dotall mode
  - Enable multiline mode


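The email matched by the capture group is put into the field email. Because the leading .* is greedy, this is the last mailto: address appearing in the HTML; to get the first one, start the regex with .*? instead.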
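To check the regex outside of Pentaho before wiring up the transformation, it can be reproduced in a few lines of Python. This is only a rough sketch: it assumes the “Regex evaluation” step matches the whole field value (hence `fullmatch`), and the names `extract_email`, `GREEDY`, `LAZY` and the sample HTML are made up for illustration, not taken from the real publication page. It also shows that the greedy leading `.*` picks the last `mailto:` address, while a lazy `.*?` picks the first.

```python
import re

# Same pattern as in the Regex evaluation step.
# DOTALL lets "." match line breaks; MULTILINE only affects ^ and $.
GREEDY = re.compile(
    r".*mailto:([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+).*",
    re.DOTALL | re.MULTILINE,
)

# Variant with a lazy leading ".*?" (an assumption, not part of the original regex).
LAZY = re.compile(
    r".*?mailto:([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+).*",
    re.DOTALL | re.MULTILINE,
)


def extract_email(html, pattern=GREEDY):
    """Return the captured address, or None if the HTML has no mailto link."""
    match = pattern.fullmatch(html)
    return match.group(1) if match else None


# Hypothetical HTML with two mailto links.
sample_html = """<html><body>
<a href="mailto:first.author@example.org">First</a>
<a href="mailto:second.author@example.org">Second</a>
</body></html>"""

print(extract_email(sample_html, GREEDY))  # second.author@example.org
print(extract_email(sample_html, LAZY))    # first.author@example.org
```

If the corresponding author is not the last (or first) `mailto:` link on the page, neither variant is enough on its own and the surrounding markup would have to be included in the pattern.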
Alternatively, the online service https://www.convertcsv.com/email-extractor.htm offers a convenient way to extract emails from several websites.
