In the “Replace in string” step you can also use RegEx to remove variable content.
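The idea of removing variable content with a regular expression can be sketched outside of Pentaho as follows. This is only an illustration in Python, not the step itself: the helper name, the sample URL, and the session pattern are all hypothetical, and Pentaho uses Java regex syntax, which behaves the same way for a simple pattern like this one.

```python
import re


def remove_variable_content(text: str, pattern: str) -> str:
    """Replace every match of `pattern` with an empty string."""
    return re.sub(pattern, "", text)


# Hypothetical example: strip a changing session parameter from a URL.
url = "https://example.org/article?id=42&session=abc123"
print(remove_variable_content(url, r"&session=[^&]*"))
# prints: https://example.org/article?id=42
```

In the step itself, the equivalent would presumably be to enable the regex option, enter the pattern in the search column, and leave the replacement empty.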


How can you extract the email address of the corresponding author of a publication, such as https://doi.org/10.1039/C7CS00709D, with Pentaho Data Integration?
The first email address appearing in the HTML is written to the field email.
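The "first email address in the HTML" logic can be sketched in Python as shown here. The email pattern is a deliberately simple assumption, not the one used in the original transformation, and the sample HTML is made up; real pages may need a stricter or looser pattern.

```python
import re
from typing import Optional

# Simplified email pattern (an assumption, not a full RFC 5322 matcher).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def first_email(html: str) -> Optional[str]:
    """Return the first email-like string found in `html`, or None."""
    match = EMAIL_PATTERN.search(html)
    return match.group(0) if match else None


# Hypothetical HTML snippet with two addresses; only the first is returned.
sample_html = (
    '<p>Contact: <a href="mailto:jane.doe@example.org">Jane Doe</a>, '
    "john@example.com</p>"
)
print(first_email(sample_html))
# prints: jane.doe@example.org
```

Note that the first address on a page is not guaranteed to belong to the corresponding author, so the result is worth checking by hand for important records.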
Alternatively, the online service https://www.convertcsv.com/email-extractor.htm offers a convenient way to extract email addresses from several websites at once.
You can extract a pattern as an additional field with Pentaho using the “Regex evaluation” step.
Example: to extract the arXiv ID, use this regex: .*(\d{4}\.\d{4,5}).*
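The behaviour of this regex can be checked quickly in Python. The sketch below assumes that the step matches the pattern against the whole field value (hence `fullmatch`) and that the capture group is written to a new field; the function name and sample values are hypothetical.

```python
import re
from typing import Optional

# Same pattern as above: the group captures four digits, a dot, then 4-5 digits.
ARXIV_PATTERN = re.compile(r".*(\d{4}\.\d{4,5}).*")


def extract_arxiv_id(value: str) -> Optional[str]:
    """Return the captured arXiv ID, or None if the value does not match."""
    match = ARXIV_PATTERN.fullmatch(value)
    return match.group(1) if match else None


print(extract_arxiv_id("https://arxiv.org/abs/1706.03762v5"))
# prints: 1706.03762
print(extract_arxiv_id("no identifier in this text"))
# prints: None
```

The surrounding `.*` parts are what allow the pattern to cover the whole value while the parentheses isolate the ID itself.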
To merge several similar Excel files into one stream, you can read a whole directory using this wildcard:
.*\.xlsx
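To see which files such a wildcard selects, the following Python sketch applies the same regex to a list of file names. It assumes the wildcard is matched against the complete file name; the file names are invented for illustration.

```python
import re
from typing import List

# The wildcard from above, treated as a regex over the full file name.
WILDCARD = re.compile(r".*\.xlsx")


def matching_files(names: List[str]) -> List[str]:
    """Keep only the names that match the wildcard completely."""
    return [name for name in names if WILDCARD.fullmatch(name)]


# Hypothetical directory listing.
names = ["report_2021.xlsx", "report_2022.xlsx", "notes.txt", "legacy.xls", "data.xlsx.bak"]
print(matching_files(names))
# prints: ['report_2021.xlsx', 'report_2022.xlsx']
```

Under this assumption, older .xls files and backup copies such as data.xlsx.bak are skipped, which is usually what you want when all files must share the same layout.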
For large XLSX files, use the “Apache POI Streaming” engine (not “Apache POI”).