Extract email from website

How to extract the email of the corresponding author of a publication, like: https://doi.org/10.1039/C7CS00709D with Pentaho Data integration?

https://doi.org/10.1039/C7CS00709D as rendered HTML
https://doi.org/10.1039/C7CS00709D (excerpt of HTML source code)
  1. Get the HTML of the publications via REST Step, store it in one field.
  2. Extract email via “Regex evaluation” step using the Regex
    .*mailto:([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+).*
    with the step options:
    • Enable dotall mode
    • Enable multiline mode

The first email appearing in the HTML will put into the filed email.

Alternatively the Online Service https://www.convertcsv.com/email-extractor.htm also provides a nice possibility to extract emails from several websites:

Add a sub-transformation with mapping steps

In the last post I created a sub-transformation with a “transformation executor” step. It works, but I had to look up the results from the sub-transformation in a later step. However, Pentaho Data Integration (PDI) however offers a more elegant way to add sub-transformation.

I will use the same example as previously.

a) Sub-Transformation

In your sub-transformation you insert a “Mapping input specific” step at the beginning of your sub-transformation and define in this step what input fields you expect. At the end you add an “Mapping output specification” step, where you don’t have to specify anything.

Add Mapping steps a the beginning and and of the sub-transformation <
Publication_Date_Sub_Mapping.ktr

b) Parent/Main-transformation

So in the main transformation you can add the step “Simple mapping (sub-transformation)”.

sub-transformation in the category Mapping

In this step you can map the fields of the parent transformation to the expected fields that you have defined in the input step of the sub-transformation. If you use the same field names, PDI provides a nice auto-mapping feature in the step options: “Mapping…” -> “Guess…”

Adding “Simple mapping (sub-transformation)” step in Parent/Main transformation – Publication_Date_Main_Mapping.ktr

It is not necessary to specify the “Output” tab, because in this case all fields created in the sub-transformation become available in the following steps of the super/main transformation.

The advantage here is that the fields that you have not passed on to the sub-transformation are directly available in the following steps of the partial/main transformation.

Add a sub-transformation with step “transformation executor”

You can bundle a couple of steps as a transformation and call those steps in another transformation.

My scenario: determine publication date

I often use the Crossref Rest API to get information about publications. Depending on the publisher there are different kind of dates associated with a DOI and the dates can have different resolutions. Sometimes just a year or a year and a month.

Get publications dates from different DOIs (using REST and JSON) – Simple_Rest_Query_Crossref.ktr

In order to get always a specific publication date with the resolution YYYY.mm.dd I use a couple of steps and logic to determine the “relevant” publication date out from those different date fields.

Adding a couple of steps to determine “publication_date”

To reuse those steps in different transformations without copying each time all these steps I can now save those steps as own transformation. Let’s add a “Get rows from result” and “Copy rows to result” at the beginning and and end this sub-transformation.

subtransformation – Publication_Date_Sub.ktr

Then we can add a “Transformation executor” step in the main transformation. In this step we add the expected “fields” of the sub-transformation in the tab “Results row”

Adding a “transformation executor”-Step in the main transformation – Publication_Date_Main.ktr

As output of a “transformation executor” step there are several options available:

Output-Options of “transformation executor”-Step

There seems to be no option to get the results and pass through the input steps data for the same rows. Probably since the output of the sub-transformation can have more more or less rows than the input. Yet we can create a work-around by going on the the input data and add the results of the sub-transformation with a common (presorted) identifier. At the end we have the original data and the result of the sub-transformation combined.