Row denormaliser (Transpose)

Given a list of key-value pairs, like metadata extracted from HTML meta tags:

To move the values into columns, the Row denormaliser step of PDI or Apache Hop can be used:

This will transpose the rows to columns:

In cases where there are several values for one key, those values can also be aggregated, e.g. with “Concatenate strings separated by ,”.
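To make the idea concrete outside of PDI, here is a minimal Python sketch of what the denormaliser does: it groups the rows, turns each key into a column and concatenates repeated keys. The field names (url, key, value) and the sample rows are made up for illustration and are not taken from the transformation above.

```python
from collections import defaultdict


def denormalise(rows, group_field="url", key_field="key", value_field="value", sep=","):
    """Pivot key/value rows into one record per group.

    Values that share the same key within a group are concatenated with `sep`,
    similar to the "Concatenate strings separated by ," aggregation.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row[group_field]][row[key_field]].append(row[value_field])

    records = []
    for group, keys in grouped.items():
        record = {group_field: group}
        for key, values in keys.items():
            record[key] = sep.join(values)
        records.append(record)
    return records


# Hypothetical sample rows: one page with a title and two authors.
rows = [
    {"url": "page-1", "key": "citation_title", "value": "A title"},
    {"url": "page-1", "key": "citation_author", "value": "Jane Doe"},
    {"url": "page-1", "key": "citation_author", "value": "Richard Roe"},
]

print(denormalise(rows))
```

Each distinct key becomes a column of the output record, which is roughly what the target fields of the step describe.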

Force Pentaho to use UTF-8 with the MySQL JDBC driver

In order to insert or update UTF-8 data from a transformation into a DB field, the JDBC driver has to be explicitly configured with characterEncoding=UTF-8.

This then also works with Polish diacritics like “Młotkowski”.
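MySQL Connector/J reads connection properties such as characterEncoding from the query string of the JDBC URL. The following small Python sketch only shows what such a URL looks like; host, port and database name are placeholders, and the helper function is hypothetical.

```python
from urllib.parse import urlencode


def mysql_jdbc_url(host, port, database, **options):
    """Build a MySQL JDBC URL with driver options appended as query parameters."""
    query = urlencode(options)
    base = f"jdbc:mysql://{host}:{port}/{database}"
    return f"{base}?{query}" if query else base


# Placeholder connection details, only the characterEncoding option comes from the text.
url = mysql_jdbc_url("localhost", 3306, "publications", characterEncoding="UTF-8")
print(url)
```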

Format Dates with Full Month

Given the following JSON:

{
  "Available online": "29 January 2020",
  "Received": "8 November 2019",
  "Revised": [ "16 January 2020" ],
  "Accepted": "23 January 2020",
  "Publication date": "1 May 2020"
}

The dates can be parsed with the format mask dd MMMM yyyy:
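For comparison, the same parsing can be sketched in Python, where the rough equivalent of the mask dd MMMM yyyy is %d %B %Y. This assumes an English locale for the month names; the helper name is my own.

```python
import json
from datetime import datetime

raw = """{ "Available online": "29 January 2020", "Received": "8 November 2019",
"Revised": [ "16 January 2020" ], "Accepted": "23 January 2020",
"Publication date": "1 May 2020" }"""


def parse_full_month(text):
    """Parse dates like '8 November 2019' (day, full month name, year)."""
    # %B is the full month name in the current locale (English assumed here).
    return datetime.strptime(text, "%d %B %Y").date()


parsed = {}
for label, value in json.loads(raw).items():
    # "Revised" is a list, the other entries are plain strings.
    values = value if isinstance(value, list) else [value]
    parsed[label] = [parse_full_month(v) for v in values]

for label, dates in parsed.items():
    print(label, [d.isoformat() for d in dates])
```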

Scraping Elsevier abstract pages

Get a list of Elsevier’s IDs via Crossref

To get a first impression of the content and size, I used the facet search of the Crossref REST API:

https://api.crossref.org/works?filter=member:78&facet=published:*

So for 2020 there are 542k entries. In order to download the metadata (or in my case just the DOI, the date of DOI registration and the Elsevier internal ID), I again use the Crossref REST API.

A single Crossref query returns at most 1000 DOIs. Using cursors, however, it is possible to loop further and get about 100k DOIs before the API times out. So in order to get all publications from 2020, I created monthly batches based on the created date, like:

https://api.crossref.org/works?filter=member:78,from-created-date:2020-01-01,until-created-date:2020-01-31&select=DOI,created,alternative-id&rows=1000&cursor=*

In PDI I’ve created a job that handles the cursor and repeats the transformation with the REST query as long as a new cursor comes back from Crossref.
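The logic of that job can be sketched in Python as follows. This is not the PDI job itself, only an illustration of the cursor loop and the monthly batches. It assumes that the Crossref response carries the records in message.items and the next cursor in message.next-cursor, and it stops as soon as a page comes back empty. The function names are my own.

```python
import calendar
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://api.crossref.org/works"


def fetch_page(params):
    """Fetch one page from Crossref and return the decoded 'message' object."""
    with urlopen(f"{API}?{urlencode(params)}") as response:
        return json.load(response)["message"]


def monthly_filters(year, member=78):
    """Yield one filter expression per month, based on the created date."""
    for month in range(1, 13):
        last_day = calendar.monthrange(year, month)[1]
        yield (
            f"member:{member},"
            f"from-created-date:{year}-{month:02d}-01,"
            f"until-created-date:{year}-{month:02d}-{last_day:02d}"
        )


def harvest(filter_expr, fetch=fetch_page, rows=1000):
    """Yield all items of one filtered query by following the cursor."""
    cursor = "*"
    while cursor:
        message = fetch({
            "filter": filter_expr,
            "select": "DOI,created,alternative-id",
            "rows": rows,
            "cursor": cursor,
        })
        items = message.get("items", [])
        if not items:
            break  # an empty page means the batch is complete
        yield from items
        cursor = message.get("next-cursor")
```

A full run would then loop over monthly_filters(2020) and call harvest() for each filter. The fetch parameter only exists so that the loop can be tried out with canned responses instead of real requests.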

This gives me a list of all alternative IDs (like S0960982219315106) of Elsevier articles. With such an ID I can build the URL of the article’s abstract page: https://www.sciencedirect.com/science/article/abs/pii/S0960982219315106

Getting HTML Abstract Page

Using the Pentaho HTTP client, it’s now possible to get the HTML of this abstract page.

https://www.sciencedirect.com/science/article/abs/pii/S0960982219315106

It’s important to send a User-Agent HTTP header like “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5)”; otherwise the Pentaho client will be routed to an error page.
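Outside of PDI, the same request can be sketched with Python’s standard library. The function names are my own; the URL pattern and the header value are the ones from above. Whether ScienceDirect accepts a given client may of course change over time.

```python
from urllib.request import Request, urlopen

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5)"


def abstract_url(pii):
    """Build the abstract page URL from the Elsevier ID (alternative-id)."""
    return f"https://www.sciencedirect.com/science/article/abs/pii/{pii}"


def build_request(pii):
    """Create a request that carries a browser-like User-Agent header."""
    return Request(abstract_url(pii), headers={"User-Agent": USER_AGENT})


def fetch_abstract_html(pii):
    """Download the abstract page and return its HTML as one string."""
    with urlopen(build_request(pii)) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_abstract_html("S0960982219315106") would perform the actual download; nothing is requested when the block is only loaded.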

Extracting data from abstract HTML

Elsevier makes it very easy to extract information about the article without going deep into the HTML-Structure of ScienceDirect. Within the HTML there’s a whole section with structured data in JSON:

view-source:https://www.sciencedirect.com/science/article/abs/pii/S0960982219315106

With the whole HTML loaded into one field, the following regex can be used to extract the JSON block:

.*<script type="application/json" data-iso-key="_0">(.*)</script>
<iframe style="display: none.*

Pentaho Step: Regex evaluation (Content Tab: Activate “Enable dotall mode” & “Enable multiline mode”)
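As a rough Python equivalent, the block can be pulled out with re.search and the DOTALL flag. Note that this sketch differs from the PDI regex: instead of matching the whole document and anchoring on the following iframe, it searches for the script tag and captures non-greedily up to the next closing tag, which assumes that the JSON itself contains no literal closing script tag. The sample HTML and its keys are invented.

```python
import json
import re

# DOTALL lets "." also match line breaks, like "Enable dotall mode" in PDI.
JSON_BLOCK = re.compile(
    r'<script type="application/json" data-iso-key="_0">(.*?)</script>',
    re.DOTALL,
)


def extract_json_block(html):
    """Return the parsed JSON block of an abstract page, or None if it is missing."""
    match = JSON_BLOCK.search(html)
    return json.loads(match.group(1)) if match else None


# Invented sample page, only the script tag attributes follow the original.
sample = (
    '<html><script type="application/json" data-iso-key="_0">{"article":\n'
    '{"pii": "S0960982219315106"}}</script>\n'
    '<iframe style="display: none"></iframe></html>'
)

print(extract_json_block(sample))
```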

Extracting data from JSON

Having the JSON data block separated from the rest of the HTML, we can now pass it to the JSON input step. There is really plenty of information, much more than you would see in the front end.

Exploring the JSON data and getting the JSONPath via https://jsonpathfinder.com/

For the moment I’m mostly interested in the author and affiliation data:

Extracting Affiliations

In the JSON we can extract:

whole affiliation section
individual affiliation
id, label, name (textfn) of affiliation

Extracting Authors

Extracting the authors works similarly to the affiliations.

whole author section: $..[?(@.#name=='author')]
individual author: $.$.id ; $.$.orcid ; $..$$.[?(@.#name=='given-name')]._
label: $.$$.[?(@.#name=='cross-ref')].$$.[?(@.#name=='sup')]._

An author can have multiple labels/refids, emails and contributor roles. In order to get those flat in one row, separated by a comma, a “Group by” step can be used, and the result is brought back into the stream with the “Stream lookup” step:
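The JSONPath expressions above suggest a structure in which every node has a "#name", attributes under "$", children under "$$" and its text under "_". Under that assumption, the author extraction including the comma-separated labels can be sketched in Python like this. The sample data is invented and much smaller than the real block.

```python
def find_nodes(node, name):
    """Recursively yield every dict whose '#name' equals `name`."""
    if isinstance(node, dict):
        if node.get("#name") == name:
            yield node
        for value in node.values():
            yield from find_nodes(value, name)
    elif isinstance(node, list):
        for item in node:
            yield from find_nodes(item, name)


def texts(node, name):
    """Collect the text ('_') of all descendants with the given '#name'."""
    return [n["_"] for n in find_nodes(node, name) if "_" in n]


def flatten_author(author):
    """Turn one author node into a flat row, labels joined by a comma."""
    attributes = author.get("$", {})
    labels = [
        label
        for cross_ref in find_nodes(author, "cross-ref")
        for label in texts(cross_ref, "sup")
    ]
    return {
        "id": attributes.get("id"),
        "orcid": attributes.get("orcid"),
        "given_name": " ".join(texts(author, "given-name")),
        "labels": ",".join(labels),
    }


# Invented sample that follows the assumed node structure.
sample = {"#name": "author-group", "$$": [
    {"#name": "author", "$": {"id": "au1", "orcid": "0000-0000-0000-0000"}, "$$": [
        {"#name": "given-name", "_": "Jane"},
        {"#name": "cross-ref", "$": {"refid": "aff1"}, "$$": [{"#name": "sup", "_": "a"}]},
        {"#name": "cross-ref", "$": {"refid": "aff2"}, "$$": [{"#name": "sup", "_": "b"}]},
    ]},
]}

authors = [flatten_author(a) for a in find_nodes(sample, "author")]
print(authors)
```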

Merge affiliations and authors

In a third step, authors and affiliations are merged by the label in order to get a list of authors with their affiliations:

Extracted authors and affiliations from Elsevier articles
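The merge itself is a simple lookup by label. A minimal Python sketch, assuming flat author rows with a comma-separated labels field and affiliation rows with label and name (all field names and sample values are illustrative):

```python
def merge_by_label(authors, affiliations):
    """Create one row per author/affiliation pair, joined on the label."""
    by_label = {aff["label"]: aff["name"] for aff in affiliations}
    merged = []
    for author in authors:
        for label in author["labels"].split(","):
            label = label.strip()
            merged.append({
                "author": author["name"],
                "label": label,
                # Authors without a matching label are kept with an empty affiliation.
                "affiliation": by_label.get(label),
            })
    return merged


# Invented sample rows.
authors = [
    {"name": "Jane Doe", "labels": "a,b"},
    {"name": "Richard Roe", "labels": "b"},
]
affiliations = [
    {"label": "a", "name": "University A"},
    {"label": "b", "name": "Institute B"},
]

for row in merge_by_label(authors, affiliations):
    print(row)
```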

Unescape / Convert HTML Characters

When working with HTML, you may want to convert a string from HTML-escaped characters to plain text (and vice versa). Pentaho Data Integration provides that option within the Calculator step:

This allows the following change:

Detlef G&uuml;nther <-> Detlef Günther
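For comparison, Python’s standard library does the unescaping with html.unescape. The reverse direction shown here is only one possible variant: it produces numeric character references instead of named entities.

```python
import html

escaped = "Detlef G&uuml;nther"

# HTML entities -> plain characters
plain = html.unescape(escaped)
print(plain)

# Plain characters -> numeric character references for everything non-ASCII
numeric = plain.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(numeric)
```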

Replace string with JavaScript

As an alternative to the “Replace in string” step, the “Modified JavaScript value” step can be used to replace something in a string, e.g. here to replace a carriage return with a white space:

var Author = replace(Author_seperated, "\r", " ");

Remove a pattern in string with RegEx

In the “Replace in string” step you can also use RegEx to remove variable content.

https://regexr.com/ can be used to test your regex.
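The same kind of removal can be tried out in Python with re.sub and an empty replacement. The pattern below is a hypothetical example (bracketed reference numbers), not the one used in the transformation.

```python
import re


def remove_pattern(text, pattern):
    """Remove every match of `pattern` from `text`."""
    return re.sub(pattern, "", text)


# Hypothetical pattern: optional whitespace followed by a number in square brackets.
cleaned = remove_pattern("Jane Doe [1], Richard Roe [23]", r"\s*\[\d+\]")
print(cleaned)
```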

Maximum of several numbers using JavaScript

Using the “Modified JavaScript value” step, you can call JavaScript functions like Math.max().

Extract email from website

How can the email of the corresponding author of a publication, like https://doi.org/10.1039/C7CS00709D, be extracted with Pentaho Data Integration?

https://doi.org/10.1039/C7CS00709D as rendered HTML

https://doi.org/10.1039/C7CS00709D (excerpt of HTML source code)

Get the HTML of the publications via REST Step, store it in one field.
Extract email via “Regex evaluation” step using the Regex
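.*?mailto:([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+).*
with the step options:
- Enable dotall mode
- Enable multiline mode

The first email appearing in the HTML will be put into the field email. The leading .*? has to be non-greedy for this: with a greedy .* the regex would capture the last mailto: address in the HTML instead.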
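The extraction can also be tried out in Python. This sketch uses re.search, which returns the first match in the text, so no surrounding wildcards are needed. The character classes are the ones from the regex above; the sample HTML and the addresses are invented.

```python
import re

MAILTO = re.compile(r"mailto:([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)")


def first_email(html):
    """Return the first mailto: address found in the HTML, or None."""
    match = MAILTO.search(html)
    return match.group(1) if match else None


# Invented sample with two addresses; only the first one is returned.
sample = (
    '<p><a href="mailto:first.author@example.org">E-mail</a></p>\n'
    '<p><a href="mailto:second.author@example.org">E-mail</a></p>'
)

print(first_email(sample))
```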

Alternatively, the online service https://www.convertcsv.com/email-extractor.htm offers a convenient way to extract emails from several websites:

Add a sub-transformation with mapping steps

In the last post I created a sub-transformation with a “Transformation executor” step. It works, but I had to look up the results from the sub-transformation in a later step. However, Pentaho Data Integration (PDI) offers a more elegant way to add a sub-transformation.

I will use the same example as previously.

a) Sub-Transformation

In your sub-transformation you insert a “Mapping input specification” step at the beginning and define in this step which input fields you expect. At the end you add a “Mapping output specification” step, where you don’t have to specify anything.

Mapping steps at the beginning and end of the sub-transformation – Publication_Date_Sub_Mapping.ktr

b) Parent/Main-transformation

So in the main transformation you can add the step “Simple mapping (sub-transformation)”.

sub-transformation in the category Mapping

In this step you can map the fields of the parent transformation to the expected fields that you have defined in the input step of the sub-transformation. If you use the same field names, PDI provides a nice auto-mapping feature in the step options: “Mapping…” -> “Guess…”

Adding “Simple mapping (sub-transformation)” step in Parent/Main transformation – Publication_Date_Main_Mapping.ktr

It is not necessary to specify the “Output” tab, because in this case all fields created in the sub-transformation become available in the following steps of the parent/main transformation.

The advantage here is that the fields that you have not passed on to the sub-transformation are directly available in the following steps of the parent/main transformation.