It’s a common task to convert dates found in external sources to the PDI/Apache Hop date format. While PDI/Apache Hop offers some typical conversion masks in a drop-down, you may need to create your own. Here’s a cheat sheet on how to use the masking parameters:
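The masks follow Java’s java.text.SimpleDateFormat patterns, so a custom mask can be tested in plain Java before putting it into a step. A minimal sketch, with the input format and values made up for illustration:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class DateMaskDemo {
    public static void main(String[] args) throws Exception {
        // yyyy = year, MM = month, dd = day, HH = hour (0-23), mm = minute, ss = second
        SimpleDateFormat external = new SimpleDateFormat("dd.MM.yyyy HH:mm:ss", Locale.ENGLISH);
        Date parsed = external.parse("24.12.2021 18:30:00");

        // MMM = abbreviated month name, EEE = abbreviated day name
        SimpleDateFormat mask = new SimpleDateFormat("EEE, dd MMM yyyy", Locale.ENGLISH);
        System.out.println(mask.format(parsed)); // Fri, 24 Dec 2021
    }
}
```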
The recently released Apache Hop 1.1 comes with a very handy new transform that allows you to extract the metadata and content of different file formats, like PDF, using Apache Tika.
The transform is available in a pipeline:
Defining Input
It allows you to output the content as Plain text, XML, HTML and JSON:
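Under the hood the transform relies on Apache Tika, so the same extraction can be sketched in standalone Java (the file name is a placeholder):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaDemo {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(); // collects the plain text
        Metadata metadata = new Metadata();

        try (InputStream stream = new FileInputStream("example.pdf")) { // placeholder file
            parser.parse(stream, handler, metadata, new ParseContext());
        }

        // Extracted metadata fields (title, author, ...) and the plain-text content
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
        System.out.println(handler.toString());
    }
}
```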
The “Workflow executor” transform in an Apache Hop pipeline allows you to execute the same workflow for several rows. Values from those rows can be passed down to the workflow as parameters, e.g.:
Parent Workflow:
Sub-Workflow (executed for each row), using parameters
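Conceptually, the executor behaves like the following minimal Java sketch, where runWorkflow is a purely hypothetical stand-in for the actual transform and the parameter names are made up:

```java
import java.util.List;
import java.util.Map;

public class WorkflowExecutorSketch {
    // Hypothetical helper standing in for the "Workflow executor" transform:
    // it would start the sub-workflow once with the given parameter values.
    static void runWorkflow(String workflowFile, Map<String, String> parameters) {
        System.out.printf("Executing %s with %s%n", workflowFile, parameters);
    }

    public static void main(String[] args) {
        // Each incoming row becomes one execution of the sub-workflow,
        // with row values mapped to workflow parameters.
        List<Map<String, String>> rows = List.of(
                Map.of("COUNTRY", "PL"),
                Map.of("COUNTRY", "DE"));

        for (Map<String, String> row : rows) {
            runWorkflow("sub-workflow.hwf", row); // file name hypothetical
        }
    }
}
```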
Currently, PDI and Apache Hop do not run natively on the M1. There are reports of it working with the ARM-optimised Azul JDK, but this was not working for me, as the newest SWT library is not yet ARM compatible.
Instead, the following procedure worked for me on my Apple MacBook M1 (1st gen).
In order to insert/update UTF-8 data from a transformation into a DB field, the JDBC driver has to be explicitly configured with characterEncoding=UTF-8.
This then also works with Polish diacritics like “Młotkowski”.
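For illustration, a minimal JDBC sketch, assuming a MySQL database (where characterEncoding is a Connector/J URL parameter); host, schema, table and credentials are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class Utf8InsertDemo {
    public static void main(String[] args) throws Exception {
        // characterEncoding=UTF-8 makes the driver transfer strings as UTF-8;
        // host, database and credentials are placeholders.
        String url = "jdbc:mysql://localhost:3306/testdb?characterEncoding=UTF-8";

        try (Connection con = DriverManager.getConnection(url, "user", "secret");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO authors (name) VALUES (?)")) {
            ps.setString(1, "Młotkowski"); // Polish diacritics survive the round trip
            ps.executeUpdate();
        }
    }
}
```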
So for 2020 there are 542k entries. In order to download the metadata (or in my case just the DOI, the date of DOI registration and the Elsevier internal ID), I’m again using the Crossref REST API.
A Crossref query is limited to a maximum of 1,000 DOIs per request. Using cursors, however, it is possible to loop further and fetch about 100k DOIs before the API times out. So in order to get all publications from 2020, I created monthly batches based on the created date, like:
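A query for January 2020 might look like the following, assuming Elsevier’s Crossref member ID (78); cursor=* starts a new deep-paging session:

```
https://api.crossref.org/works?filter=member:78,from-created-date:2020-01-01,until-created-date:2020-01-31&rows=1000&cursor=*
```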
In PDI I’ve created a job that handles the cursor and repeats the transformation with the REST query as long as there is a new cursor coming back from Crossref.
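Outside of PDI, that cursor loop can be sketched in plain Java. The filter is the one from above, DOI extraction is omitted, and the naive string matching stands in for proper JSON parsing:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrossrefCursorLoop {
    // Naive extraction of "next-cursor" from the JSON response;
    // a real job would use a JSON parser instead.
    private static final Pattern NEXT_CURSOR =
            Pattern.compile("\"next-cursor\"\\s*:\\s*\"([^\"]+)\"");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String filter = "member:78,from-created-date:2020-01-01,until-created-date:2020-01-31";
        String cursor = "*"; // "*" starts a new deep-paging session

        while (cursor != null) {
            String url = "https://api.crossref.org/works?rows=1000&filter=" + filter
                    + "&cursor=" + URLEncoder.encode(cursor, StandardCharsets.UTF_8);
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
            String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

            // ... extract DOIs from the body here ...

            if (body.contains("\"items\":[]")) break; // last page reached

            Matcher m = NEXT_CURSOR.matcher(body);
            cursor = m.find() ? m.group(1) : null; // stop when no new cursor comes back
        }
    }
}
```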
It’s important to send a User-Agent HTTP header like “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5)”, otherwise the Pentaho client will be routed to an error page.
Extracting data from abstract HTML
Elsevier makes it very easy to extract information about the article without going deep into the HTML structure of ScienceDirect. Within the HTML there’s a whole section with structured data in JSON:
Having separated the JSON data block from the rest of the HTML, we can now pass it to the JSON input step. It contains plenty of information, much more than you would see via the front end.
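For illustration, assuming the structured data sits in a script tag of type “application/json” (the exact tag and attributes are an assumption and may differ between page versions), the block could be cut out like this before feeding it to the JSON input step:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonBlockExtractor {
    public static void main(String[] args) throws Exception {
        String html = Files.readString(Path.of("article.html")); // placeholder file

        // Cut the structured-data block out of the page; the tag and its
        // attributes are an assumption about the page structure.
        Pattern script = Pattern.compile(
                "<script type=\"application/json\"[^>]*>(.*?)</script>",
                Pattern.DOTALL);

        Matcher m = script.matcher(html);
        if (m.find()) {
            System.out.println(m.group(1)); // raw JSON, ready for the JSON input step
        }
    }
}
```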
An author can have multiple labels/refids, emails and contributor roles. In order to get those flattened into one row, separated by commas, a “Group by” step can be used, and the result is merged back into the stream using the “Stream lookup” step:
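What the “Group by” step does here can be illustrated in plain Java: group the rows by the author label and join the values with commas (field names and values are made up for the sketch):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupConcatDemo {
    record AuthorRef(String authorLabel, String refId) {}

    public static void main(String[] args) {
        // One row per author/refid pair, as produced by the JSON input step
        List<AuthorRef> rows = List.of(
                new AuthorRef("a", "aff1"),
                new AuthorRef("a", "aff2"),
                new AuthorRef("b", "aff1"));

        // Equivalent of "Group by" with a comma-separated concatenation:
        Map<String, String> refIdsPerAuthor = rows.stream().collect(
                Collectors.groupingBy(AuthorRef::authorLabel,
                        Collectors.mapping(AuthorRef::refId, Collectors.joining(","))));

        System.out.println(refIdsPerAuthor); // e.g. {a=aff1,aff2, b=aff1}
    }
}
```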
Merge affiliations and authors
In a third step, authors and affiliations are merged by the label in order to get a list of authors and affiliations:
Extracted authors and affiliations from Elsevier articles
When working with HTML you may want to change a string from HTML to the default encoding (and vice versa). Pentaho Data Integration provides that option within the Calculator step:
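The same conversion can be reproduced outside of PDI, e.g. with Apache Commons Text (a library choice for this sketch, not necessarily what the step uses internally):

```java
import org.apache.commons.text.StringEscapeUtils;

public class HtmlEncodingDemo {
    public static void main(String[] args) {
        // HTML -> default encoding: numeric/named entities are decoded
        System.out.println(StringEscapeUtils.unescapeHtml4("M&#322;otkowski")); // Młotkowski

        // default encoding -> HTML: markup characters are escaped
        System.out.println(StringEscapeUtils.escapeHtml4("<b>Młotkowski</b>")); // &lt;b&gt;Młotkowski&lt;/b&gt;
    }
}
```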