Get a list of Elsevier’s IDs via Crossref
To get a first impression of the content and size, I used the facet search of the Crossref REST API:
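The facet query itself runs inside PDI here, but a rough stand-alone equivalent is a `rows=0` request, which returns only the hit count and no records. A minimal sketch (member 78 is Elsevier's Crossref member ID, as used in the filter further below):

```python
import json
import urllib.request

# rows=0: return only response metadata such as "total-results",
# not the records themselves. Member 78 is Elsevier.
COUNT_URL = ("https://api.crossref.org/works"
             "?filter=member:78,from-created-date:2020-01-01,"
             "until-created-date:2020-12-31&rows=0")

def total_results(body: dict) -> int:
    """Pull the hit count out of a Crossref response body."""
    return body["message"]["total-results"]

def count_elsevier_2020() -> int:
    """Run the count query against the live API."""
    with urllib.request.urlopen(COUNT_URL) as resp:
        return total_results(json.load(resp))
```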

So for 2020 there are 542k entries. To download the metadata (in my case just the DOI, the date of DOI registration and the Elsevier-internal ID), I again use the Crossref REST API.
A single Crossref query is limited to a maximum of 1,000 DOIs. Using cursors, it is however possible to loop further and retrieve about 100k DOIs before the API times out. So in order to get all publications from 2020, I created monthly batches based on the created date, like:
https://api.crossref.org/works?filter=member:78,from-created-date:2020-01-01,until-created-date:2020-01-31&select=DOI,created,alternative-id&rows=1000&cursor=*
In PDI I've created a job that handles the cursor and repeats the transformation with the REST query as long as Crossref returns a new cursor.
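The cursor loop the PDI job implements can be sketched in Python. The paging logic is generic (start at `cursor=*`, follow `next-cursor` until an empty page comes back); the request parameters mirror the URL above:

```python
import json
import urllib.parse
import urllib.request

BASE = "https://api.crossref.org/works"

def paginate(get_page):
    """Drive a Crossref-style deep-paging loop: start at cursor '*',
    follow 'next-cursor' until an empty page (or no cursor) comes back."""
    cursor = "*"
    while True:
        message = get_page(cursor)
        items = message.get("items", [])
        if not items:
            return
        yield from items
        cursor = message.get("next-cursor")
        if cursor is None:
            return

def crossref_page(member: str, start: str, end: str):
    """Build a page-fetching callable for one monthly batch."""
    def get_page(cursor):
        params = {
            "filter": (f"member:{member},from-created-date:{start},"
                       f"until-created-date:{end}"),
            "select": "DOI,created,alternative-id",
            "rows": "1000",
            "cursor": cursor,
        }
        url = BASE + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)["message"]
    return get_page

# usage:
# for item in paginate(crossref_page("78", "2020-01-01", "2020-01-31")):
#     print(item["DOI"])
```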

This gives me a list of all alternative IDs (like S0960982219315106) of Elsevier articles. With such an ID I can build the URL of the article's abstract page: https://www.sciencedirect.com/science/article/abs/pii/S0960982219315106
Getting HTML Abstract Page
Using the Pentaho HTTP Client step, it's now possible to fetch the HTML of this abstract page.

It's important to send a browser-like User-Agent header such as "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5)", otherwise the Pentaho client will be redirected to an error page.
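Outside of PDI, the same request looks like this in Python — a sketch only, with the User-Agent string taken from above (real scraping should also throttle requests):

```python
import urllib.request

def abstract_url(pii: str) -> str:
    """Build the ScienceDirect abstract-page URL for one Elsevier PII."""
    return f"https://www.sciencedirect.com/science/article/abs/pii/{pii}"

def fetch_abstract_html(pii: str) -> str:
    """Fetch one abstract page. Without a browser-like User-Agent the
    request gets redirected to an error page, so we set one explicitly."""
    req = urllib.request.Request(abstract_url(pii), headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5)",
    })
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")
```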

Extracting data from abstract HTML
Elsevier makes it very easy to extract information about an article without digging deep into the HTML structure of ScienceDirect. Within the HTML there's a whole section of structured data in JSON:

With the whole HTML loaded into one field, the following regex can be used to extract the JSON block:
*<script type="application/json" data-iso-key="_0">(.*)</script> <iframe style="display: none.*
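The same extraction in Python, using a non-greedy variant of the pattern above. The trailing iframe anchor is then not needed, assuming the JSON itself contains no literal `</script>` (which would be escaped inside JSON strings anyway):

```python
import re

# Non-greedy match for the JSON block embedded in the page,
# anchored on the data-iso-key script tag.
ISO_JSON_RE = re.compile(
    r'<script type="application/json" data-iso-key="_0">(.*?)</script>',
    re.DOTALL,
)

def extract_iso_json(html: str) -> str:
    """Return the raw JSON string embedded in a ScienceDirect page."""
    match = ISO_JSON_RE.search(html)
    if match is None:
        raise ValueError("no data-iso-key JSON block found")
    return match.group(1)
```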

Extracting data from JSON
With the JSON data block separated from the rest of the HTML, we can now pass it to the JSON input step. There is plenty of information in it, much more than you would see via the front end.

For the moment I’m mostly interested in the author and affiliation data:

Extracting Affiliations
In the JSON we can extract:
- whole affiliation section
- individual affiliation
- id, label, name (textfn) of affiliation
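The JSONPath expressions used in this post suggest an XML-as-JSON convention where `#name` holds the element name, `$` the attributes, `$$` the child nodes and `_` the text content. Under that assumption, the affiliation extraction can be sketched in plain Python:

```python
def find_nodes(node, name):
    """Recursively collect XML-as-JSON nodes whose '#name' equals `name`.
    Assumed convention: '#name' = element name, '$' = attributes,
    '$$' = children, '_' = text content."""
    found = []
    if isinstance(node, dict):
        if node.get("#name") == name:
            found.append(node)
        for value in node.values():
            found.extend(find_nodes(value, name))
    elif isinstance(node, list):
        for value in node:
            found.extend(find_nodes(value, name))
    return found

def affiliations(doc):
    """Return (id, label, name) for each affiliation node in the document."""
    rows = []
    for aff in find_nodes(doc, "affiliation"):
        attrs = aff.get("$", {})
        children = aff.get("$$", [])
        label = next((c.get("_") for c in children
                      if c.get("#name") == "label"), None)
        textfn = next((c.get("_") for c in children
                       if c.get("#name") == "textfn"), None)
        rows.append((attrs.get("id"), label, textfn))
    return rows
```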

Extracting Authors
Extracting the authors works similarly to the affiliations.
- whole author section: $..[?(@.#name=='author')]
- individual author: $.$.id ; $.$.orcid ; $..$$.[?(@.#name=='given-name')]._
- label: $.$$.[?(@.#name=='cross-ref')].$$.[?(@.#name=='sup')]._
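Applied to a single author node, the same assumed convention (`$` = attributes, `$$` = children, `_` = text, with the label sitting in a sup child of a cross-ref child, as the paths above suggest) gives a sketch like:

```python
def author_row(author):
    """Flatten one XML-as-JSON author node into
    (id, orcid, given_name, labels)."""
    attrs = author.get("$", {})
    children = author.get("$$", [])
    given = next((c.get("_") for c in children
                  if c.get("#name") == "given-name"), None)
    # labels live in <sup> children of <cross-ref> children
    labels = [sup.get("_")
              for c in children if c.get("#name") == "cross-ref"
              for sup in c.get("$$", []) if sup.get("#name") == "sup"]
    return attrs.get("id"), attrs.get("orcid"), given, labels
```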

An author can have multiple labels/refids, emails and contributor roles. In order to get those flattened into one row, separated by a comma, a group step can be used; the result is brought back into the stream with the "Stream lookup" step:
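In Python, that group step amounts to collecting the values per author and joining them with a comma. A sketch assuming hypothetical (author_id, label) rows:

```python
from collections import defaultdict

def concat_per_author(rows):
    """Collapse repeated (author_id, label) rows into one
    comma-separated label string per author, like a PDI group-by."""
    grouped = defaultdict(list)
    for author_id, label in rows:
        grouped[author_id].append(label)
    return {aid: ",".join(labels) for aid, labels in grouped.items()}
```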

Merge affiliations and authors
In a third step, authors and affiliations are merged by the label:

in order to get a list of authors with their affiliations.
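The merge can be sketched as a lookup of each author's labels against an affiliation dictionary; the data shapes here are hypothetical, standing in for the two PDI streams:

```python
def merge_authors_affiliations(authors, affiliations):
    """Join authors to affiliations on the shared label, analogous to
    merging the two streams in PDI. `authors` is a list of
    (name, [labels]); `affiliations` maps label -> affiliation name."""
    return [(name, [affiliations.get(label) for label in labels])
            for name, labels in authors]
```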
