Extract PDF Content with Tika

The recently released Apache Hop 1.1 version comes with a very handy new transform step, that allows you to extract the metadata and content of different fileformats like PDF using Apache Tika.

The transform step is available in a pipeline:

It allows you to output the content as Plain text, XML, HTML and JSON:

Result:

Format Dates with Full Month

Given the following JSON:

{ "Available online": "29 January 2020",
“Received": "8 November 2019", "Revised": [ "16 January 2020" ], "Accepted": "23 January 2020", "Publication date": "1 May 2020" }

The Date can be captured with dd MMMM yyyy:

Scraping Elsevier abstract pages

Get a list of Elsevier’s IDs via Crossref

To get a first impression about the content and size I was using the Crossref Rest API facet search:

https://api.crossref.org/works?filter=member:78&facet=published:*

So for 2020 there are 542k entries. In order to download metadata data (or in my case just the DOI, date of DOI registration and the Elsevier internal ID) I’m using again the REST API of Crossref.

Actually a Crossref query is limited to provide max 1000 DOIs. Using cursors, it’s however to possible to loop further and get about 100k DOIs before the API times out. So in order to get all publications from 2020 I created a monthly batch based on the created date like:

https://api.crossref.org/works?filter=member:78,from-created-date:2020-01-01,until-created-date:2020-01-31&select=DOI,created,alternative-id,&rows=1000&cursor=*

In PDI I’ve created a job, that handles the cursor and repeats the transformation with the REST Query as long there is a new cursor coming back from Crossref.

So I get a list with all alternative IDs (like: S0960982219315106) from Elsevier articles. With that I can create an URL to all abstract pages of this article: https://www.sciencedirect.com/science/article/abs/pii/S0960982219315106

Getting HTML Abstract Page

Using the Pentaho HTTP Client, it’s now possible to get the HTML of the this abstract page.

https://www.sciencedirect.com/science/article/abs/pii/S0960982219315106

It’s important that there’s a HTTP-Header like: “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5)”, otherwise the Pentaho Client will be routed to a error page.

Extracting data from abstract HTML

Elsevier makes it very easy to extract information about the article without going deep into the HTML-Structure of ScienceDirect. Within the HTML there’s a whole section with structured data in JSON:

view-source:https://www.sciencedirect.com/science/article/abs/pii/S0960982219315106

Loading the whole HTML in one field, the following Regex can be used to extract the json block:

*<script type="application/json" data-iso-key="_0">(.*)</script>
<iframe style="display: none.*

Pentaho Step: Regex evaluation (Content Tab: Activate “Enable dotall mode” & “Enable multiline mode”)

Extracting data from JSON

Having the JSON data block separated from the rest of the HTML, we now can pass it to the JSON-input step. There is really plenty of information and much more than you would see via front end.

Exploring Json-Data and getting JSON-Path via https://jsonpathfinder.com/

For the moment I’m mostly interested in the author and affiliation data:

Extracting Affiliations

In the JSON we can extract:

whole affiliation section
individual affiliation
id, label, name (textfn) of affiliation

Extracting Authors

extracting the authors works similar like the affiliation.

whole author section: $..[?(@.#name==’author’)]
individual author: $.$.id ; $.$.orcid ; $..$$.[?(@.#name==’given-name’)]._
label: $.$$.[?(@.#name==’cross-ref’)].$$.[?(@.#name==’sup’)]._

An author can have multiple labels/refids, emails, contributor-roles. In order to get those flat in a row separated by a coma, a group step can be used, while getting it back into the stream using the “stream lookup”-step:

Merge affiliations and authors

In a third step authors and affiliations are merged by the the label:

in order to get a list with authors and affiliations

Extracted authors and affiliations from Elsevier articles

Add a sub-transformation with step “transformation executor”

You can bundle a couple of steps as a transformation and call those steps in another transformation.

My scenario: determine publication date

I often use the Crossref Rest API to get information about publications. Depending on the publisher there are different kind of dates associated with a DOI and the dates can have different resolutions. Sometimes just a year or a year and a month.

Get publications dates from different DOIs (using REST and JSON) – Simple_Rest_Query_Crossref.ktr

In order to get always a specific publication date with the resolution YYYY.mm.dd I use a couple of steps and logic to determine the “relevant” publication date out from those different date fields.

Adding a couple of steps to determine “publication_date”

To reuse those steps in different transformations without copying each time all these steps I can now save those steps as own transformation. Let’s add a “Get rows from result” and “Copy rows to result” at the beginning and and end this sub-transformation.

subtransformation – Publication_Date_Sub.ktr

Then we can add a “Transformation executor” step in the main transformation. In this step we add the expected “fields” of the sub-transformation in the tab “Results row”

Adding a “transformation executor”-Step in the main transformation – Publication_Date_Main.ktr

As output of a “transformation executor” step there are several options available:

Output-Options of “transformation executor”-Step

There seems to be no option to get the results and pass through the input steps data for the same rows. Probably since the output of the sub-transformation can have more more or less rows than the input. Yet we can create a work-around by going on the the input data and add the results of the sub-transformation with a common (presorted) identifier. At the end we have the original data and the result of the sub-transformation combined.

JSON-Input

Pentaho Data Integration (PDI) offers the input step “JSON-Input” to read out data from a JSON file or stream. Often I use this step after a REST-API-Query, so I would have the JSON-Input as a field from a previous step.

In order to test the field-extraction, it’s helpful to save some local samples of the possible responses. In the tab “File” you first can do your tests with the local file and switch later to “Source is from a previous step”.

Since recently Pentaho offers an “internal helper” to select the fields. However it unfortunately does not to work for most of my use-cases. Instead I found http://jsonpathfinder.com/ very useful.

Get JSON Path to the data you want to extract via *http://jsonpathfinder.com/*

Then add the fields with the corresponding path in PDI:

Extracting various fields from the JSON response of the Crossref REST API (e.g. http://api.crossref.org/works/10.1002/2016gl068428)

Import from several Excel-Files (XLSX)

To merge several similar Excel Files in one stream, you can read out a directory with the the wildcard:

.*\.xlsx

Microsoft Excel Input – Multiple files within a directory

For large XLSX Files use the “Apache POI Streaming” engine (not “Apache POI”)