Split field to rows with lookahead regex

Given an HTML text with alternating paragraphs, where the second paragraph actually belongs to the first one:

<p>Simon Burger</p>
<p><a href="https://www.simonburger.com">Website SB</a></p>
<p>Petra Ohnsorg</p>
<p><a href="https://www.petraohnsorg.com">Website PO</a></p>

I want to split the text so that there is one row for each person. This can be achieved by using a regex with a negative lookahead:

<p>(?!<)

This splits only at a <p> that is not directly followed by “<”, i.e. at the paragraphs that start with a name and not at those that start with a link.

The fields of the rows created this way can then be retrieved by using another regex.
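The split can be tried outside Pentaho with a minimal sketch in plain JavaScript (Node.js). The split regex is the one from above; the second regex (FIELDS) and the function names are only assumptions for illustration, not the regex used in the original transformation.

```javascript
// Sample text from above, joined into one field value.
const html = [
  "<p>Simon Burger</p>",
  '<p><a href="https://www.simonburger.com">Website SB</a></p>',
  "<p>Petra Ohnsorg</p>",
  '<p><a href="https://www.petraohnsorg.com">Website PO</a></p>',
].join("\n");

// Split at every <p> that is NOT directly followed by "<".
function splitPersons(text) {
  return text
    .split(/<p>(?!<)/)
    .map((row) => row.trim())
    .filter((row) => row.length > 0); // drop the empty piece before the first <p>
}

// Hypothetical second regex: name, link target and link text of one row.
const FIELDS = /^(.*?)<\/p>\s*<p><a href="([^"]*)">(.*?)<\/a><\/p>$/;

function parsePerson(row) {
  const m = FIELDS.exec(row);
  return m ? { name: m[1], url: m[2], label: m[3] } : null;
}

const persons = splitPersons(html).map(parsePerson);
console.log(persons);
```

Each person ends up in one row, with the name and the link as separate fields.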

Split first and lastname

To split a name into first and last name, I found the following simple regex to work in most cases:
(.*?)([^\s]*)$

You still have to check names with more than 3 words manually.
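A minimal sketch of how the two groups behave, in plain JavaScript rather than in the Pentaho step. The leading ^ is added here to mimic a whole-field match (an assumption about how the regex is applied), and splitName is a hypothetical helper name.

```javascript
// Group 1 is lazy, group 2 takes the last run of non-whitespace characters.
const NAME = /^(.*?)([^\s]*)$/;

function splitName(fullName) {
  const m = NAME.exec(fullName.trim());
  if (m === null) return null; // e.g. a value containing a line break
  return { first: m[1].trim(), last: m[2] };
}

console.log(splitName("Simon Burger"));
console.log(splitName("Ludwig van Beethoven"));
```

In the second call only the last word lands in the lastname group, which is why longer names need a manual check.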

LASTNAME in capitals + no separation

In another case I came across a text line where the lastname was in capitals, but the firstname was not easily separable from the following words.

I found that the following regex (which already includes the exceptions found in the existing data) works in most of my cases:

^([A-ZÀÂÄÆÁÃÅĀÈÉÊËĘĖĒÎÏĪĮÍÌÔŌØÕÓÒÖŒÙÛÜŪÚŸÇĆČŃÑ\-'de]{2,20}\s(?:[A-ZÀÂÄÆÁÃÅĀÈÉÊËĘĖĒÎÏĪĮÍÌÔŌØÕÓÒÖŒÙÛÜŪÚŸÇĆČŃÑ\-']{2,15})?)\s*?([^\s]+\s(?:Huy|Christine|Flora|Deborah|Gösta)?)(.*)
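To see what the three groups capture, here is a small sketch in plain JavaScript. The regex is copied unchanged from above; the sample lines and the helper name splitLine are made up for illustration and are not taken from the original data.

```javascript
// Group 1: lastname in capitals, group 2: firstname (plus a listed second
// firstname), group 3: the rest of the line.
const LINE = /^([A-ZÀÂÄÆÁÃÅĀÈÉÊËĘĖĒÎÏĪĮÍÌÔŌØÕÓÒÖŒÙÛÜŪÚŸÇĆČŃÑ\-'de]{2,20}\s(?:[A-ZÀÂÄÆÁÃÅĀÈÉÊËĘĖĒÎÏĪĮÍÌÔŌØÕÓÒÖŒÙÛÜŪÚŸÇĆČŃÑ\-']{2,15})?)\s*?([^\s]+\s(?:Huy|Christine|Flora|Deborah|Gösta)?)(.*)/;

function splitLine(line) {
  const m = LINE.exec(line);
  if (m === null) return null; // line does not start with a capitalised lastname
  return { last: m[1].trim(), first: m[2].trim(), rest: m[3].trim() };
}

// Hypothetical sample lines.
console.log(splitLine("DUPONT Marie Université de Paris"));
console.log(splitLine("NGUYEN Van Huy Some Institute"));
```

The second line shows the effect of the exception list: "Huy" is kept as part of the firstname instead of falling into the rest.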

Repeat/Fill value of previous row

Sometimes you want to fill empty values in a column with the last value of that column that is not null. For example, I want to fill rows 44-50 with the event_date “26.06.2023” and rows 53-54 with “28.08.2023”.

For that, a simple JavaScript can be used:

// Declared without an initializer, so the value set for an earlier row is kept.
var event_date_new;
if (event_date !== null) {
  event_date_new = event_date;
}

This results in a new column with all dates filled.
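The same fill-down logic can be sketched as plain JavaScript over an array of rows, which makes it easy to test outside Pentaho. The function name fillDown and the row objects are assumptions for illustration; only the two dates come from the example above.

```javascript
// Carry the last non-null value of `field` forward into `<field>_new`.
function fillDown(rows, field) {
  let last = null;
  return rows.map((row) => {
    if (row[field] !== null && row[field] !== undefined) {
      last = row[field];
    }
    return { ...row, [field + "_new"]: last };
  });
}

const rows = [
  { event_date: "26.06.2023" },
  { event_date: null },
  { event_date: null },
  { event_date: "28.08.2023" },
  { event_date: null },
];

console.log(fillDown(rows, "event_date").map((r) => r.event_date_new));
```

Rows before the first non-null value stay null, since there is nothing to carry forward yet.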

Initial capital of a sentence

A sentence like:

neue Technologien zur Verkehrsoptimierung

should get an initial capital, like:

Neue Technologien zur Verkehrsoptimierung

This can be achieved in Pentaho / Apache Hop with a Java expression:

text == null || text.isEmpty() ? text : text.substring(0,1).toUpperCase() + text.substring(1)

from: https://forums.pentaho.com/threads/218295-Capitalize-first-letter-of-a-string/

Scraping Elsevier abstract pages

Get a list of Elsevier’s IDs via Crossref

To get a first impression of the content and size, I used the facet search of the Crossref REST API:

https://api.crossref.org/works?filter=member:78&facet=published:*

So for 2020 there are 542k entries. To download the metadata (in my case just the DOI, the date of DOI registration and the Elsevier internal ID), I again use the Crossref REST API.

A single Crossref query returns at most 1000 DOIs. Using cursors, it is however possible to loop further and get about 100k DOIs before the API times out. So in order to get all publications from 2020, I created monthly batches based on the created date, like:

https://api.crossref.org/works?filter=member:78,from-created-date:2020-01-01,until-created-date:2020-01-31&select=DOI,created,alternative-id&rows=1000&cursor=*

In PDI I created a job that handles the cursor and repeats the transformation with the REST query as long as a new cursor comes back from Crossref.
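Outside PDI, the cursor loop can be sketched in plain JavaScript. This is only an illustration of the paging logic, not the PDI job itself: the function names are hypothetical, the HTTP call is passed in as a parameter so the loop can be tried without network access, and the stop condition (an empty page of items) is an assumption about how the end of a batch is detected.

```javascript
const BASE = "https://api.crossref.org/works";

// Build the query URL for one monthly batch and one cursor value.
function buildUrl(from, until, cursor) {
  const params = new URLSearchParams({
    filter: `member:78,from-created-date:${from},until-created-date:${until}`,
    select: "DOI,created,alternative-id",
    rows: "1000",
    cursor,
  });
  return `${BASE}?${params}`;
}

// Follow "next-cursor" until a page comes back without items.
async function fetchAll(from, until, fetchJson) {
  const items = [];
  let cursor = "*";
  while (true) {
    const body = await fetchJson(buildUrl(from, until, cursor));
    const page = body.message.items;
    if (!page || page.length === 0) break;
    items.push(...page);
    cursor = body.message["next-cursor"];
    if (!cursor) break;
  }
  return items;
}

// Default fetcher based on the global fetch of Node.js 18+.
const httpJson = (url) => fetch(url).then((res) => res.json());

// Usage (performs real requests):
// fetchAll("2020-01-01", "2020-01-31", httpJson).then((items) => console.log(items.length));
```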

This gives me a list of all alternative IDs (like: S0960982219315106) of Elsevier articles. With such an ID I can build the URL of the abstract page of the article: https://www.sciencedirect.com/science/article/abs/pii/S0960982219315106

Getting HTML Abstract Page

Using the Pentaho HTTP Client, it’s now possible to get the HTML of this abstract page.

https://www.sciencedirect.com/science/article/abs/pii/S0960982219315106

It’s important to send an HTTP header (User-Agent) like “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5)”, otherwise the Pentaho client will be routed to an error page.
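For comparison, the same request can be sketched in plain JavaScript (Node.js 18+ with its global fetch). The function names are hypothetical, and whether ScienceDirect accepts this exact header value today is not guaranteed; the sketch only shows where the URL and the User-Agent header go.

```javascript
const USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5)";

// Build the abstract page URL from the alternative ID delivered by Crossref.
function abstractUrl(pii) {
  return "https://www.sciencedirect.com/science/article/abs/pii/" + encodeURIComponent(pii);
}

// Request options carrying the browser-like User-Agent header.
function requestOptions() {
  return { headers: { "User-Agent": USER_AGENT } };
}

async function fetchAbstractHtml(pii) {
  const res = await fetch(abstractUrl(pii), requestOptions());
  if (!res.ok) {
    throw new Error("HTTP " + res.status + " for " + pii);
  }
  return res.text(); // the whole HTML as one string
}

// Usage (performs a real request):
// fetchAbstractHtml("S0960982219315106").then((html) => console.log(html.length));
```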

Extracting data from abstract HTML

Elsevier makes it very easy to extract information about the article without going deep into the HTML-Structure of ScienceDirect. Within the HTML there’s a whole section with structured data in JSON:

view-source:https://www.sciencedirect.com/science/article/abs/pii/S0960982219315106

With the whole HTML loaded into one field, the following regex can be used to extract the JSON block:

.*<script type="application/json" data-iso-key="_0">(.*)</script>
<iframe style="display: none.*

Pentaho Step: Regex evaluation (Content Tab: Activate “Enable dotall mode” & “Enable multiline mode”)
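A variant of this extraction in plain JavaScript is sketched below. It differs from the Pentaho regex: instead of anchoring on the following iframe, it stops lazily at the first closing script tag, which assumes that the JSON itself does not contain that tag. The sample HTML is made up and heavily shortened.

```javascript
// [\s\S] matches any character including line breaks (like dotall mode).
const JSON_BLOCK = /<script type="application\/json" data-iso-key="_0">([\s\S]*?)<\/script>/;

function extractJsonBlock(html) {
  const m = JSON_BLOCK.exec(html);
  return m ? JSON.parse(m[1]) : null;
}

// Hypothetical page excerpt.
const sampleHtml = [
  "<html><body>",
  '<script type="application/json" data-iso-key="_0">{"article":{"pii":"S0960982219315106"}}</script>',
  '<iframe style="display: none"></iframe>',
  "</body></html>",
].join("\n");

console.log(extractJsonBlock(sampleHtml));
```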

Extracting data from JSON

Having the JSON data block separated from the rest of the HTML, we can now pass it to the JSON input step. There is really plenty of information, much more than you would see via the front end.

The JSON data can be explored, and the JSON paths looked up, via https://jsonpathfinder.com/

For the moment I’m mostly interested in the author and affiliation data.

Extracting Affiliations

In the JSON we can extract:

whole affiliation section
individual affiliation
id, label, name (textfn) of affiliation

Extracting Authors

Extracting the authors works similarly to the affiliations.

whole author section: $..[?(@.#name=='author')]
individual author: $.$.id ; $.$.orcid ; $..$$.[?(@.#name=='given-name')]._
label: $.$$.[?(@.#name=='cross-ref')].$$.[?(@.#name=='sup')]._

An author can have multiple labels/refids, emails and contributor roles. To get those flat in one row, separated by commas, a “Group by” step can be used; the result is brought back into the stream using the “Stream lookup” step.
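What the JSON paths and the flattening do can be sketched in plain JavaScript, without a JSONPath library. The node layout ("#name" for the element name, "$" for attributes, "$$" for children, "_" for text) is inferred from the paths above; the sample author, its ID and its ORCID are invented, and the function names are hypothetical.

```javascript
// Collect all nodes with the given "#name", at any depth.
function findByName(node, name, found = []) {
  if (Array.isArray(node)) {
    node.forEach((child) => findByName(child, name, found));
  } else if (node !== null && typeof node === "object") {
    if (node["#name"] === name) found.push(node);
    Object.values(node).forEach((child) => findByName(child, name, found));
  }
  return found;
}

// One flat row per author; multiple labels are joined with commas.
function authorRow(author) {
  const attrs = author.$ || {};
  const children = author.$$ || [];
  const given = findByName(children, "given-name")[0];
  const labels = findByName(children, "cross-ref")
    .flatMap((ref) => findByName(ref.$$ || [], "sup"))
    .map((sup) => sup._);
  return {
    id: attrs.id || null,
    orcid: attrs.orcid || null,
    givenName: given ? given._ : null,
    labels: labels.join(", "),
  };
}

// Hypothetical data in the assumed layout.
const sampleJson = {
  authors: {
    content: [
      {
        "#name": "author",
        $: { id: "au1", orcid: "0000-0000-0000-0000" },
        $$: [
          { "#name": "given-name", _: "Simon" },
          { "#name": "cross-ref", $: { refid: "af1" }, $$: [{ "#name": "sup", _: "a" }] },
          { "#name": "cross-ref", $: { refid: "af2" }, $$: [{ "#name": "sup", _: "b" }] },
        ],
      },
    ],
  },
};

console.log(findByName(sampleJson, "author").map(authorRow));
```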

Merge affiliations and authors

In a third step, authors and affiliations are merged by the label, in order to get a list with authors and affiliations.

Extracted authors and affiliations from Elsevier articles
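The merge by label corresponds to a simple lookup join. A minimal sketch in plain JavaScript, with invented authors and affiliations and a hypothetical function name, could look like this:

```javascript
// One output row per (author, label) pair, enriched with the affiliation name.
function mergeByLabel(authors, affiliations) {
  const byLabel = new Map(affiliations.map((aff) => [aff.label, aff]));
  return authors.flatMap((author) =>
    author.labels.map((label) => ({
      author: author.name,
      label,
      affiliation: byLabel.has(label) ? byLabel.get(label).name : null,
    }))
  );
}

// Hypothetical rows as they might come out of the two extraction steps.
const sampleAuthors = [
  { name: "Simon Burger", labels: ["a", "b"] },
  { name: "Petra Ohnsorg", labels: ["b"] },
];
const sampleAffiliations = [
  { label: "a", name: "University A" },
  { label: "b", name: "Institute B" },
];

console.log(mergeByLabel(sampleAuthors, sampleAffiliations));
```

An author with two labels yields two rows, one per affiliation.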

Replace string with JavaScript

As an alternative to the “Replace in string” step, the “Modified JavaScript value” step can be used to replace something in a string, e.g. here to replace a carriage return with a white space:

var Author = replace(Author_seperated, "\r", " ");

Maximum of several numbers using JavaScript

Using the “Modified JavaScript value” step you can call JavaScript functions like Math.max().
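A short sketch of such a call in plain JavaScript follows. The helper name maxOf and the sample values are assumptions; the helper filters out null values first, because Math.max() treats null as 0, which would distort the result for negative numbers.

```javascript
// Maximum of the numeric values; null, undefined and NaN are ignored.
function maxOf(values) {
  const numbers = values.filter((v) => typeof v === "number" && !Number.isNaN(v));
  return numbers.length > 0 ? Math.max(...numbers) : null;
}

console.log(Math.max(3, 17, 9));
console.log(maxOf([-5, null, -2]));
```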

Extract email from website

How can the email of the corresponding author of a publication, like https://doi.org/10.1039/C7CS00709D, be extracted with Pentaho Data Integration?

https://doi.org/10.1039/C7CS00709D as rendered HTML

https://doi.org/10.1039/C7CS00709D (excerpt of HTML source code)

Get the HTML of the publications via REST Step, store it in one field.
Extract email via “Regex evaluation” step using the Regex
.*mailto:([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+).*
with the step options:
- Enable dotall mode
- Enable multiline mode

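The captured email is put into the field email. Note that the leading .* is greedy, so with several mailto: links in the HTML the regex captures the last one; start the regex with .*? instead to capture the first one.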
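Which address is captured depends on whether the leading wildcard is greedy or lazy. The following sketch in plain JavaScript compares both on a made-up HTML excerpt; the addresses, variable names and the helper capture are hypothetical, and [\s\S] stands in for "." with dotall mode enabled.

```javascript
const ADDRESS = "([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\\.[a-zA-Z0-9_-]+)";

// Whole-string match, once with a greedy and once with a lazy prefix.
const GREEDY = new RegExp("^[\\s\\S]*mailto:" + ADDRESS + "[\\s\\S]*$");
const LAZY = new RegExp("^[\\s\\S]*?mailto:" + ADDRESS + "[\\s\\S]*$");

function capture(regex, html) {
  const m = regex.exec(html);
  return m ? m[1] : null;
}

// Hypothetical HTML excerpt with two mailto links.
const page =
  '<a href="mailto:first.author@example.org">A</a>\n' +
  '<a href="mailto:second.author@example.org">B</a>';

console.log(capture(LAZY, page)); // first address in the page
console.log(capture(GREEDY, page)); // last address in the page
```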

Alternatively, the online service https://www.convertcsv.com/email-extractor.htm also offers a nice way to extract emails from several websites.

Regex Evaluation – Arxiv ID

You can extract a pattern as an additional field with Pentaho using the “Regex evaluation” step.

Example to extract the Arxiv-ID with this Regex: .*(\d{4}\.\d{4,5}).*

The matched value will be added as a new field to the stream.
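The capture group can be checked quickly in plain JavaScript. In this sketch the surrounding .* parts are dropped, because exec() searches within the string instead of matching the whole field; the function name and the sample inputs are assumptions.

```javascript
// Four digits, a dot, then four or five digits.
const ARXIV_ID = /(\d{4}\.\d{4,5})/;

function arxivId(text) {
  const m = ARXIV_ID.exec(text);
  return m ? m[1] : null;
}

console.log(arxivId("arXiv:2006.12345v2"));
console.log(arxivId("https://arxiv.org/abs/1706.03762"));
```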

Execute a shell script

To trigger a shell script or a terminal command after a transformation, you have to create a job (it’s not available in a transformation). In the following scenario I wanted to transform an HTML file to XML using tidy.

So I define a job in which the file is created, and use the step “Execute a shell script…”.

Then, in the next tab “Script”, enter the tidy command:

tidy -asxhtml -numeric < file_old.html > file_new.xml

Provided tidy hasn’t failed, the “file_old.html” has been converted to “file_new.xml” in your job directory.