Split field to rows with lookahead regex

Given a HTML Text with alternating paragragraphs, where the second line actually belongs to a the first one:

<p>Simon Burger</p>
<p><a href="https://www.simonburger.com">Website SB</a></p>
<p>Petra Ohnsorg</p>
<p><a href="https://www.petraohnsorg.com">Website PO</a></p>

I want to split the text, so there’s a line for each person. This can be achieve by using a regex with a negative lookahead:

<p>(?!<)

This only splits the first paragraph text when there is no following “<“

The fields of the so created rows can than be retrieved by using another regex:

Split first and lastname

For splitting a Name in First and Lastname I found the following simple Regex working in most cases:
(.*?)([^\s]*)$

Indeed you have to evaluate manually if you have names with more than 3 words

LASTNAME in Capitals + no speparation

In another case I came across a textline, where the firstname was in Capital, however the firstname was not easy seperable by the following words.

I found the the following Regex (including already the exceptions of the existing data) would work in most my cases:

^([A-ZÀÂÄÆÁÃÅĀÈÉÊËĘĖĒÎÏĪĮÍÌÔŌØÕÓÒÖŒÙÛÜŪÚŸÇĆČŃÑ\-'de]{2,20}\s(?:[A-ZÀÂÄÆÁÃÅĀÈÉÊËĘĖĒÎÏĪĮÍÌÔŌØÕÓÒÖŒÙÛÜŪÚŸÇĆČŃÑ\-']{2,15})?)\s*?([^\s]+\s(?:Huy|Christine|Flora|Deborah|Gösta)?)(.*)

Repeat/Fill value of previous row

Sometimes you want to fil empty values in a row, with the last occurence of the colum, that is not null. Eg in the following example I want to fill rows 44-50 with the event_date “26.06.2023” an rows 53-54 with “28.08.2023”.

For that a simple java script can be used:

var event_date_new; if (event_date !== null) {  event_date_new = event_date;}

This results in a new column with all dates filled

Extract email from website

How to extract the email of the corresponding author of a publication, like: https://doi.org/10.1039/C7CS00709D with Pentaho Data integration?

https://doi.org/10.1039/C7CS00709D as rendered HTML
https://doi.org/10.1039/C7CS00709D (excerpt of HTML source code)
  1. Get the HTML of the publications via REST Step, store it in one field.
  2. Extract email via “Regex evaluation” step using the Regex
    .*mailto:([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+).*
    with the step options:
    • Enable dotall mode
    • Enable multiline mode

The first email appearing in the HTML will put into the filed email.

Alternatively the Online Service https://www.convertcsv.com/email-extractor.htm also provides a nice possibility to extract emails from several websites:

Execute a shell script

To trigger a shell script or a terminal command after a transformation, you have to create a job (it’s not available in a transformation). In the following scenario I wanted to transform a HTML-File to XML using tidy.

So I define a job, where the file is created and use the step “Execute a shell script…”

then enter the tidy command:

tidy -asxhtml -numeric < file_old.html > file_new.xml

in the next tab “Script”:

Simple terminal command

Provided tidy hasn’t failed, the “file_old.html” has been converted to “file_new.xml” in your job directory.