It’s a common task to convert dates found in external sources to the PDI/Apache Hop date format. While PDI/Apache Hop offers some typical conversion masks in a drop-down, you may need to create your own. Here’s a cheat sheet on how to use the masking parameters:
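The masks follow Java’s java.text.SimpleDateFormat patterns, so a custom mask can be tested in plain Java before putting it into a step. A minimal sketch, with the input format and values made up for illustration:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class DateMaskDemo {
    public static void main(String[] args) throws Exception {
        // yyyy = year, MM = month, dd = day, HH = hour (0-23), mm = minute, ss = second
        SimpleDateFormat external = new SimpleDateFormat("dd.MM.yyyy HH:mm:ss", Locale.ENGLISH);
        Date parsed = external.parse("24.12.2021 18:30:00");

        // MMM = abbreviated month name, EEE = abbreviated day name
        SimpleDateFormat mask = new SimpleDateFormat("EEE, dd MMM yyyy", Locale.ENGLISH);
        System.out.println(mask.format(parsed)); // Fri, 24 Dec 2021
    }
}
```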
The recently released Apache Hop 1.1 comes with a very handy new transform that allows you to extract the metadata and content of different file formats, like PDF, using Apache Tika.
The transform is available in a pipeline:
Defining Input
It allows you to output the content as Plain text, XML, HTML and JSON:
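Under the hood the transform relies on Apache Tika, so the same extraction can be sketched in standalone Java (the file name is a placeholder):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaDemo {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(); // collects the plain text
        Metadata metadata = new Metadata();

        try (InputStream stream = new FileInputStream("example.pdf")) { // placeholder file
            parser.parse(stream, handler, metadata, new ParseContext());
        }

        // Extracted metadata fields (title, author, ...) and the plain-text content
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
        System.out.println(handler.toString());
    }
}
```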
The “Workflow executor” transform in an Apache Hop pipeline allows you to execute the same workflow for several rows. Values from those rows can be passed down to the workflow as parameters, e.g.:
Parent Workflow:
Sub-Workflow (executed for each row), using parameters
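Conceptually, the executor behaves like the following minimal Java sketch, where runWorkflow is a purely hypothetical stand-in for the actual transform and the parameter names are made up:

```java
import java.util.List;
import java.util.Map;

public class WorkflowExecutorSketch {
    // Hypothetical helper standing in for the "Workflow executor" transform:
    // it would start the sub-workflow once with the given parameter values.
    static void runWorkflow(String workflowFile, Map<String, String> parameters) {
        System.out.printf("Executing %s with %s%n", workflowFile, parameters);
    }

    public static void main(String[] args) {
        // Each incoming row becomes one execution of the sub-workflow,
        // with row values mapped to workflow parameters.
        List<Map<String, String>> rows = List.of(
                Map.of("COUNTRY", "PL"),
                Map.of("COUNTRY", "DE"));

        for (Map<String, String> row : rows) {
            runWorkflow("sub-workflow.hwf", row); // file name hypothetical
        }
    }
}
```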
Currently, PDI and Apache Hop do not run natively on the M1. There are reports of it working with the ARM-optimised Azul JDK, but this was not working for me, as the newest SWT library is not yet ARM compatible.
Instead, the following procedure worked for me on my Apple MacBook M1 (1st gen).
In order to insert/update UTF-8 data from a transformation into a DB field, the JDBC driver has to be explicitly configured with characterEncoding=UTF-8.
This then also works with Polish diacritics like “Młotkowski”.
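For illustration, a minimal JDBC sketch, assuming a MySQL database (where characterEncoding is a Connector/J URL parameter); host, schema, table and credentials are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class Utf8InsertDemo {
    public static void main(String[] args) throws Exception {
        // characterEncoding=UTF-8 makes the driver transfer strings as UTF-8;
        // host, database and credentials are placeholders.
        String url = "jdbc:mysql://localhost:3306/testdb?characterEncoding=UTF-8";

        try (Connection con = DriverManager.getConnection(url, "user", "secret");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO authors (name) VALUES (?)")) {
            ps.setString(1, "Młotkowski"); // Polish diacritics survive the round trip
            ps.executeUpdate();
        }
    }
}
```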
So for 2020 there are 542k entries. In order to download the metadata (or in my case just the DOI, the date of DOI registration and the Elsevier internal ID), I’m again using the Crossref REST API.
A Crossref query is limited to a maximum of 1,000 DOIs per request. Using cursors, however, it is possible to loop further and fetch about 100k DOIs before the API times out. So in order to get all publications from 2020, I created monthly batches based on the created date, like:
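A query for January 2020 might look like the following, assuming Elsevier’s Crossref member ID (78); cursor=* starts a new deep-paging session:

```
https://api.crossref.org/works?filter=member:78,from-created-date:2020-01-01,until-created-date:2020-01-31&rows=1000&cursor=*
```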
In PDI I’ve created a job that handles the cursor and repeats the transformation with the REST query as long as there is a new cursor coming back from Crossref.
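Outside of PDI, that cursor loop can be sketched in plain Java. The filter is the one from above, DOI extraction is omitted, and the naive string matching stands in for proper JSON parsing:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrossrefCursorLoop {
    // Naive extraction of "next-cursor" from the JSON response;
    // a real job would use a JSON parser instead.
    private static final Pattern NEXT_CURSOR =
            Pattern.compile("\"next-cursor\"\\s*:\\s*\"([^\"]+)\"");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String filter = "member:78,from-created-date:2020-01-01,until-created-date:2020-01-31";
        String cursor = "*"; // "*" starts a new deep-paging session

        while (cursor != null) {
            String url = "https://api.crossref.org/works?rows=1000&filter=" + filter
                    + "&cursor=" + URLEncoder.encode(cursor, StandardCharsets.UTF_8);
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
            String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

            // ... extract DOIs from the body here ...

            if (body.contains("\"items\":[]")) break; // last page reached

            Matcher m = NEXT_CURSOR.matcher(body);
            cursor = m.find() ? m.group(1) : null; // stop when no new cursor comes back
        }
    }
}
```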
It’s important to send a User-Agent HTTP header like “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5)”, otherwise the Pentaho client will be routed to an error page.
Extracting data from abstract HTML
Elsevier makes it very easy to extract information about the article without going deep into the HTML structure of ScienceDirect. Within the HTML there’s a whole section with structured data in JSON:
Having separated the JSON data block from the rest of the HTML, we can now pass it to the JSON input step. It contains plenty of information, much more than you would see via the front end.
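For illustration, assuming the structured data sits in a script tag of type “application/json” (the exact tag and attributes are an assumption and may differ between page versions), the block could be cut out like this before feeding it to the JSON input step:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonBlockExtractor {
    public static void main(String[] args) throws Exception {
        String html = Files.readString(Path.of("article.html")); // placeholder file

        // Cut the structured-data block out of the page; the tag and its
        // attributes are an assumption about the page structure.
        Pattern script = Pattern.compile(
                "<script type=\"application/json\"[^>]*>(.*?)</script>",
                Pattern.DOTALL);

        Matcher m = script.matcher(html);
        if (m.find()) {
            System.out.println(m.group(1)); // raw JSON, ready for the JSON input step
        }
    }
}
```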
An author can have multiple labels/refids, emails and contributor roles. In order to get those flattened into one row, separated by commas, a “Group by” step can be used, and the result is merged back into the stream using the “Stream lookup” step:
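What the “Group by” step does here can be illustrated in plain Java: group the rows by the author label and join the values with commas (field names and values are made up for the sketch):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupConcatDemo {
    record AuthorRef(String authorLabel, String refId) {}

    public static void main(String[] args) {
        // One row per author/refid pair, as produced by the JSON input step
        List<AuthorRef> rows = List.of(
                new AuthorRef("a", "aff1"),
                new AuthorRef("a", "aff2"),
                new AuthorRef("b", "aff1"));

        // Equivalent of "Group by" with a comma-separated concatenation:
        Map<String, String> refIdsPerAuthor = rows.stream().collect(
                Collectors.groupingBy(AuthorRef::authorLabel,
                        Collectors.mapping(AuthorRef::refId, Collectors.joining(","))));

        System.out.println(refIdsPerAuthor); // e.g. {a=aff1,aff2, b=aff1}
    }
}
```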
Merge affiliations and authors
In a third step, authors and affiliations are merged by the label in order to get a list of authors and affiliations:
Extracted authors and affiliations from Elsevier articles
When working with HTML you may want to change a string from HTML to the default encoding (and vice versa). Pentaho Data Integration provides that option within the Calculator step:
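The same conversion can be reproduced outside of PDI, e.g. with Apache Commons Text (a library choice for this sketch, not necessarily what the step uses internally):

```java
import org.apache.commons.text.StringEscapeUtils;

public class HtmlEncodingDemo {
    public static void main(String[] args) {
        // HTML -> default encoding: numeric/named entities are decoded
        System.out.println(StringEscapeUtils.unescapeHtml4("M&#322;otkowski")); // Młotkowski

        // default encoding -> HTML: markup characters are escaped
        System.out.println(StringEscapeUtils.escapeHtml4("<b>Młotkowski</b>")); // &lt;b&gt;Młotkowski&lt;/b&gt;
    }
}
```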