Split field to rows with lookahead regex

Given an HTML text with alternating paragraphs, where each second paragraph actually belongs to the one before it:

<p>Simon Burger</p>
<p><a href="https://www.simonburger.com">Website SB</a></p>
<p>Petra Ohnsorg</p>
<p><a href="https://www.petraohnsorg.com">Website PO</a></p>

I want to split the text so that there is one row for each person. This can be achieved by using a regex with a negative lookahead:

<p>(?!<)

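This splits only at a <p> that is not immediately followed by “<”, that is, at the paragraphs that contain a name as plain text. The paragraph with the link stays attached to the name before it.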
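To illustrate the behaviour outside of PDI/Hop, here is a minimal sketch in plain JavaScript that applies the same pattern to the sample text above. The variable names are only for illustration.

```javascript
// Sample input from the text above, joined into one string.
const html = [
  '<p>Simon Burger</p>',
  '<p><a href="https://www.simonburger.com">Website SB</a></p>',
  '<p>Petra Ohnsorg</p>',
  '<p><a href="https://www.petraohnsorg.com">Website PO</a></p>',
].join('\n');

// Split at every <p> that is NOT directly followed by "<".
// The matched <p> itself is consumed as the delimiter.
const rows = html
  .split(/<p>(?!<)/)
  .map((row) => row.trim())
  .filter((row) => row.length > 0); // drop the empty piece before the first <p>

rows.forEach((row, i) => console.log(`row ${i + 1}: ${JSON.stringify(row)}`));
```

Each resulting row starts with the name and still contains the link paragraph of the same person.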

The fields of the rows created this way can then be retrieved by using another regex:

Split first and last name

For splitting a name into first and last name, I found that the following simple regex works in most cases:
(.*?)([^\s]*)$

Of course, you still have to check the result manually if you have names with more than 3 words, because only the last word ends up as the last name.
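A small sketch of how the two capture groups behave, written in plain JavaScript. The helper name splitName is hypothetical, and trimming the input first is my own addition, since a trailing space would otherwise leave the second group empty.

```javascript
// The regex from the text: group 1 lazily takes everything up to the last
// run of non-whitespace characters, which group 2 captures as the last name.
const NAME_REGEX = /(.*?)([^\s]*)$/;

// Hypothetical helper: returns the first and last name of a full name.
function splitName(fullName) {
  const match = NAME_REGEX.exec(fullName.trim());
  return { first: match[1].trim(), last: match[2] };
}

console.log(splitName('Simon Burger'));
// A three-word name: everything before the last word lands in "first",
// which is why such names need a manual check.
console.log(splitName('Ludwig van Beethoven'));
```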

LASTNAME in capitals + no separation

In another case I came across a text line where the last name was in capitals, but the first name was not easily separable from the following words.

I found that the following regex (which already includes the exceptions of the existing data) works in most of my cases:

^([A-ZÀÂÄÆÁÃÅĀÈÉÊËĘĖĒÎÏĪĮÍÌÔŌØÕÓÒÖŒÙÛÜŪÚŸÇĆČŃÑ\-'de]{2,20}\s(?:[A-ZÀÂÄÆÁÃÅĀÈÉÊËĘĖĒÎÏĪĮÍÌÔŌØÕÓÒÖŒÙÛÜŪÚŸÇĆČŃÑ\-']{2,15})?)\s*?([^\s]+\s(?:Huy|Christine|Flora|Deborah|Gösta)?)(.*)
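The following sketch shows what the three groups capture. The sample lines and the helper name parseLine are invented for illustration and do not come from the original data.

```javascript
// Regex copied from the text: group 1 = last name in capitals (one or two words),
// group 2 = first name (plus one of the listed second first names), group 3 = the rest.
const CAPS_NAME_REGEX = /^([A-ZÀÂÄÆÁÃÅĀÈÉÊËĘĖĒÎÏĪĮÍÌÔŌØÕÓÒÖŒÙÛÜŪÚŸÇĆČŃÑ\-'de]{2,20}\s(?:[A-ZÀÂÄÆÁÃÅĀÈÉÊËĘĖĒÎÏĪĮÍÌÔŌØÕÓÒÖŒÙÛÜŪÚŸÇĆČŃÑ\-']{2,15})?)\s*?([^\s]+\s(?:Huy|Christine|Flora|Deborah|Gösta)?)(.*)/;

// Hypothetical helper: returns the trimmed groups, or null if the line does not match.
function parseLine(line) {
  const match = CAPS_NAME_REGEX.exec(line);
  if (match === null) return null;
  return {
    lastName: match[1].trim(),
    firstName: match[2].trim(),
    rest: match[3].trim(),
  };
}

// Invented sample lines.
console.log(parseLine('MÜLLER Hans is a speaker'));
console.log(parseLine('NGUYEN Van Huy Hanoi')); // "Huy" is one of the listed exceptions
```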

Repeat/Fill value of previous row

Sometimes you want to fill empty values in a row with the last occurrence of the column that is not null. E.g. in the following example I want to fill rows 44-50 with the event_date “26.06.2023” and rows 53-54 with “28.08.2023”.

For that, a simple JavaScript snippet can be used:

// event_date_new keeps the previous row's value when event_date is null
var event_date_new;
if (event_date !== null) {
  event_date_new = event_date;
}

This results in a new column with all dates filled.
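The same fill-down logic can be sketched as a standalone function, which makes it easy to try outside of PDI/Hop. The function name fillDown and the sample rows are assumptions for illustration only.

```javascript
// Hypothetical helper: copies `field` into `newField`, replacing null
// values with the last non-null value seen in an earlier row.
function fillDown(rows, field, newField) {
  let last = null; // last non-null value seen so far
  return rows.map((row) => {
    if (row[field] !== null) {
      last = row[field];
    }
    return { ...row, [newField]: last };
  });
}

// Invented sample rows: only the first row of each event carries the date.
const events = [
  { id: 1, event_date: '26.06.2023' },
  { id: 2, event_date: null },
  { id: 3, event_date: null },
  { id: 4, event_date: '28.08.2023' },
  { id: 5, event_date: null },
];

console.log(fillDown(events, 'event_date', 'event_date_new'));
```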

Using Cookies for REST/HTTP Calls

Some websites require cookies to access certain pages.

When visiting a page for the first time, the generated cookie can be read from the response_header field.

The response_header might look something like this:

The cookie can be read out using the json-path “$.Set-Cookie”.

To use the cookie again, add it as a header to any HTTP/REST call, using the header name “Cookie”.
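As a rough sketch of the two steps (read “Set-Cookie”, send it back as “Cookie”), the following JavaScript works on an invented response header. The header content, the helper name cookieHeaderFrom and the handling of several cookies are assumptions; the real response_header may look different. Only the name=value part of each cookie is sent back, since attributes such as Path or HttpOnly are instructions for the client.

```javascript
// Invented response header as a JSON string.
const responseHeader = JSON.stringify({
  'Content-Type': 'text/html',
  'Set-Cookie': 'JSESSIONID=abc123; Path=/; HttpOnly',
});

// Hypothetical helper: builds the value for the "Cookie" request header.
function cookieHeaderFrom(responseHeaderJson) {
  // Equivalent to the json-path $.Set-Cookie
  const setCookie = JSON.parse(responseHeaderJson)['Set-Cookie'];
  if (setCookie === undefined) {
    return null;
  }
  // Assumption: the value is either one string or a list of strings.
  const values = Array.isArray(setCookie) ? setCookie : [setCookie];
  // Keep only "name=value" and drop attributes such as Path or HttpOnly.
  return values.map((value) => value.split(';')[0].trim()).join('; ');
}

// Headers for the next HTTP/REST call.
const requestHeaders = { Cookie: cookieHeaderFrom(responseHeader) };
console.log(requestHeaders);
```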

Row normaliser

If a table has several columns that should be transposed to rows, the row normaliser step can be used:


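Type field: the name of the new field that tells which column a value came from. The name can be anything.

Fieldname: the header names of the columns to be transposed.

Type: the value written to the type field for that column. It can be anything.

new field: the field that receives the transposed value of the column.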

Result:

Note that fields which are not added as fields in the row normaliser step are simply passed to the output without normalisation.
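The effect of the settings above can be sketched in plain JavaScript. This is only an illustration of the transposition, not how the step is implemented; the function name normaliseRows, the field names and the sample data are invented.

```javascript
// Hypothetical helper: for every input row, emit one output row per mapping.
// Fields that are not listed in the mappings are passed through unchanged.
function normaliseRows(rows, typeField, mappings) {
  const normalised = new Set(mappings.map((m) => m.fieldname));
  const out = [];
  for (const row of rows) {
    const passThrough = {};
    for (const [key, value] of Object.entries(row)) {
      if (!normalised.has(key)) {
        passThrough[key] = value;
      }
    }
    for (const m of mappings) {
      out.push({ ...passThrough, [typeField]: m.type, [m.newField]: row[m.fieldname] });
    }
  }
  return out;
}

// Invented sample table with two columns to transpose.
const table = [
  { product: 'A', sales_q1: 10, sales_q2: 20 },
  { product: 'B', sales_q1: 30, sales_q2: 40 },
];

const result = normaliseRows(table, 'quarter', [
  { fieldname: 'sales_q1', type: 'Q1', newField: 'sales' },
  { fieldname: 'sales_q2', type: 'Q2', newField: 'sales' },
]);
console.log(result);
```

Here “quarter” plays the role of the type field, “Q1”/“Q2” are the types, and “sales” is the new field; “product” is passed through as it is.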

Converting Dates – Masking

It’s a common task to convert dates found in external sources to the PDI/Apache Hop date format. While PDI/Apache Hop offers some typical conversion masks in a drop-down, you may need to create your own. Here is a cheat sheet on how to use the masking parameters:

Symbol | Meaning | Type | Example
G | Era | Text | “GG” -> “AD”
y | Year | Number | “yy” -> “03”, “yyyy” -> “2003”
M | Month | Text or Number | “M” -> “7”, “M” -> “12”, “MM” -> “07”, “MMM” -> “Jul”, “MMMM” -> “December”
d | Day in month | Number | “d” -> “3”, “dd” -> “03”
h | Hour (1-12, AM/PM) | Number | “h” -> “3”, “hh” -> “03”
H | Hour (0-23) | Number | “H” -> “15”, “HH” -> “15”
k | Hour (1-24) | Number | “k” -> “3”, “kk” -> “03”
K | Hour (0-11, AM/PM) | Number | “K” -> “3”, “KK” -> “03”
m | Minute | Number | “m” -> “7”, “m” -> “15”, “mm” -> “15”
s | Second | Number | “s” -> “15”, “ss” -> “15”
S | Millisecond (0-999) | Number | “SSS” -> “007”
E | Day in week | Text | “EEE” -> “Tue”, “EEEE” -> “Tuesday”
D | Day in year (1-365 or 1-366) | Number | “D” -> “65”, “DDD” -> “065”
F | Day of week in month (1-5) | Number | “F” -> “1”
w | Week in year (1-53) | Number | “w” -> “7”
W | Week in month (1-5) | Number | “W” -> “3”
a | AM/PM | Text | “a” -> “AM”, “aa” -> “AM”
z | Time zone | Text | “z” -> “EST”, “zzz” -> “EST”, “zzzz” -> “Eastern Standard Time”
X | Time zone offset | Text | “XXX” -> “-08:00”
' | Escape for text | Delimiter | “'hour' h” -> “hour 9”
'' | Single quote | Literal | “ss''SSS” -> “45'876”. Use two quote marks in a row to create a single quote in a string.
Source: https://help.hitachivantara.com/Documentation/Pentaho/9.2/Products/Common_Formats
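To show how a mask lines up with an input string, here is a tiny JavaScript sketch that understands only six of the symbols above (yyyy, MM, dd, HH, mm, ss). It is not the formatter that PDI/Hop use and it ignores time zones (dates are built as UTC); the function name parseWithMask is hypothetical.

```javascript
// Mask tokens supported by this sketch (a small subset of the table above).
const TOKENS = { yyyy: 'year', MM: 'month', dd: 'day', HH: 'hour', mm: 'minute', ss: 'second' };

// Hypothetical helper: parses `text` according to `mask`, returns a Date or null.
function parseWithMask(text, mask) {
  const names = [];
  // Escape regex metacharacters in the literal parts of the mask,
  // then turn every supported token into a group of digits.
  const pattern = mask
    .replace(/[.*+?^${}()|[\]\\\/]/g, '\\$&')
    .replace(/yyyy|MM|dd|HH|mm|ss/g, (token) => {
      names.push(TOKENS[token]);
      return `(\\d{${token.length}})`;
    });
  const match = new RegExp(`^${pattern}$`).exec(text);
  if (match === null) {
    return null; // the text does not fit the mask
  }
  const parts = { year: 1970, month: 1, day: 1, hour: 0, minute: 0, second: 0 };
  names.forEach((name, i) => {
    parts[name] = Number(match[i + 1]);
  });
  return new Date(Date.UTC(parts.year, parts.month - 1, parts.day, parts.hour, parts.minute, parts.second));
}

console.log(parseWithMask('26.06.2023', 'dd.MM.yyyy'));
console.log(parseWithMask('2023-08-28 15:07:09', 'yyyy-MM-dd HH:mm:ss'));
```

If the text does not fit the mask, for example “2023-06-26” with “dd.MM.yyyy”, the function returns null, which is a quick way to spot a wrong mask.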

Run PDI or Apache Hop on Apple M1

Currently PDI and Apache Hop do not run natively on M1. There are reports of people using the ARM-optimised Azul JDK, but this did not work for me, as the newest SWT library is not yet ARM compatible.

Instead, the following procedure worked for me on my Apple MacBook M1 (1st Gen).

Install JDK 8.

If other JVMs are installed, the correct one should be selected as follows:

Which Java versions do I have installed?

/usr/libexec/java_home -V

In my case this results in:

Matching Java Virtual Machines (3):

16.0.2 (x86_64) "Oracle Corporation" - "Java SE 16.0.2" /Library/Java/JavaVirtualMachines/jdk-16.0.2.jdk/Contents/Home

1.8.0_302 (arm64) "Azul Systems, Inc." - "Zulu 8.56.0.23" /Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home

1.8.0_301 (x86_64) "Oracle Corporation" - "Java SE 8" /Library/Java/JavaVirtualMachines/jdk1.8.0_301.jdk/Contents/Home

Set the JAVA_HOME path correctly:

unset JAVA_HOME

export JAVA_HOME=$(/usr/libexec/java_home -v "1.8.0_301")

Check the selected JVM:

java -version

echo $JAVA_HOME

Set terminal to Intel mode according to: https://stackoverflow.com/questions/67972804/pentaho-data-integration-not-starting-on-new-mac-m1

Then replace the SWT.jar library (as described in the link from the previous step).

With these changes, you can start PDI and Apache Hop on M1.