Split field to rows with lookahead regex

Given an HTML text with alternating paragraphs, where every second line actually belongs to the one before it:

<p>Simon Burger</p>
<p><a href="https://www.simonburger.com">Website SB</a></p>
<p>Petra Ohnsorg</p>
<p><a href="https://www.petraohnsorg.com">Website PO</a></p>

I want to split the text so that there is one row for each person. This can be achieved by using a regex with a negative lookahead:

<p>(?!<)

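This matches “<p>” only when it is not immediately followed by “<”, so the text is split at the start of each name paragraph but not at the paragraphs that contain the link.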
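As a rough illustration outside PDI, the same split can be sketched in Python. The variable names are assumptions made for this example; the sample text and the pattern are the ones shown above. Note that the matched “<p>” is consumed by the split, so each row starts directly with the name.

```python
import re

# Sample text from the example above.
html_text = (
    '<p>Simon Burger</p>\n'
    '<p><a href="https://www.simonburger.com">Website SB</a></p>\n'
    '<p>Petra Ohnsorg</p>\n'
    '<p><a href="https://www.petraohnsorg.com">Website PO</a></p>\n'
)

# Split at every "<p>" that is NOT immediately followed by "<".
# Empty pieces (such as the one before the very first "<p>") are dropped.
rows = [
    part.strip()
    for part in re.split(r"<p>(?!<)", html_text)
    if part.strip()
]

for row in rows:
    print(repr(row))
```

Each resulting row holds one person: the name, followed by the paragraph with the link.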

The fields of the rows created this way can then be retrieved by using another regex:
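The original expression is not reproduced here, so the following is only a sketch of what such a second regex could look like, written in Python. The group names (name, url, label) are hypothetical, and the sketch assumes every row has exactly the shape produced by the split described above.

```python
import re

# One row as it looks after the split: name, then the link paragraph.
row = 'Simon Burger</p>\n<p><a href="https://www.simonburger.com">Website SB</a></p>'

# Hypothetical pattern with named capture groups for the three fields.
FIELD_PATTERN = re.compile(
    r'^(?P<name>[^<]+)</p>\s*'
    r'<p><a href="(?P<url>[^"]+)">(?P<label>[^<]+)</a></p>'
)

match = FIELD_PATTERN.search(row)
# Fall back to an empty dict if the row does not have the expected shape.
fields = match.groupdict() if match else {}
print(fields)
```

In PDI, capture groups like these could be mapped to output fields, for example in a regex evaluation step; the exact configuration depends on the transformation.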

Row normaliser

Given a table with several columns that should be transposed to rows, the Row normaliser step can be used:

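Type field: the name of the new field that holds the type value; the name can be anything

Fieldname: the column header names go here

Type: the value written to the type field for this column; it can be anything

new field: the field into which the value of the column will be transposed.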

Result:

Note that fields which are not listed as fields in the Row normaliser step are simply passed through to the output without normalization.
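To make the effect of these settings concrete, here is a minimal Python sketch of the normalisation logic. It is not PDI code; the function name, the sample columns (person, email, phone) and the field names contact_type and contact_value are all assumptions made for illustration.

```python
def normalise(rows, type_field, mapping, new_field):
    """Turn the columns named in `mapping` into one output row each.

    mapping: {column name ("Fieldname"): type value ("Type")}
    Columns not listed in `mapping` are passed through unchanged.
    """
    output = []
    for row in rows:
        passthrough = {k: v for k, v in row.items() if k not in mapping}
        for column, type_value in mapping.items():
            output.append(
                {**passthrough, type_field: type_value, new_field: row[column]}
            )
    return output


# Hypothetical input: one row with two columns to transpose.
table = [{"person": "Simon Burger", "email": "simon@example.org", "phone": "123"}]

result = normalise(
    table,
    type_field="contact_type",
    mapping={"email": "email", "phone": "phone"},
    new_field="contact_value",
)
for r in result:
    print(r)
```

The person column is not part of the mapping, so it is repeated on every output row, which mirrors the pass-through behaviour described in the note.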

Row denormaliser (Transpose)

Given a list of key-value pairs, such as metadata extracted from HTML meta tags:

To move the values into columns, the Row denormaliser step of PDI or Apache Hop can be used:

This will transpose the rows to columns:

In cases where there is more than one value for a key, those values can also be aggregated, for example with “Concatenate strings separated by ,”.
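The transposition, including the concatenation of repeated keys, can be sketched in plain Python as follows. This is not PDI code; the function name, the field names (page, key, value) and the sample metadata are assumptions for illustration only.

```python
from collections import defaultdict


def denormalise(rows, group_field, key_field, value_field, separator=","):
    """Pivot key-value rows into one row per group.

    Repeated keys within a group are concatenated with `separator`.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row[group_field]][row[key_field]].append(row[value_field])
    return [
        {group_field: group,
         **{key: separator.join(values) for key, values in keys.items()}}
        for group, keys in grouped.items()
    ]


# Hypothetical metadata rows extracted from one page.
meta_rows = [
    {"page": "a.html", "key": "author", "value": "Simon Burger"},
    {"page": "a.html", "key": "keywords", "value": "art"},
    {"page": "a.html", "key": "keywords", "value": "design"},
]

pivoted = denormalise(meta_rows, "page", "key", "value")
print(pivoted)
```

Here the key keywords occurs twice for the same page, so its values end up in one column, joined by a comma.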

Unescape / Convert HTML Characters

When working with HTML, you may want to change a string from HTML to the default encoding (and vice versa). Pentaho Data Integration provides that option within the Calculator step:

This allows the following change:

Detlef G&uuml;nther <-> Detlef Günther
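For comparison, the same conversion can be sketched in Python with the standard library. This only illustrates the idea and is not what the Calculator step runs internally; the helper escape_html is a hypothetical name, and it assumes that named entities are wanted for every character that has one.

```python
import html
from html.entities import codepoint2name


def escape_html(text):
    """Replace every character that has a named HTML entity with that entity."""
    return "".join(
        f"&{codepoint2name[ord(ch)]};" if ord(ch) in codepoint2name else ch
        for ch in text
    )


escaped = escape_html("Detlef Günther")
unescaped = html.unescape("Detlef G&uuml;nther")

print(escaped)
print(unescaped)
```

The standard html.escape function only handles the markup characters (&, <, > and quotes), which is why the sketch uses the entity table for the escaping direction.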

Remove a pattern in string with RegEx

In the “Replace in string” step you can also use RegEx to remove variable content.

https://regexr.com/ can be used to test your regex.
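As a small sketch of the idea, the following Python snippet removes variable content by replacing every match of a pattern with an empty string. The pattern (any HTML tag) and the input string are assumptions chosen for illustration, not taken from an actual transformation.

```python
import re

# Input with variable content: the tags and the URL differ from row to row.
value = '<p><a href="https://www.simonburger.com">Website SB</a></p>'

# Replace every tag (anything between "<" and ">") with nothing.
cleaned = re.sub(r"<[^>]+>", "", value)
print(cleaned)
```

The same pattern can be pasted into a regex tester first to check what it matches before it is used in the step.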

Concat Strings

There are several ways in Pentaho Data Integration to combine multiple fields:

a) Step: Concat fields

This step is very straightforward if you want to concatenate fields using the same separator:

Concat Last and First name to a field “person_p3”, with separator “, ”
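Expressed as a minimal Python sketch, the configuration above amounts to joining the selected fields with one shared separator. The input field names last_name and first_name are assumptions; only the target field person_p3 and the separator come from the example.

```python
# Hypothetical input row with the two fields to combine.
row = {"last_name": "Burger", "first_name": "Simon"}

fields_to_concat = ["last_name", "first_name"]  # order matters
separator = ", "

# Join the selected fields with the shared separator into the target field.
row["person_p3"] = separator.join(row[name] for name in fields_to_concat)
print(row["person_p3"])
```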

b) Step: Formula

Similar to Excel, you can create a new string from your fields and ad hoc defined strings in the Formula step:

c) Step: User Defined Java Expression

Similarity of person names

If you want to compare strings using fuzzy logic, you can either use the “Fuzzy match” step or calculate the similarity within the “Calculator” step.

Calculator Step: Testing various algorithms

For comparing person names, I found that the “JaroWinkler similitude” algorithm with a score > 0.75 provides acceptable results:

Results after calculation of similarities (sorted by Jaro Winkler)

Note: In this example “Grams, C. M” is obviously similar to “Grams, Christian Michael Warnfried”. With the Levenshtein distance, this similarity would not have been found.

To filter out false positives, you can additionally run the similarity check on the last name only.
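The two-stage check can be sketched in Python as follows. This is a textbook Jaro-Winkler implementation written for illustration; PDI's own implementation may differ in details, so scores are not guaranteed to be identical. The function names, the second threshold for the last name, and the assumption that names are written as “Last, First” are all choices made for this sketch.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a common prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def is_probable_match(name_a, name_b, threshold=0.75, last_name_threshold=0.75):
    """Require both the full names and the last names to be similar."""
    last_a = name_a.split(",")[0].strip()
    last_b = name_b.split(",")[0].strip()
    return (jaro_winkler(name_a, name_b) > threshold
            and jaro_winkler(last_a, last_b) > last_name_threshold)


print(is_probable_match("Grams, C. M", "Grams, Christian Michael Warnfried"))
```

The last-name check is stricter in practice if a higher threshold is chosen for it; 0.75 is used here only because it is the value mentioned for the full name.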