Row denormaliser (Transpose)

Given a list of key-value pairs, such as metadata extracted from HTML meta tags:

To move the values into columns, the Row denormaliser step of PDI or Apache Hop can be used:

This will transpose the rows to columns:

In cases where there is more than one value for a key, those values can also be aggregated, for example with “Concatenate strings separated by ,”.
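To make the transposition concrete, here is a minimal sketch in plain Python of what the denormalisation does conceptually. It is not how PDI or Hop implement the step; the function name `denormalise`, the field names (`url`, `name`, `content`) and the sample rows are made up for illustration.

```python
from collections import defaultdict

def denormalise(rows, group_field, key_field, value_field, separator=","):
    """Turn key-value rows into one row per group, with one column per key.

    Several values for the same key are concatenated with `separator`.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row[group_field]][row[key_field]].append(row[value_field])

    result = []
    for group, keys in grouped.items():
        out = {group_field: group}
        for key, values in keys.items():
            out[key] = separator.join(values)
        result.append(out)
    return result

# Hypothetical input: one row per meta tag of a page
rows = [
    {"url": "page-1", "name": "title", "content": "Row denormaliser"},
    {"url": "page-1", "name": "keywords", "content": "etl"},
    {"url": "page-1", "name": "keywords", "content": "pentaho"},
    {"url": "page-2", "name": "title", "content": "Concat Strings"},
]

for record in denormalise(rows, "url", "name", "content"):
    print(record)
```

Each distinct key becomes a column, and the two `keywords` rows of `page-1` end up in a single comma-separated value.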


Concat Strings

There are several ways in Pentaho Data Integration to combine multiple fields into one string:

a) Step: Concat fields

This step is very straightforward if you want to concat fields using the same separator:

Concat last and first name into a field “person_p3”, with separator “, ”

b) Step: Formula

Similar to Excel, you can create a new string from your fields and ad hoc defined strings in the Formula step:
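For readers without PDI at hand, the difference between the two approaches above can be sketched in plain Python. This only mirrors the idea, not the steps themselves; the helper `concat_fields` and the field names `last_name` and `first_name` are assumptions for illustration.

```python
def concat_fields(row, fields, separator):
    """Join the given fields of a row using one and the same separator."""
    return separator.join(str(row[name]) for name in fields)

# Hypothetical input row
row = {"last_name": "Grams", "first_name": "Christian"}

# a) same separator between all fields, as in "Concat fields"
person_p3 = concat_fields(row, ["last_name", "first_name"], ", ")

# b) fields mixed freely with ad hoc strings, as in a formula
label = f"{row['first_name']} {row['last_name']} (author)"

print(person_p3)
print(label)
```

The first variant is enough when every field is joined by the same separator; the second is closer to a formula, where literals and fields can be combined in any order.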

c) Step: User Defined Java Expression

Similarity of person names

If you want to compare strings using fuzzy logic, you can either use the “Fuzzy match” step or calculate the similarity within the “Calculator” step.

Calculator Step: Testing various algorithms

For comparing person names, I found that the “JaroWinkler similitude” algorithm with a score > 0.75 provides acceptable results:

Results after calculation of similarities (sorted by Jaro Winkler)

Note: In this example “Grams, C. M” is obviously similar to “Grams, Christian Michael Warnfried”. With the Levenshtein distance, this similarity would not have been found.

To filter out false positives, you can additionally run the similarity check on the last name only.
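The comparison described above can be tried outside of PDI with a small Python sketch. It uses a textbook Jaro-Winkler and Levenshtein implementation, so the scores may differ slightly from what the Calculator step reports. The helper `is_probable_match`, the use of the same 0.75 threshold for the last name, and the name “Meyer, C. M” are assumptions for illustration.

```python
def jaro(s1, s2):
    """Jaro similarity between 0.0 (no match) and 1.0 (identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # look for the same character within the allowed window of s2
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity with a bonus for a common prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * scaling * (1 - score)


def levenshtein(s1, s2):
    """Number of single-character edits needed to turn s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        cur = [i]
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def is_probable_match(name1, name2, threshold=0.75):
    """Assumed rule: full name AND last name must both score above threshold."""
    last1 = name1.split(",")[0].strip()
    last2 = name2.split(",")[0].strip()
    return (jaro_winkler(name1, name2) > threshold
            and jaro_winkler(last1, last2) > threshold)


short, full = "Grams, C. M", "Grams, Christian Michael Warnfried"
print(jaro_winkler(short, full), levenshtein(short, full))
print(is_probable_match(short, full))
print(is_probable_match(short, "Meyer, C. M"))
```

In this sketch the abbreviated and the full form of the name score above 0.75 although the edit distance is large compared to the shorter string. “Grams, C. M” and “Meyer, C. M” also score just above 0.75 on the full string because they share the initials, but the last-name check rejects that pair.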