Unespace / Convert HTML Character

When working with HTML you may want to change a string from HTML to default encoding (and vice versa). Pentaho data integration provides that option within the calculator step:

This allows the following change:

Detlef Günther <-> Detlef G√ľnther
Advertisement

Similarity of person names

If you want to compare strings using fuzzy logic, you either can use the step “Fuzzy match” or calculate the similarity within the “Calculator” step.

Calculator Step: Testing various algorithms

To comparing person names I found the “JaroWinkler similitude” algorithm with a score > 0.75 providing acceptable results:

Results after calculation of similarities (sorted by Jaro Winkler)

Note: In this example “Grams, C. M” is obviously similar to “Grams, Christian Michael Warnfried”. With the Levenshtein distance, this similarity would not have been found.

In order to filter out false positives, you can run additionally the similarity check also on just the last name only.