Add a sub-transformation with mapping steps

In the last post I created a sub-transformation with a “transformation executor” step. It works, but I had to look up the results from the sub-transformation in a later step. Pentaho Data Integration (PDI), however, offers a more elegant way to add a sub-transformation.

I will use the same example as previously.

a) Sub-Transformation

In your sub-transformation you insert a “Mapping input specification” step at the beginning and define in this step which input fields you expect. At the end you add a “Mapping output specification” step, where you don’t have to specify anything.

Add Mapping steps at the beginning and end of the sub-transformation – Publication_Date_Sub_Mapping.ktr

b) Parent/Main-transformation

In the main transformation you can then add the step “Simple mapping (sub-transformation)”.

The “Simple mapping (sub-transformation)” step in the category “Mapping”

In this step you can map the fields of the parent transformation to the expected fields that you have defined in the input step of the sub-transformation. If you use the same field names, PDI provides a nice auto-mapping feature in the step options: “Mapping…” -> “Guess…”

Adding “Simple mapping (sub-transformation)” step in Parent/Main transformation – Publication_Date_Main_Mapping.ktr

It is not necessary to fill in the “Output” tab, because in this case all fields created in the sub-transformation become available in the following steps of the parent/main transformation.

The advantage here is that the fields that you have not passed on to the sub-transformation remain directly available in the following steps of the parent/main transformation.
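To make the data flow easier to picture, here is a small conceptual sketch in Python. It is not PDI code and does not use any PDI API; it only models the behaviour described above under my own assumptions. The field names (pub_date, date_string, year), the mapping and the stand-in sub-transformation are hypothetical.

```python
def simple_mapping(row, input_mapping, sub_transformation):
    """Conceptual model of a mapping step for one row (a dict of fields).

    input_mapping maps parent field names to the field names that the
    sub-transformation expects in its input specification.
    """
    # Only the mapped fields are handed to the sub-transformation,
    # renamed to the names it expects.
    sub_input = {sub_name: row[parent_name]
                 for parent_name, sub_name in input_mapping.items()}
    sub_output = sub_transformation(sub_input)

    # All parent fields stay available, including those never passed on.
    result = dict(row)
    # Fields newly created inside the sub-transformation are added.
    for name, value in sub_output.items():
        if name not in sub_input:
            result[name] = value
    return result


def extract_year(fields):
    """Hypothetical sub-transformation: derives a year from a date string."""
    out = dict(fields)
    out["year"] = int(fields["date_string"][:4])
    return out


row = {"title": "Some title", "pub_date": "2015-03-01"}
print(simple_mapping(row, {"pub_date": "date_string"}, extract_year))
```

In this model the field title never enters the sub-transformation, yet it is still present in the result next to the newly created field year.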

Similarity of person names

If you want to compare strings using fuzzy logic, you can either use the step “Fuzzy match” or calculate the similarity within the “Calculator” step.

Calculator Step: Testing various algorithms

For comparing person names, I found that the “JaroWinkler similitude” algorithm with a score > 0.75 provides acceptable results:

Results after calculation of similarities (sorted by Jaro Winkler)

Note: In this example “Grams, C. M” is obviously similar to “Grams, Christian Michael Warnfried”. With the Levenshtein distance, this similarity would not have been found.

To filter out false positives, you can additionally run the similarity check on the last name only.
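For readers who want to experiment outside of PDI, the following is a minimal Python sketch of the idea. It uses a textbook Jaro-Winkler implementation, so the scores may differ slightly from what the “Calculator” step produces. The helper names (last_name, is_probable_match), the assumption that names are written as “Last, First” and the second name pair are mine and only serve as illustration.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters only count as a match within this distance of each other.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters in a different order count as half a transposition.
    m1 = [c for c, u in zip(s1, used1) if u]
    m2 = [c for c, u in zip(s2, used2) if u]
    transpositions = sum(a != b for a, b in zip(m1, m2)) / 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity with a bonus for a common prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def levenshtein(s1, s2):
    """Number of single-character edits needed to turn s1 into s2."""
    previous = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        current = [i]
        for j, b in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (a != b)))  # substitution
        previous = current
    return previous[-1]


def last_name(name):
    """Assumes the 'Last, First' notation used in the example above."""
    return name.split(",")[0].strip()


def is_probable_match(a, b, threshold=0.75):
    """Require both the full name and the last name to exceed the threshold."""
    return (jaro_winkler(a, b) > threshold
            and jaro_winkler(last_name(a), last_name(b)) > threshold)


short, full = "Grams, C. M", "Grams, Christian Michael Warnfried"
print(jaro_winkler(short, full), levenshtein(short, full))
print(is_probable_match(short, full))
# Hypothetical pair: same first names, different last names.
print(is_probable_match("Meyer, Christian Michael", "Grams, Christian Michael"))
```

The Levenshtein distance grows with every character that is missing in the abbreviated name, which is why it penalizes such pairs so heavily, whereas Jaro-Winkler rewards the shared prefix. The additional last-name check rejects pairs that are only similar because the first names agree.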