Initial capital of a sentence

A sentence like:

neue Technologien zur Verkehrsoptimierung

should get an initial capital like:

Neue Technologien zur Verkehrsoptimierung

This can be achieved in Pentaho / Apache Hop with a Java expression:

text == null ? null : text.substring(0,1).toUpperCase() + text.substring(1)

from: https://forums.pentaho.com/threads/218295-Capitalize-first-letter-of-a-string/
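For testing outside PDI, the same logic can be sketched as a plain Java method. The class and method names are hypothetical, and the extra check for empty strings is an addition: in plain Java, substring(0,1) on an empty string throws a StringIndexOutOfBoundsException.

```java
class InitialCapital {
    // Hypothetical helper: same logic as the expression above,
    // plus a guard that returns empty strings unchanged.
    static String capitalizeFirst(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        return text.substring(0, 1).toUpperCase() + text.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(capitalizeFirst("neue Technologien zur Verkehrsoptimierung"));
    }
}
```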


Extract email from website

How can you extract the email of the corresponding author of a publication, like https://doi.org/10.1039/C7CS00709D, with Pentaho Data Integration?

https://doi.org/10.1039/C7CS00709D as rendered HTML
https://doi.org/10.1039/C7CS00709D (excerpt of HTML source code)
  1. Get the HTML of the publications via REST Step, store it in one field.
  2. Extract email via “Regex evaluation” step using the Regex
    .*mailto:([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+).*
    with the step options:
    • Enable dotall mode
    • Enable multiline mode

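The email address matched by the capture group is written to the field email. Because the leading .* is greedy, this is the last mailto: address in the HTML if there are several; start the regex with .*? instead to capture the first one.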
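The behaviour of this regex can be checked outside PDI with a small Java sketch. It assumes that the step matches the whole field with java.util.regex, as Matcher.matches() does; the class name and the sample HTML are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MailtoExtractor {
    // Capture group taken from the regex used in the step.
    static final String EMAIL = "([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\\.[a-zA-Z0-9_-]+)";
    // Corresponds to the options "Enable dotall mode" and "Enable multiline mode".
    static final int FLAGS = Pattern.DOTALL | Pattern.MULTILINE;

    // Original regex: the greedy leading .* skips ahead to the last mailto: link.
    static final Pattern GREEDY = Pattern.compile(".*mailto:" + EMAIL + ".*", FLAGS);
    // Variant with a reluctant leading .*? that stops at the first mailto: link.
    static final Pattern RELUCTANT = Pattern.compile(".*?mailto:" + EMAIL + ".*", FLAGS);

    // Returns the captured address, or null if the whole input does not match.
    static String extract(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical HTML snippet with two mailto: links.
        String html = "<a href=\"mailto:first@example.org\">A</a>\n"
                    + "<a href=\"mailto:second@example.org\">B</a>";
        System.out.println(extract(GREEDY, html));
        System.out.println(extract(RELUCTANT, html));
    }
}
```

With several mailto: links in the page, the greedy form yields the last address and the reluctant form the first one.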

Alternatively, the online service https://www.convertcsv.com/email-extractor.htm can extract emails from several websites at once.

Execute a shell script

To trigger a shell script or a terminal command after a transformation, you have to create a job (this is not available in a transformation). In the following scenario I wanted to convert an HTML file to XML using tidy.

So I define a job in which the file is created, add the step “Execute a shell script…”, and enter the tidy command in its “Script” tab:

tidy -asxhtml -numeric < file_old.html > file_new.xml

Simple terminal command

Provided tidy hasn’t failed, the “file_old.html” has been converted to “file_new.xml” in your job directory.

Create URL for REST-query

When you want to query a REST API but have values that need URL encoding first, you can use a User Defined Java Expression:

Transformation with three steps: Generate Rows, User Defined Java Expression, REST query
Generate URL to query Crossref REST API (Download: Transformation)
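The encoding done in the User Defined Java Expression step can be sketched in plain Java as follows. This is an illustration, not the expression from the downloadable transformation: the class and method names, the query parameters and the sample value are assumptions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class CrossrefUrl {
    // Assumed endpoint and parameters for a bibliographic search against Crossref.
    static final String BASE = "https://api.crossref.org/works?rows=1&query.bibliographic=";

    // URL-encodes the free-text value and appends it to the base URL.
    static String buildQueryUrl(String citation) {
        try {
            return BASE + URLEncoder.encode(citation, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input containing spaces and other characters that need encoding.
        System.out.println(buildQueryUrl("Neue Technologien & Verkehr (2017)"));
    }
}
```

URLEncoder produces form-style encoding, so spaces become "+" and characters such as "&" become percent escapes. In the step itself, the corresponding expression would be roughly java.net.URLEncoder.encode(field, "UTF-8") prefixed with the base URL, with field standing in for your own field name.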

In this particular example it is necessary as PDI would throw the error (“Illegal character in path”) if you query the following URL unmodified from a string:

Concat Strings

There are several ways in Pentaho Data Integration to concatenate fields:

a) Step: Concat fields

This step is very straightforward if you want to concat fields using the same separator:

Concat Last and First name to a field “person_p3”, with separator “, “

b) Step: Formula

Similar to Excel you can create a new string using your fields and ad hoc defined strings in the Formula step:

c) Step: User Defined Java Expression
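As a rough sketch of what such an expression does, here is the concatenation of last and first name with the separator “, “ as a plain Java method. The class, method and parameter names are hypothetical, and the null handling is an assumption about the desired behaviour, not something the step does for you.

```java
class ConcatName {
    // Joins last and first name with ", " and falls back to the
    // non-null part if one of the two values is missing.
    static String concat(String lastName, String firstName) {
        if (lastName == null) {
            return firstName;
        }
        if (firstName == null) {
            return lastName;
        }
        return lastName + ", " + firstName;
    }

    public static void main(String[] args) {
        System.out.println(concat("Doe", "Jane"));
    }
}
```

In the step itself, the equivalent of the last line would be an expression like last_name + ", " + first_name, with your own field names in place of these two.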