Extract PDF Content with Tika

The recently released Apache Hop 1.1 comes with a very handy new transform that uses Apache Tika to extract the metadata and content of different file formats, such as PDF.

The transform step is available in a pipeline:

Defining Input

It can output the content as plain text, XML, HTML, or JSON:



Output Options

Result:

Initial capital of a sentence

Given a sentence like:

neue Technologien zur Verkehrsoptimierung

the first letter should be capitalized:

Neue Technologien zur Verkehrsoptimierung

This can be achieved in Pentaho / Apache Hop with a Java expression, where text is the name of the input field:

text == null || text.isEmpty() ? text : text.substring(0,1).toUpperCase() + text.substring(1)

from: https://forums.pentaho.com/threads/218295-Capitalize-first-letter-of-a-string/
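
To try the expression outside of a pipeline, it can be wrapped in a small Java class. This is only a sketch: the class name InitialCapital and the method name capitalizeFirst are made up for illustration, and using Locale.ROOT is my own assumption, added so the result does not depend on the default locale of the JVM.

```java
import java.util.Locale;

final class InitialCapital {

    // Hypothetical helper that mirrors the one-line expression from the post.
    static String capitalizeFirst(String text) {
        // Pass null and empty input through unchanged; substring(0, 1)
        // would throw StringIndexOutOfBoundsException on an empty string.
        if (text == null || text.isEmpty()) {
            return text;
        }
        // Assumption: Locale.ROOT, so the upper-casing is locale independent.
        return text.substring(0, 1).toUpperCase(Locale.ROOT) + text.substring(1);
    }

    public static void main(String[] args) {
        // Prints the sentence with its first letter capitalized.
        System.out.println(capitalizeFirst("neue Technologien zur Verkehrsoptimierung"));
    }
}
```

Only the first character is touched; the rest of the sentence keeps its original casing, which matters for German nouns such as "Technologien".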