
Extract PDF Content with Tika

The recently released Apache Hop 1.1 comes with a very handy new transform that uses Apache Tika to extract the metadata and content of different file formats, such as PDF.

The transform is available in a pipeline.

It can output the content as plain text, XML, HTML, or JSON.

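
To get a feel for the four output formats outside of Hop, here is a minimal Python sketch that asks Tika for each of them. It does not show how the Hop transform works internally. It assumes a separate Tika Server is running locally on its default port 9998, and the names `TIKA_SERVER`, `FORMATS`, `build_request` and `extract` are made up for this example. The mapping of formats to endpoints and `Accept` headers is also an assumption and may need adjusting for your Tika version.

```python
import urllib.request

# Assumption: a Tika Server is running locally on its default port.
TIKA_SERVER = "http://localhost:9998"

# Output format -> (endpoint, Accept header).
# This mapping is an assumption of the sketch, not taken from Hop.
FORMATS = {
    "text": ("/tika", "text/plain"),
    "html": ("/tika", "text/html"),
    "xml": ("/tika", "text/xml"),
    "json": ("/rmeta/text", "application/json"),
}


def build_request(data: bytes, fmt: str, server: str = TIKA_SERVER) -> urllib.request.Request:
    """Build (but do not send) the PUT request for one output format."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    endpoint, accept = FORMATS[fmt]
    return urllib.request.Request(
        server + endpoint,
        data=data,
        headers={"Accept": accept},
        method="PUT",
    )


def extract(path: str, fmt: str = "text", server: str = TIKA_SERVER) -> str:
    """Send a file to the Tika Server and return the extracted content."""
    with open(path, "rb") as fh:
        request = build_request(fh.read(), fmt, server)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the server running, `print(extract("example.pdf", "json"))` would print the metadata and text content of a (hypothetical) `example.pdf` as JSON.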

Christian Gutknecht, January 26, 2022
