Pentaho & Apache Hop Hints

Notes about using Pentaho Data Integration and Apache Hop as a Non-Programmer

Extract PDF Content with Tika

Christian Gutknecht | Apache Tika | January 26, 2022 | 1 Minute

The recently released Apache Hop 1.1 comes with a very handy new transform that uses Apache Tika to extract the metadata and content of various file formats, such as PDF.

The transform is available in a pipeline:

[Screenshot: Defining Input]

It can output the content as plain text, XML, HTML, or JSON:

[Screenshot: Output Options]

[Screenshot: Result]
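For readers who want to try the same kind of extraction outside the Hop GUI, here is a minimal Python sketch that talks to a standalone Apache Tika Server over HTTP. It is not what the Hop transform does internally. It assumes a Tika Server is already running locally on its default port 9998. The mapping of output formats to endpoints and `Accept` headers reflects my understanding of the Tika Server REST API and should be checked against the Tika documentation for your version. The names `FORMATS`, `build_request` and `extract` are made up for this illustration.

```python
import urllib.request

# Assumption: a Tika Server is running locally on its default port.
TIKA_SERVER = "http://localhost:9998"

# Output format -> (endpoint, Accept header).
# Assumed mapping; verify against the Tika Server docs for your version.
FORMATS = {
    "text": ("/tika", "text/plain"),
    "html": ("/tika", "text/html"),
    "xml": ("/tika", "text/xml"),
    "json": ("/rmeta", "application/json"),
}


def build_request(data, fmt="text", server=TIKA_SERVER):
    """Prepare (but do not send) a PUT request carrying the document bytes."""
    endpoint, accept = FORMATS[fmt]
    return urllib.request.Request(
        server + endpoint,
        data=data,
        method="PUT",
        headers={"Accept": accept},
    )


def extract(path, fmt="text"):
    """Send a file to the server and return the response body as a string."""
    with open(path, "rb") as fh:
        request = build_request(fh.read(), fmt)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server running, `extract("example.pdf")` would return the plain text of a (hypothetical) file `example.pdf`, and `extract("example.pdf", "json")` would return content and metadata as JSON.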

