Pentaho & Apache Hop Hints

Notes about using Pentaho Data Integration and Apache Hop as a Non-Programmer

Month: January 2022

Extract PDF Content with Tika

The recently released Apache Hop 1.1 comes with a very handy new transform that lets you extract the metadata and content of different file formats, such as PDF, using Apache Tika.

The transform step is available in a pipeline:

Defining Input

It lets you output the content as plain text, XML, HTML, or JSON:

Output Options

Result:

Posted by Christian Gutknecht in Apache Tika on January 26, 2022

Initial capital of a sentence

Given a sentence like:

neue Technologien zur Verkehrsoptimierung

it should get an initial capital, like:

Neue Technologien zur Verkehrsoptimierung

This can be achieved in Pentaho / Apache Hop with a Java expression in the User defined Java expression step:

(text == null || text.isEmpty()) ? text : text.substring(0,1).toUpperCase() + text.substring(1)

Source: https://forums.pentaho.com/threads/218295-Capitalize-first-letter-of-a-string/
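
To try the same logic outside of Pentaho / Apache Hop, here is a minimal sketch as a standalone Java class. The class name `InitialCapital` and the method name `capitalizeFirst` are made up for illustration and are not part of Hop. The sketch guards against both null and empty input, since `substring(0,1)` throws on an empty string, and it passes `Locale.ROOT` to `toUpperCase` so the result does not depend on the machine's default locale (an assumption that may not suit every use case).

```java
import java.util.Locale;

public class InitialCapital {

    // Hypothetical helper: upper-cases the first character and leaves the rest untouched.
    // Null and empty strings are returned unchanged.
    public static String capitalizeFirst(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        return text.substring(0, 1).toUpperCase(Locale.ROOT) + text.substring(1);
    }

    public static void main(String[] args) {
        // Example sentence taken from the post.
        System.out.println(capitalizeFirst("neue Technologien zur Verkehrsoptimierung"));
    }
}
```

Only the first character is changed, so a sentence that already starts with a capital letter comes back unchanged.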

Posted by Christian Gutknecht in User defined Java expression on January 6, 2022
