
Pentaho & Apache Hop Hints

Notes about using Pentaho Data Integration and Apache Hop as Non-Programmer



Scraping Elsevier abstract pages

Get a list of Elsevier’s IDs via Crossref

To get a first impression of the content and its size, I used the facet search of the Crossref REST API:

https://api.crossref.org/works?filter=member:78&facet=published:*
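The same facet query can also be built and read in a script. This is a minimal sketch in Python. It assumes the per-year counts sit under message.facets.published.values in the response. The function names and the sample numbers are made up for illustration.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def facet_url(member_id, facet="published:*"):
    """Build the facet query for one Crossref member (78 is the ID used above)."""
    # safe=":*" keeps the colon and asterisk readable instead of percent-encoding them
    query = urlencode({"filter": f"member:{member_id}", "facet": facet}, safe=":*")
    return f"{CROSSREF_WORKS}?{query}"


def counts_per_year(response):
    """Return {year: count} from a parsed facet response (assumed layout)."""
    values = response["message"]["facets"]["published"]["values"]
    return {int(year): count for year, count in values.items()}


# Made-up response fragment, only to show the assumed shape
sample = {"message": {"facets": {"published": {"values": {"2019": 100, "2020": 200}}}}}
print(facet_url(78))
print(counts_per_year(sample))
```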

For 2020 there are 542k entries. To download the metadata (in my case just the DOI, the date of DOI registration and the Elsevier internal ID), I again use the Crossref REST API.

A single Crossref query returns at most 1,000 DOIs. With cursors, however, it is possible to keep looping and get about 100k DOIs before the API times out. So to get all publications from 2020, I created monthly batches based on the created date, like:

https://api.crossref.org/works?filter=member:78,from-created-date:2020-01-01,until-created-date:2020-01-31&select=DOI,created,alternative-id,&rows=1000&cursor=*
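The twelve monthly start URLs for a year can be generated instead of typed by hand. This is a rough sketch in Python, not part of the PDI job. The function name monthly_urls is hypothetical, and the trailing comma after alternative-id is left out here.

```python
import calendar

BASE = "https://api.crossref.org/works"


def monthly_urls(year, member_id=78):
    """Return one start URL per month, each filtered on the created date."""
    urls = []
    for month in range(1, 13):
        # monthrange returns (weekday of the 1st, number of days in the month)
        last_day = calendar.monthrange(year, month)[1]
        filt = (
            f"member:{member_id},"
            f"from-created-date:{year}-{month:02d}-01,"
            f"until-created-date:{year}-{month:02d}-{last_day:02d}"
        )
        urls.append(
            f"{BASE}?filter={filt}&select=DOI,created,alternative-id&rows=1000&cursor=*"
        )
    return urls


for url in monthly_urls(2020)[:2]:
    print(url)
```

Using calendar.monthrange avoids hard-coding the month lengths, so February 2020 correctly ends on the 29th.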

In PDI I’ve created a job that handles the cursor and repeats the transformation with the REST query as long as a new cursor comes back from Crossref.

Pentaho Job: Looping through cursor
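For readers who want to try the loop outside of PDI, the cursor handling could look roughly like this in Python. This is a sketch under two assumptions: the response carries the records in message.items and the follow-up cursor in message.next-cursor, and an empty items list marks the end. The names fetch_json and iter_items are hypothetical. The base URL is expected without the cursor parameter.

```python
import json
import urllib.parse
import urllib.request


def fetch_json(url):
    """Download one page and parse it as JSON."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def iter_items(base_url, fetch=fetch_json):
    """Yield all records of a query by following the cursor page by page."""
    cursor = "*"  # the first request always starts with cursor=*
    while True:
        # Cursors can contain characters such as "+" or "/", so URL-encode them
        url = f"{base_url}&cursor={urllib.parse.quote(cursor, safe='*')}"
        message = fetch(url)["message"]
        items = message.get("items", [])
        if not items:
            break  # assumed end signal: a page without records
        yield from items
        cursor = message.get("next-cursor")
        if not cursor:
            break
```

Passing the download function as the fetch parameter keeps the loop testable without network access: any function that maps a URL to a parsed response can be swapped in.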

This gives me a list of all alternative IDs (like: S0960982219315106) of Elsevier articles. With each ID I can build the URL of the article’s abstract page: https://www.sciencedirect.com/science/article/abs/pii/S0960982219315106

Getting HTML Abstract Page

Using the Pentaho HTTP Client, it’s now possible to get the HTML of this abstract page.

https://www.sciencedirect.com/science/article/abs/pii/S0960982219315106

It’s important to send a User-Agent HTTP header like: “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5)”; otherwise the Pentaho client will be routed to an error page.
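The same request can be sketched in Python with the standard library. The helper names abstract_url, build_request and fetch_html are hypothetical, and whether ScienceDirect accepts this exact header value for every client is an assumption, not something the code can guarantee.

```python
import urllib.request

# Header value taken from the text above
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5)"


def abstract_url(pii):
    """Build the abstract page URL from an Elsevier ID such as S0960982219315106."""
    return f"https://www.sciencedirect.com/science/article/abs/pii/{pii}"


def build_request(pii):
    """Prepare the request with a browser-like User-Agent header."""
    return urllib.request.Request(abstract_url(pii), headers={"User-Agent": USER_AGENT})


def fetch_html(pii, timeout=30):
    """Download the abstract page and return it as one string."""
    with urllib.request.urlopen(build_request(pii), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling fetch_html("S0960982219315106") would return the whole page in one string, which matches the "whole HTML in one field" setup used in the next step.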

Extracting data from abstract HTML

Elsevier makes it very easy to extract information about the article without going deep into the HTML-Structure of ScienceDirect. Within the HTML there’s a whole section with structured data in JSON:

view-source:https://www.sciencedirect.com/science/article/abs/pii/S0960982219315106

With the whole HTML loaded into one field, the following regex can be used to extract the JSON block:

.*<script type="application/json" data-iso-key="_0">(.*)</script>
<iframe style="display: none.*
Pentaho Step: Regex evaluation (Content Tab: Activate “Enable dotall mode” & “Enable multiline mode”)
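Outside of PDI, the extraction could look like the following Python sketch. It is a slightly different technique from the step above: it searches for the script tag with a non-greedy group instead of matching the whole page up to the iframe. The DOTALL flag plays the role of “Enable dotall mode”. The sample HTML and its keys are invented for illustration.

```python
import json
import re

# Non-greedy group: stop at the first closing </script> after the opening tag
JSON_BLOCK = re.compile(
    r'<script type="application/json" data-iso-key="_0">(.*?)</script>',
    re.DOTALL,  # let "." also match line breaks inside the JSON block
)


def extract_json_block(html):
    """Return the parsed JSON block, or None if the page has no such script tag."""
    match = JSON_BLOCK.search(html)
    return json.loads(match.group(1)) if match else None


# Invented page fragment, only to show the idea
sample_html = (
    '<html><body>\n'
    '<script type="application/json" data-iso-key="_0">{"article":\n'
    '{"pii": "S0960982219315106"}}</script>\n'
    '<iframe style="display: none"></iframe></body></html>'
)
print(extract_json_block(sample_html))
```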

Extracting data from JSON

Having separated the JSON data block from the rest of the HTML, we can now pass it to the JSON Input step. There is plenty of information in it, much more than you would see in the front end.

Exploring Json-Data and getting JSON-Path via https://jsonpathfinder.com/

For the moment I’m mostly interested in the author and affiliation data:

Extracting Affiliations

In the JSON we can extract:

  1. whole affiliation section
  2. individual affiliation
  3. id, label, name (textfn) of affiliation
JSON Path for affiliations

Extracting Authors

Extracting the authors works similarly to extracting the affiliations.

  1. whole author section: $..[?(@.#name=='author')]
  2. individual author: $.$.id ; $.$.orcid ; $..$$.[?(@.#name=='given-name')]._
  3. label: $.$$.[?(@.#name=='cross-ref')].$$.[?(@.#name=='sup')]._
JSON-Path for authors
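To make the paths above easier to follow, here is a rough Python equivalent that walks the parsed JSON. It assumes the node layout the paths imply: "#name" holds the element name, "$" the attributes, "$$" the child nodes and "_" the text. The sample tree, its wrapper keys and its values are invented, and the function names are hypothetical.

```python
def find_nodes(node, name):
    """Yield every dict in the tree whose "#name" equals name (like $..[?(@.#name=='...')])."""
    if isinstance(node, dict):
        if node.get("#name") == name:
            yield node
        for value in node.values():
            yield from find_nodes(value, name)
    elif isinstance(node, list):
        for item in node:
            yield from find_nodes(item, name)


def children(node, name):
    """Return the direct child nodes (in "$$") with the given "#name"."""
    return [c for c in node.get("$$", []) if isinstance(c, dict) and c.get("#name") == name]


def author_rows(tree):
    """Return one row per author with id, orcid, given name and labels."""
    rows = []
    for author in find_nodes(tree, "author"):
        attrs = author.get("$", {})
        given = children(author, "given-name")
        # label path: cross-ref child, then its sup child, then the text in "_"
        labels = [sup.get("_") for ref in children(author, "cross-ref")
                  for sup in children(ref, "sup")]
        rows.append({
            "id": attrs.get("id"),
            "orcid": attrs.get("orcid"),
            "given_name": given[0].get("_") if given else None,
            "labels": labels,
        })
    return rows


# Invented sample tree, only to show the assumed node layout
sample_tree = {"authors": {"content": [{"#name": "author-group", "$$": [
    {"#name": "author", "$": {"id": "au1", "orcid": "0000-0000-0000-0000"}, "$$": [
        {"#name": "given-name", "_": "Ada"},
        {"#name": "cross-ref", "$": {"refid": "aff1"}, "$$": [{"#name": "sup", "_": "a"}]},
        {"#name": "cross-ref", "$": {"refid": "aff2"}, "$$": [{"#name": "sup", "_": "b"}]},
    ]},
]}]}}
print(author_rows(sample_tree))
```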

An author can have multiple labels/refids, emails and contributor roles. To get those flat in one row, separated by commas, a “Group by” step can be used, with the result brought back into the stream via the “Stream lookup” step:
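The idea behind this group-and-lookup combination can be sketched in a few lines of Python. The field names author_id and label and the sample rows are hypothetical. The sketch only mirrors what the two steps do; it is not how PDI implements them.

```python
from collections import defaultdict


def flatten_values(rows, key="author_id", value="label", sep=", "):
    """Group rows by key and join the values into one string (the "Group by" part)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row[value])
    return {k: sep.join(v) for k, v in grouped.items()}


def lookup(authors, flat, key="author_id", target="labels"):
    """Attach the joined string to each author row (the "Stream lookup" part)."""
    return [{**author, target: flat.get(author[key], "")} for author in authors]


# Hypothetical rows: author au1 has two labels, au2 has one
label_rows = [
    {"author_id": "au1", "label": "a"},
    {"author_id": "au1", "label": "b"},
    {"author_id": "au2", "label": "a"},
]
authors = [{"author_id": "au1"}, {"author_id": "au2"}]
print(lookup(authors, flatten_values(label_rows)))
```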

Merge affiliations and authors

In a third step, authors and affiliations are merged by the label to get a list of authors with their affiliations:

Extracted authors and affiliations from Elsevier articles
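The merge itself is a plain lookup by label. This is a minimal sketch in Python, assuming one row per author and label on one side and one row per affiliation label on the other. All field names and sample values are invented.

```python
def merge_by_label(author_labels, affiliations):
    """Attach the affiliation name to every author/label row via the shared label."""
    by_label = {aff["label"]: aff["name"] for aff in affiliations}
    return [
        {
            "author": row["author"],
            "label": row["label"],
            # None marks a label without a matching affiliation
            "affiliation": by_label.get(row["label"]),
        }
        for row in author_labels
    ]


# Invented sample rows
author_labels = [
    {"author": "Ada", "label": "a"},
    {"author": "Ada", "label": "b"},
    {"author": "Ben", "label": "a"},
]
affiliations = [
    {"label": "a", "name": "Institute A"},
    {"label": "b", "name": "Institute B"},
]
for row in merge_by_label(author_labels, affiliations):
    print(row)
```

Labels only make sense within one article, so in a run over many articles the lookup would also need the article ID as part of the key.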

Posted by Christian Gutknecht on June 29, 2020 (2 minutes). Categories: Group by, JSON Input, Regex evaluation, REST client, Statistics, Stream lookup.
