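You can extract a pattern into an additional field in Pentaho with the “Regex Evaluation” step. Enable “Create fields for capture groups” so that the part of the match in parentheses is written to its own field.
Example regex to extract an arXiv ID (four digits, a dot, then four or five digits): .*(\d{4}\.\d{4,5}).*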
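
To check what the pattern captures before configuring the step, it can be tried outside Pentaho. The following is a minimal Python sketch, not Pentaho code: the function name `extract_arxiv_id` and the sample strings are made up for illustration, and it assumes the step matches the regex against the whole field value, which is why `fullmatch` is used here.

```python
import re

# Same pattern as in the text; the parentheses form capture group 1.
ARXIV_ID = re.compile(r".*(\d{4}\.\d{4,5}).*")


def extract_arxiv_id(text):
    """Return the captured arXiv ID, or None if the text contains none.

    Hypothetical helper for illustration only.
    """
    # fullmatch requires the pattern to cover the entire string, which is
    # why the pattern is wrapped in a leading and a trailing ".*".
    match = ARXIV_ID.fullmatch(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_arxiv_id("https://arxiv.org/abs/1706.03762"))
    print(extract_arxiv_id("no identifier here"))
```

Because the leading `.*` is greedy, a value that contains several IDs yields the last one. Python's `.` also does not match newlines by default, so this sketch only handles single-line values.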
