Split field to rows with lookahead regex

Given an HTML text with alternating paragraphs, where every second paragraph actually belongs to the one before it:

<p>Simon Burger</p>
<p><a href="https://www.simonburger.com">Website SB</a></p>
<p>Petra Ohnsorg</p>
<p><a href="https://www.petraohnsorg.com">Website PO</a></p>

I want to split the text so that there is one row for each person. This can be achieved by using a regex with a negative lookahead:

<p>(?!<)

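This splits only at a <p> that is not immediately followed by “<”, that is, at the paragraphs containing a name. Each link paragraph therefore stays in the same row as the name before it.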
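To check the pattern outside the transformation, here is a minimal sketch using Python's re module instead of the split step itself. The names html, SPLIT_PATTERN and split_into_rows are made up for this illustration, and the way empty parts are dropped is an assumption that may differ from what the step does.

```python
import re

# Sample input from above: a name paragraph followed by its link paragraph.
html = """<p>Simon Burger</p>
<p><a href="https://www.simonburger.com">Website SB</a></p>
<p>Petra Ohnsorg</p>
<p><a href="https://www.petraohnsorg.com">Website PO</a></p>"""

# Match "<p>" only when the next character is NOT "<" (negative lookahead).
# The lookahead consumes nothing, so only "<p>" itself is used as the delimiter.
SPLIT_PATTERN = re.compile(r"<p>(?!<)")


def split_into_rows(text):
    """Split text at the name paragraphs and drop empty parts (hypothetical helper)."""
    parts = SPLIT_PATTERN.split(text)
    return [part.strip() for part in parts if part.strip()]


rows = split_into_rows(html)
for row in rows:
    print(repr(row))
```

This gives one row per person. Each row starts with the name, because the matched “<p>” is consumed as the delimiter, and still contains the closing tag and the link paragraph.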

The fields of the rows created this way can then be retrieved by using another regex:

Execute a shell script

To trigger a shell script or a terminal command after a transformation, you have to create a job (this is not available within a transformation). In the following scenario I wanted to convert an HTML file to XML using tidy.

So I define a job in which the file is created, add the step “Execute a shell script…” and then enter the tidy command in its “Script” tab:

tidy -asxhtml -numeric < file_old.html > file_new.xml

Simple terminal command

Provided tidy hasn’t failed, “file_old.html” has now been converted to “file_new.xml” in your job directory.