The Linux Rain Linux General/Gaming News, Reviews and Tutorials

Liferea hack: add links to ABC (Australia) news items

By Bob Mesibov, published 21/02/2018 in Tutorials


Most of the RSS feeds in my Liferea RSS reader have external links in their brief, descriptive texts, like this one from The Guardian newspaper:

News feeds from the Australian Broadcasting Corporation (ABC), Australia's national broadcaster, don't have links:

To see the online ABC news article corresponding to a headline in Liferea's browser, you have to double-click on the headline, or right-click on it and choose "Open In Browser". This works because the article's URL is included in the RSS file, although you can't see it.

To get that link into the descriptive text, I wrote a simple hack with AWK, described in detail below. Please note that this hack requires GNU AWK (gawk 4) on your system, because my command uses backreferences through the gensub function, which is a gawk extension that some other AWK versions don't provide.

To add links to an ABC news feed, right-click on the feed in the left-hand Liferea pane (the list of feeds) and choose Properties. On the Source tab, check "Use conversion filter", then paste the following AWK one-liner into the "Convert using:" box and click "OK".

awk '/<item>/ {a=NR+2} NR==a {b=gensub(/<link>(.*)<\/link>/,"<a href=\"\\1\">Link to article</a>","g"); c=a+3} NR==c {$0=gensub(/(<p>.*)(<\/p)/,"\\1 <br />"b"\\2", "g")} 1'

You may need to restart Liferea, but the ABC items will now have a "Link to article" hyperlink in each item's descriptive text,

and when you click on that link the article will open in the Liferea browser:

Liferea's filter

A conversion filter in Liferea acts just like a command in a pipeline in a shell. The RSS XML file is fed to the filter as standard input and the standard output is command-modified XML which is read by the Liferea XML engine. Although there are ready-made scripts for modifying feeds (see the Liferea documentation), you actually only need a modifying command or commands for your filter.
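To make the pipeline analogy concrete, here is a minimal sketch run in a shell rather than in Liferea. The two-line feed fragment is made up for illustration, and the filter, awk '1', simply prints every line unchanged, so it stands in for the simplest possible conversion filter.

```shell
# Hypothetical stand-in for the XML that Liferea would send to the filter.
feed='<rss>
</rss>'

# The filter reads standard input and writes standard output.
# awk '1' prints every line unchanged, so this is a do-nothing filter.
filtered=$(printf '%s\n' "$feed" | awk '1')
printf '%s\n' "$filtered"
```

Any command that behaves this way (XML in on standard input, XML out on standard output) should be usable in the same position.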

Structure of the RSS XML

My hack takes advantage of the consistent structure of the ABC's XML (in early 2018). Here's the structure of every individual news item in the file:

The URL for the online article is inside the "link" markup on line 3 of the news item, and the text displayed by Liferea in the right-hand viewing pane is on line 6 of the item, within the "p" paragraph markup in the "description" section. All we need to do is copy the URL from line 3 and put it into a hyperlink at the end of the line 6 text. For this kind of text-processing, AWK is the tool of choice.
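The line positions described above can be checked in a shell with a small sketch. The item below is hypothetical: only the positions of the "link" line (line 3) and the "p" line (line 6) follow the description, while the other lines, the URL and the text are invented for illustration.

```shell
# Hypothetical item: only the positions of <link> (line 3) and <p> (line 6)
# follow the article; every other line and all values are invented.
item='<item>
<title>Example headline</title>
<link>https://example.org/news/1</link>
<guid>example-1</guid>
<description>
<p>Example summary.</p>
</description>
</item>'

# Print the two lines the hack cares about, prefixed with their line numbers.
lines=$(printf '%s\n' "$item" | awk 'NR==3 || NR==6 {print NR": "$0}')
printf '%s\n' "$lines"
```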

Step 1

To find an "item" line, I use an AWK regex search:

/<item>/

When that line is found, AWK stores its line number (NR) plus 2 in a variable, "a":

/<item>/ {a=NR+2}

Step 2

AWK continues reading, line by line. It makes no changes until it reaches the second line after the "item" line:

NR==a
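The two pieces so far can be tried on their own in a shell. In this sketch the feed fragment is hypothetical (the layout, with the "link" line two lines below "<item>", follows the article; the content is invented). Because NR==a is given no action here, AWK falls back on its default action and prints the matching line.

```shell
# Hypothetical feed fragment; the layout (link two lines below <item>)
# follows the article, the content is invented.
feed='<channel>
<item>
<title>First</title>
<link>https://example.org/news/1</link>
</item>
<item>
<title>Second</title>
<link>https://example.org/news/2</link>
</item>
</channel>'

# a is set again at each <item> line; NR==a with no action prints that line.
links=$(printf '%s\n' "$feed" | awk '/<item>/ {a=NR+2} NR==a')
printf '%s\n' "$links"
```

Only the two "link" lines are printed, one per item, which shows that "a" picks up a new value each time an "item" line is found.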

AWK now uses its gensub function to replace the whole "link" element, capturing the URL inside it,

gensub(/<link>(.*)<\/link>/

with the correct HTML for a hyperlink and the link text "Link to article":

gensub(/<link>(.*)<\/link>/,"<a href=\"\\1\">Link to article</a>"

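Note that the forward slash in "</link>" and the double quotes in <a href="[the link]"> have to be escaped, and that the URL itself appears as a backreference to the parenthesised group, written "\\1" inside the quoted replacement string. The third argument, "g", tells gensub to replace every match (global), and because no fourth argument is given, gensub operates on the whole "link" line by default. Unlike sub and gsub, gensub returns the new string and leaves the line itself unchanged.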

The marked-up hyperlink is stored in the variable "b". AWK then counts another 3 lines ahead, storing that next line number in the variable "c":

{b=gensub(/<link>(.*)<\/link>/,"<a href=\"\\1\">Link to article</a>","g"); c=a+3}
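The substitution that builds "b" can be tried by itself in a shell. This sketch assumes GNU AWK is installed under the name gawk, and the "link" line fed to it is a hypothetical example.

```shell
# Assumes GNU AWK is installed as gawk; gensub is a gawk extension.
# The link line below is a hypothetical example.
b=$(printf '%s\n' '<link>https://example.org/news/1</link>' |
  gawk '{print gensub(/<link>(.*)<\/link>/,"<a href=\"\\1\">Link to article</a>","g")}')
printf '%s\n' "$b"
```

The output should be the complete hyperlink markup with the URL from the "link" line inside the href attribute.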

Step 3

AWK continues reading. When it reaches the third line after the one from which it built the hyperlink

NR==c

it does another substitution, this time actually replacing the line:

$0=gensub...

In the output of the substitution, the opening "<p>" tag and the description text that follows it are stored in the backreference "\1". That's followed by a space and an HTML line break (<br />), then the variable "b" (outside the quoted bits in the output, otherwise it would just appear as the letter "b"), then the start of the closing "p" tag ("</p") that's been stored in the backreference "\2". The final ">" of the closing tag is outside the match, so it's left in place.

{$0=gensub(/(<p>.*)(<\/p)/,"\\1 <br />"b"\\2", "g")}

The final part of the command is a "1", a pattern that is always true. With no action attached to it, AWK applies its default action and prints the current line, so every line of the file reaches the output, modified or not. AWK tests each pattern in the command against every line, but it changes nothing until it finds an "<item>" line or the line numbers derived from the previous "item" line number. When AWK reaches a new "item" line and starts working, the variables "a", "b" and "c" are overwritten in turn with new values.
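To see the whole command at work outside Liferea, it can be run in a shell on a sample item. This is a sketch under two assumptions: GNU AWK is installed under the name gawk, and the item is hypothetical (only the positions of the "link" and "p" lines follow the article).

```shell
# Hypothetical item; only the positions of <link> (line 3) and <p> (line 6)
# follow the article. Assumes GNU AWK is installed as gawk.
item='<item>
<title>Example headline</title>
<link>https://example.org/news/1</link>
<guid>example-1</guid>
<description>
<p>Example summary.</p>
</description>
</item>'

# The full one-liner from the article, reading the item on standard input.
out=$(printf '%s\n' "$item" |
  gawk '/<item>/ {a=NR+2} NR==a {b=gensub(/<link>(.*)<\/link>/,"<a href=\"\\1\">Link to article</a>","g"); c=a+3} NR==c {$0=gensub(/(<p>.*)(<\/p)/,"\\1 <br />"b"\\2", "g")} 1')
printf '%s\n' "$out"
```

All eight lines come back out, and only line 6 differs: the hyperlink now sits between the summary text and the closing "p" tag. One caveat: because "b" is reused inside the second replacement string, an "&" in the URL would be expanded by gensub to the matched text, so a link containing "&" would need extra escaping.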

Not a "forever" hack

If the ABC changes its RSS XML structure, my AWK command may no longer work, but the command should be tweakable to a new structure. And the hack will become unnecessary if the ABC web devs get around to putting links in the descriptive text, like most other news feeds.



About the Author

Bob Mesibov is Tasmanian, retired and a keen Linux tinkerer.

Tags: scripting liferea rss hacks awk commandline tutorials