Sunday, June 08, 2008

Tweaking an RSS Feed in Python

I've been teaching myself a bit of Python by the just-in-time learning method: start programming, wait for the interpreter to complain, and go check the reference manual; keep the API docs on your hard disk and sift through them when you need a probably-existing function. Recently, I wanted to write a very simple script to manipulate some XML (see below) and I was surprised (though it has been noted before) at the relatively confused state of the art in Python and XML.

First of all, the Python XML API documentation is more or less "go read the W3C standards." Which is fine, but... make the easy stuff easy, people.

Secondly, the supposedly standard PyXML library has been deprecated in some form or fashion, so some of the examples from the tutorial I was working with have stopped working. (In particular, the xml.dom.ext module has gone somewhere. Where, I do not know.)

So, in the interest of producing more and better code samples for future lazy programmers, here's how I managed to solve my little problem.

The Problem: Twitter's RSS feeds don't provide clickable links.

The Solution: A script suitable for use as a "conversion filter" in Liferea (and maybe other feed readers too, who knows?). The script should:


  1. Read and parse an RSS/Atom feed from the standard input.

  2. Grab the text from the feed items and "linkify" it.

  3. Print the modified feed on the standard output.
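
The "linkify" step (item 2) can be tried on a plain string, without any feed parsing. This is only a minimal sketch: the two patterns are copied from the script in this post, while the linkify helper and the sample text are made up for illustration.

```python
import re

# Patterns copied from the script in this post.
URL_RE = re.compile(r"((https?|s?ftp|ssh)\:\/\/[^\"\s\<\>]*[^.,;'\">\:\s\<\>\)\]\!])")
USER_RE = re.compile(r"\B@([_a-z0-9]+)")


def linkify(text):
    """Wrap URLs and @user references in HTML anchors (hypothetical helper)."""
    # URLs first, so the anchors added for @user refs are not re-scanned.
    text = URL_RE.sub(r'<a href="\1">\1</a>', text)
    text = USER_RE.sub(r'@<a href="http://twitter.com/\1">\1</a>', text)
    return text


# Made-up sample item text.
sample = "chris: reading http://example.com/page, thanks @alan_h!"
print(linkify(sample))
```

The trailing comma after the URL and the "!" after the username stay outside the generated links. The \B in front of the @ keeps something like me@example.com from being treated as a user reference. Note that the username pattern only matches lowercase letters, digits and underscores.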


Easy, right? Well, yeah. The only tricky bit was using the right namespace references for the Atom feed, but again that's only because I refuse to read and comprehend the W3C specs for something so insignificant. I ended up using the lxml library, because it worked. (The script would be about 50% shorter if I hadn't added a command-line option --strip-username to strip the username from the beginning of items in a single-user feed, and a third shorter than that if it only handled RSS or Atom and not both.)
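
To show why the namespace mapping matters, here is a small sketch. It uses the standard library's xml.etree.ElementTree rather than lxml, purely so it runs without extra packages, and the two-entry feed is invented for the example. The idea is the same as in the script: Atom elements live in the http://www.w3.org/2005/Atom namespace, so a path with bare element names finds nothing.

```python
import re
import xml.etree.ElementTree as ET

# Prefix "n" is arbitrary; only the namespace URI has to match the feed.
ATOM_NS = {"n": "http://www.w3.org/2005/Atom"}

# Hypothetical feed, made up for illustration.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>chris: hello world</title></entry>
  <entry><title>chris: second post</title></entry>
</feed>"""

root = ET.fromstring(FEED)

# Without a namespace mapping, the unqualified names match nothing.
unqualified = root.findall("entry/title")

# With the prefix bound to the Atom namespace, both titles are found.
titles = root.findall("n:entry/n:title", ATOM_NS)

for node in titles:
    # Same pattern the script uses to drop a leading "username: ".
    node.text = re.sub(r"^[A-Za-z0-9_]+:\s*", "", node.text)

stripped = [node.text for node in titles]
print(len(unqualified), stripped)
```

This prints 0 ['hello world', 'second post']: the unqualified path matches no elements, while the namespaced one finds both titles and the username prefix is stripped from each.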

Here's the code, in toto. (You can download it here.)

#! /usr/bin/env python

from sys import stdin, stdout
from lxml import etree
from re import sub
from optparse import OptionParser

doc = etree.parse(stdin)

def addlinks(path, namespaces=None):
    for node in doc.xpath(path, namespaces=namespaces):
        # Turn URLs into HREFs
        node.text = sub(r"((https?|s?ftp|ssh)\:\/\/[^\"\s\<\>]*[^.,;'\">\:\s\<\>\)\]\!])",
                        r'<a href="\1">\1</a>',
                        node.text)
        # Turn @ refs into links to the user page
        node.text = sub(r"\B@([_a-z0-9]+)",
                        r'@<a href="http://twitter.com/\1">\1</a>',
                        node.text)

def stripuser(path, namespaces=None):
    for node in doc.xpath(path, namespaces=namespaces):
        node.text = sub(r"^[A-Za-z0-9_]+:\s*", "", node.text)

parser = OptionParser(usage="%prog [options] SITE")
parser.add_option("-s", "--strip-username",
                  action="store_true",
                  dest="strip_username",
                  default=False,
                  help="Strip the username from item title and description")
(opts, args) = parser.parse_args()

# For RSS feeds
addlinks("//rss/channel/item/description")
# For Atom feeds
addlinks("//n:feed/n:entry/n:content",
         {'n': 'http://www.w3.org/2005/Atom'})

if opts.strip_username:
    # RSS title/description
    stripuser("//rss/channel/item/title")
    stripuser("//rss/channel/item/description")
    # Atom title/description
    stripuser("//n:feed/n:entry/n:title",
              namespaces={'n': 'http://www.w3.org/2005/Atom'})
    stripuser("//n:feed/n:entry/n:content",
              namespaces={'n': 'http://www.w3.org/2005/Atom'})

doc.write(stdout)


If there are any Python programmers in the audience and I'm doing something stupid or terribly non-idiomatic, I'd be glad to know.

Thanks in part to Alan H, whose Yahoo Pipe was almost good enough (it doesn't handle authenticated feeds, as far as I can tell) and from whom I ripped off the regular expressions.

[UPDATE] Script changed per first commenter.

2 comments:

Anonymous said...

Interesting post.

Please note that you should not specify default values for function arguments that are mutable objects - e.g. the "ns" argument being assigned an empty dictionary object. The reason is that the default value is evaluated only once, when the function is defined, rather than on every call.

See the "important warning" in the official Python tutorial.

Chris said...

Thanks for the tip. I don't think "ns={}" was a real problem (because it wasn't being mutated), but I've fixed the script anyway.
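
For anyone who hasn't run into it, the pitfall discussed in these comments can be sketched as follows. The function names are made up for illustration and are not part of the script.

```python
def append_bad(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # so every call that omits "bucket" shares the same list.
    bucket.append(item)
    return bucket


def append_good(item, bucket=None):
    # Using None as the default gives a fresh list on each call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


def lookup(key, table={}):
    # A mutable default that is only read, never mutated: harmless in
    # practice, but easy to break with a later edit.
    return table.get(key)


first = append_bad("a")[:]    # copy the result before the next call
second = append_bad("b")[:]
print(first, second)
print(append_good("a"), append_good("b"))
```

The first print shows ['a'] ['a', 'b'], because the shared default list keeps growing. The second shows ['a'] ['b']. The lookup function is the "ns={}" situation: nothing goes wrong as long as the default is never modified, but the None idiom removes the risk.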