I've been teaching myself a bit of Python by the just-in-time learning method: start programming, wait for the interpreter to complain, and go check the reference manual; keep the API docs on your hard disk and sift through them when you need a probably-existing function. Recently, I wanted to write a very simple script to manipulate some XML (see below) and I was surprised (though it has been noted before) at the relatively confused state of the art in Python and XML.
First of all, the Python XML API documentation is more or less "go read the W3C standards." Which is fine, but... make the easy stuff easy, people.
Secondly, the supposedly-standard PyXML library has been deprecated in some form or fashion such that some of the examples from the tutorial I was working with have stopped working (in particular, the xml.dom.ext
module has gone somewhere. Where, I do not know).
So, in the interest of producing more and better code samples for future lazy programmers, here's how I managed to solve my little problem.
The Problem: Twitter's RSS feeds don't provide clickable links
The Solution: A script suitable for use as a "conversion filter" in Liferea (and maybe other feed readers too, who knows?). The script should:
- Read and parse an RSS/Atom feed from the standard input.
- Grab the text from the feed items and "linkify" them
- Print the modified feed on the standard output.
Easy, right? Well, yeah. The only tricky bit was using the right namespace references for the Atom feed, but again that's only because I refuse to read and comprehend the W3C specs for something so insignificant. I ended up using the
lxml library, because it worked. (The script would be about 50% shorter if I hadn't added a command-line option
--strip-user
to strip the username from the beginning of items in a single-user feed and a third shorter than that if it only handled RSS or Atom and not both.)
Here's the code, in toto. (You can
download it here.)
#! /usr/bin/env python
from sys import stdin, stdout
from lxml import etree
from re import sub
from optparse import OptionParser
doc = etree.parse(stdin)
def addlinks(path,namespaces=None):
for node in doc.xpath(path,namespaces=namespaces):
# Turn URLs into HREFs
node.text = sub("((https?|s?ftp|ssh)\:\/\/[^\"\s\<\>]*[^.,;'\">\:\s\<\>\)\]\!])",
"<a href=\"\\1\">\\1</a>",
node.text)
# Turn @ refs into links to the user page
node.text = sub("\B@([_a-z0-9]+)",
"@<a href=\"http://twitter.com/\\1\">\\1</a>",
node.text)
def stripuser(path,namespaces=None):
for node in doc.xpath(path,namespaces=namespaces):
node.text = sub("^[A-Za-z0-9_]+:\s*","",node.text)
parser = OptionParser(usage = "%prog [options] SITE")
parser.add_option("-s", "--strip-username",
action="store_true",
dest="strip_username",
default=False,
help="Strip the username from item title and description")
(opts,args) = parser.parse_args()
# For RSS feeds
addlinks("//rss/channel/item/description")
# For Atom feeds
addlinks( "//n:feed/n:entry/n:content",
{'n': 'http://www.w3.org/2005/Atom'} )
if opts.strip_username:
# RSS title/description
stripuser( "//rss/channel/item/title" )
stripuser( "//rss/channel/item/description" )
# Atom title/description
stripuser( "//n:feed/n:entry/n:title",
namespaces = {'n': 'http://www.w3.org/2005/Atom'} )
stripuser( "//n:feed/n:entry/n:content",
namespaces = {'n': 'http://www.w3.org/2005/Atom'} )
doc.write(stdout)
If there are any Python programmers in the audience and I'm doing something stupid or terribly non-idiomatic, I'd be glad to know.
Thanks in part to
Alan H whose
Yahoo Pipe was
almost good enough (it doesn't handle authenticated feeds, as far as I can tell) and from whom I ripped off the regular expressions.
[UPDATE] Script changed per first commenter.