ProcrastiBlog: Software

So Muxtape is a pretty cool site, but a little frustrating. If a friend posts a really cool mixtape (maybe you know somebody who just barely entered the Aughties), it would be nice to be able to download it and save it, just like all those old cassette mixtapes sentimentally rotting underneath your bed.

Enter muxrip. This simple Ruby script takes the name of the mixtape, downloads it, and creates a playlist for you in M3U or iTunes format. (Acknowledgments: the script basically just adds some polish to this previous effort.)
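For the curious, the M3U half of that is a very small file format. The sketch below is in Python rather than muxrip's Ruby, and it is not muxrip's code: the function name write_m3u, the track tuples, and the file names are all made up for illustration. It writes the common "extended M3U" layout, where each track gets an #EXTINF line (length in seconds, then a title) followed by the file name.

```python
import os
import tempfile


def write_m3u(path, tracks):
    """Write an extended M3U playlist.

    `tracks` is a list of (filename, title, seconds) tuples. That layout
    is an assumption for this sketch, not muxrip's actual data model.
    """
    lines = ["#EXTM3U"]
    for filename, title, seconds in tracks:
        # -1 is the conventional "length unknown" value in #EXTINF lines
        lines.append("#EXTINF:%d,%s" % (seconds, title))
        lines.append(filename)
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")


# Hypothetical track list, purely for illustration
tracks = [
    ("01-first-song.mp3", "Some Artist - First Song", 215),
    ("02-second-song.mp3", "Another Artist - Second Song", -1),
]
playlist_path = os.path.join(tempfile.mkdtemp(), "mixtape.m3u")
write_m3u(playlist_path, tracks)

with open(playlist_path) as handle:
    playlist_text = handle.read()
print(playlist_text)
```

Most players resolve the file names relative to the playlist's own directory, so keeping the playlist next to the downloaded tracks is the simplest arrangement.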

PLEASE: Use this script responsibly. It would be a shame for Muxtape to get shut down.

ALSO: I wouldn't be surprised if this suddenly stopped working. It depends on elements of the page layout and URL scheme that might (almost certainly will) change without notice.

I've been teaching myself a bit of Python by the just-in-time learning method: start programming, wait for the interpreter to complain, and go check the reference manual; keep the API docs on your hard disk and sift through them when you need a probably-existing function. Recently, I wanted to write a very simple script to manipulate some XML (see below) and I was surprised (though it has been noted before) at the relatively confused state of the art in Python and XML.

First of all, the Python XML API documentation is more or less "go read the W3C standards." Which is fine, but... make the easy stuff easy, people.

Secondly, the supposedly standard PyXML library has been deprecated in some form or fashion, so some of the examples from the tutorial I was working with have stopped working. (In particular, the xml.dom.ext module has gone somewhere. Where, I do not know.)
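As a point of comparison, here is a minimal sketch of parsing and pretty-printing with nothing but the standard library's xml.dom.minidom. It assumes the broken tutorial examples were using xml.dom.ext mainly for pretty-printing, which may or may not be the case; the sample document is made up.

```python
from xml.dom import minidom

# A made-up document, just to have something to parse
source = "<feed><entry><title>hello</title></entry></feed>"

dom = minidom.parseString(source)

# Walk the tree with plain DOM calls
titles = [node.firstChild.data for node in dom.getElementsByTagName("title")]

# toprettyxml() returns the document as an indented string
pretty = dom.toprettyxml(indent="  ")
print(pretty)
```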

So, in the interest of producing more and better code samples for future lazy programmers, here's how I managed to solve my little problem.

The Problem: Twitter's RSS feeds don't provide clickable links.

The Solution: A script suitable for use as a "conversion filter" in Liferea (and maybe other feed readers too, who knows?). The script should:

Read and parse an RSS/Atom feed from the standard input.

Grab the text from the feed items and "linkify" it.

Print the modified feed on the standard output.

Easy, right? Well, yeah. The only tricky bit was using the right namespace references for the Atom feed, but again that's only because I refuse to read and comprehend the W3C specs for something so insignificant. I ended up using the lxml library, because it worked. (The script would be about 50% shorter if I hadn't added a command-line option, --strip-username, to strip the username from the beginning of items in a single-user feed, and a third shorter again if it handled only RSS or only Atom rather than both.)
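To make the namespace gotcha concrete, here is a small sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra packages; lxml's xpath() accepts the same kind of prefix-to-URI map through its namespaces argument. The feed is a made-up two-element example.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# A made-up, minimal Atom feed
atom = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<entry><content>hello</content></entry>"
    "</feed>"
)
root = ET.fromstring(atom)

# Without a namespace the path matches nothing: the element is really
# named {http://www.w3.org/2005/Atom}entry, not plain "entry".
unqualified = root.findall("entry/content")

# Bind any prefix you like to the Atom namespace URI and use it in the path.
qualified = root.findall("n:entry/n:content", {"n": ATOM_NS})

print(len(unqualified), len(qualified))
```

The prefix itself ("n" here) is arbitrary; only the URI it maps to has to match the feed's xmlns declaration. RSS 2.0 elements carry no namespace, which is why the RSS paths need no map at all.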

Here's the code, in toto. (You can download it here.)

#! /usr/bin/env python

from sys import stdin, stdout
from lxml import etree
from re import sub
from optparse import OptionParser

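# Parse from the byte stream (stdin.buffer on Python 3, plain stdin on Python 2)
doc = etree.parse(getattr(stdin, "buffer", stdin))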

def addlinks(path,namespaces=None):
    for node in doc.xpath(path,namespaces=namespaces):
        if not node.text:
            continue  # empty element: nothing to linkify
        # Turn URLs into HREFs (raw strings keep the regex backslashes intact)
        node.text = sub(r"((https?|s?ftp|ssh)://[^\"\s<>]*[^.,;'\">:\s<>)\]!])",
                        r'<a href="\1">\1</a>',
                        node.text)
        # Turn @ refs into links to the user page (usernames may contain capitals)
        node.text = sub(r"\B@([_A-Za-z0-9]+)",
                        r'@<a href="http://twitter.com/\1">\1</a>',
                        node.text)

def stripuser(path,namespaces=None):
    for node in doc.xpath(path,namespaces=namespaces):
        if node.text:  # skip empty elements
            node.text = sub(r"^[A-Za-z0-9_]+:\s*","",node.text)

# The feed arrives on standard input; no positional arguments are used
parser = OptionParser(usage = "%prog [options] < FEED")
parser.add_option("-s", "--strip-username", 
                  action="store_true", 
                  dest="strip_username",
                  default=False,
                  help="Strip the username from item title and description")
(opts,args) = parser.parse_args()

# Prefix map for the Atom XPath queries
ATOM = {'n': 'http://www.w3.org/2005/Atom'}

# For RSS feeds
addlinks("//rss/channel/item/description")
# For Atom feeds
addlinks("//n:feed/n:entry/n:content", ATOM)

if opts.strip_username:
    # RSS title/description
    stripuser("//rss/channel/item/title")
    stripuser("//rss/channel/item/description")
    # Atom title/content
    stripuser("//n:feed/n:entry/n:title", namespaces=ATOM)
    stripuser("//n:feed/n:entry/n:content", namespaces=ATOM)

# lxml writes bytes, so use the byte stream (stdout.buffer on Python 3)
doc.write(getattr(stdout, "buffer", stdout))
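To see what the substitutions do to a single item, here is a standalone sketch that applies them to a plain string instead of an XML node. The patterns are adapted from the script (written as raw strings, with the username class widened to accept capitals, which is an assumption about intent); the helper names linkify and strip_user and the sample text are made up.

```python
import re

URL_RE = re.compile(r"((https?|s?ftp|ssh)://[^\"\s<>]*[^.,;'\">:\s<>)\]!])")
USER_RE = re.compile(r"\B@([_A-Za-z0-9]+)")
PREFIX_RE = re.compile(r"^[A-Za-z0-9_]+:\s*")


def linkify(text):
    """Wrap URLs and @usernames in anchor tags."""
    text = URL_RE.sub(r'<a href="\1">\1</a>', text)
    return USER_RE.sub(r'@<a href="http://twitter.com/\1">\1</a>', text)


def strip_user(text):
    """Drop a leading "username: " prefix, if there is one."""
    return PREFIX_RE.sub("", text)


# Made-up item text
sample = "alice: reading http://example.com/post, thanks @bob"
result = linkify(strip_user(sample))
print(result)
# reading <a href="http://example.com/post">http://example.com/post</a>, thanks @<a href="http://twitter.com/bob">bob</a>
```

Note that the trailing comma stays outside the link: the last character class in the URL pattern refuses to end a match on punctuation. In the script itself the result is assigned to node.text, so lxml escapes the angle brackets when it serializes the feed, which is the usual way to carry HTML inside an RSS description.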

If there are any Python programmers in the audience and I'm doing something stupid or terribly non-idiomatic, I'd be glad to know.

Thanks in part to Alan H, whose Yahoo Pipe was almost good enough (it doesn't handle authenticated feeds, as far as I can tell), and from whom I ripped off the regular expressions.

[UPDATE] Script changed per first commenter.

Monday, June 23, 2008

Ripping a Muxtape

Sunday, June 08, 2008

Tweaking an RSS Feed in Python
