Monday, October 2, 2017

Subclassing HTMLParser Class in Python 2

Using the HTMLParser class (https://docs.python.org/2/library/htmlparser.html) in Python 2 is rather easy if you don't need to pass a parameter to your subclass for custom processing of the HTML tags. But what if you do? This is rather trivial to do in Python 3, as seen here. The problem with Python 2 is that if you follow the "normal" way of invoking the parent class constructor, as explained at https://stackoverflow.com/questions/2399307/how-to-invoke-the-super-constructor , you encounter an error like this: TypeError: super() argument 1 must be type, not classobj.

Now, how do we fix that error? The culprit is explained at https://stackoverflow.com/questions/1713038/super-fails-with-error-typeerror-argument-1-must-be-type-not-classobj#1713052. In short, HTMLParser is an old-style class in Python 2, and super() only accepts new-style classes. However, that answer doesn't give us a satisfactory fix, because you would need to modify the HTMLParser class itself for it to work, and I prefer not to do that. This is where Python's built-in type function comes to the rescue. The code below shows how to subclass HTMLParser in Python 2 with a constructor parameter. It might not be pretty (it's a rather quick hack), but it works.

from HTMLParser import HTMLParser
from htmlentitydefs import name2codepoint

class ImgHtmlParser(HTMLParser):
    def __init__(self, path):
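        # HTMLParser is an old-style class in Python 2, so super(ImgHtmlParser, self)
        # raises TypeError; type(self) is accepted instead.
        super(type(self), self).__init__()
        # reset() sets up the parser state (it is all HTMLParser.__init__ does).
        self.reset()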
        self.fed = []
        self.download_path = path
        print "ImgHtmlParser constructor"

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            print "Start tag:", tag
            for attr in attrs:
                print "     attr:", attr
                if attr[0] == "data-fullres-src":
                    print "image URL: " + attr[1]
                    print "Download Path = " + self.download_path 

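I used type(self) in place of the derived class's literal name. Note that type is a built-in function, not a keyword. Because HTMLParser is an old-style class, type(self) evaluates to the generic instance type rather than ImgHtmlParser, which is why super() accepts it. The resulting __init__() call then resolves to object.__init__ and does not actually run HTMLParser.__init__, so the explicit self.reset() call (which is all HTMLParser.__init__ does in Python 2) is what really initializes the parser. The trick is not foolproof either: with new-style classes, super(type(self), self) recurses endlessly as soon as the class has a child class. In this case ImgHtmlParser doesn't have one, so we're OK.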
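
For comparison, here is a minimal sketch of the more conventional alternative: calling the parent constructor by name, which does not involve super() at all and so works with old-style classes too. The class name ImgUrlCollector, the image_urls list, the placeholder path and the example.com URL are all made up for illustration; they are not part of the code above. The try/except import is only there so that the same snippet also runs on Python 3.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name
except ImportError:
    from html.parser import HTMLParser  # Python 3 module name


class ImgUrlCollector(HTMLParser):
    """Hypothetical variant of ImgHtmlParser that stores URLs instead of printing them."""

    def __init__(self, path):
        # Call the parent constructor directly by name. No super() is involved,
        # so this works whether HTMLParser is an old-style or new-style class.
        HTMLParser.__init__(self)
        self.download_path = path
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        # attrs is a list of (name, value) pairs
        for name, value in attrs:
            if name == 'data-fullres-src':
                self.image_urls.append(value)


parser = ImgUrlCollector('/tmp/images')  # placeholder download path
parser.feed('<p><img data-fullres-src="http://example.com/a.jpg" alt="a"></p>')
parser.close()
```

After feeding the sample markup, parser.image_urls holds the single URL from the data-fullres-src attribute. Since the parent's __init__ already calls reset(), no separate self.reset() call is needed with this approach.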