Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
458 views
in Technique[技术] by (71.8m points)

python - Iteratively parsing HTML (with lxml?)

I'm currently trying to iteratively parse a very large HTML document (I know.. yuck) to reduce the amount of memory used. The problem I'm having is that I'm getting XML syntax errors such as:

lxml.etree.XMLSyntaxError: Attribute name redefined, line 134, column 59

This then causes everything to stop.

Is there a way to iteratively parse HTML without choking on syntax errors?

At the moment I'm extracting the line number from the XML syntax error exception, removing that line from the document, and then restarting the process. Seems like a pretty disgusting solution. Is there a better way?

Edit:

This is what I'm currently doing:

context = etree.iterparse(tfile, events=('start', 'end'), html=True)
in_table = False
header_row = True
while context:
    try:
        event, el = context.next()

        # do something

        # remove old elements
        while el.getprevious() is not None:
            del el.getparent()[0]

    except etree.XMLSyntaxError, e:
        print e.msg
        lineno = int(re.search(r'line (d+),', e.msg).group(1))
        remove_line(tfilename, lineno)
        tfile = open(tfilename)
        context = etree.iterparse(tfile, events=('start', 'end'), html=True)
    except KeyError:
        print 'oops keyerror'
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

The perfect solution ended up being Python's very own HTMLParser [docs].

This is the (pretty bad) code I ended up using:

class MyParser(HTMLParser):
    def __init__(self):
        self.finished = False
        self.in_table = False
        self.in_row = False
        self.in_cell = False
        self.current_row = []
        self.current_cell = ''
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if not self.in_table:
            if tag == 'table':
                if ('id' in attrs) and (attrs['id'] == 'dgResult'):
                    self.in_table = True
        else:
            if tag == 'tr':
                self.in_row = True
            elif tag == 'td':
                self.in_cell = True
            elif (tag == 'a') and (len(self.current_row) == 7):
                url = attrs['href']
                self.current_cell = url


    def handle_endtag(self, tag):
        if tag == 'tr':
            if self.in_table:
                if self.in_row:
                    self.in_row = False
                    print self.current_row
                    self.current_row = []
        elif tag == 'td':
            if self.in_table:
                if self.in_cell:
                    self.in_cell = False
                    self.current_row.append(self.current_cell.strip())
                    self.current_cell = ''

        elif (tag == 'table') and self.in_table:
            self.finished = True

    def handle_data(self, data):
        if not len(self.current_row) == 7:
            if self.in_cell:
                self.current_cell += data

With that code I could then do this:

parser = MyParser()
for line in myfile:
    parser.feed(line)

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...