Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

parsing - python [lxml] - cleaning out html tags

    import sys
    from lxml.html.clean import Cleaner

    def clean(text):
        try:
            cleaner = Cleaner(scripts=True, embedded=True, meta=True,
                              page_structure=True, links=True, style=True,
                              remove_tags=['a', 'li', 'td'])
            cleaned = cleaner.clean_html(text)
            # Report the change in length (negative means the output is shorter).
            print(len(cleaned) - len(text))
            return cleaned
        except Exception:
            print('Error in clean_html')
            print(sys.exc_info())
            return text

I put together the above (ugly) code as one of my initial forays into Python land. I'm trying to use the lxml Cleaner to clean a couple of HTML pages so that in the end I'm left with just the text and nothing else. But try as I might, the above doesn't appear to work as such: I'm still left with a substantial amount of markup (and it doesn't appear to be broken HTML), particularly links, which aren't getting removed despite the arguments I pass in remove_tags and links=True.

Any idea what's going on? Perhaps I'm barking up the wrong tree with lxml? I thought this was the way to go for HTML parsing in Python.

1 Answer

The solution from David concatenates the text with no separator:

   import lxml.html
   # html_string holds the HTML source of the page
   document = lxml.html.document_fromstring(html_string)
   # internally does: etree.XPath("string()")(document)
   print(document.text_content())

But this one helped me, because it joins the text nodes with the separator I needed:

   from lxml import etree
   print("\n".join(etree.XPath("//text()")(document)))

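To make the difference between the two approaches concrete, here is a minimal sketch that runs both on the same input. The sample markup in html_string is made up for illustration, and the whitespace-stripping step at the end is an extra assumption about what "clean" output should look like, not part of either solution above.

```python
import lxml.html

# Hypothetical sample markup, invented for this illustration.
html_string = (
    "<html><body><p>Hello <a href='#'>world</a></p>"
    "<ul><li>one</li><li>two</li></ul></body></html>"
)

document = lxml.html.document_fromstring(html_string)

# text_content() glues every text node together with no separator.
joined = document.text_content()

# //text() returns the text nodes one by one, so the caller picks the separator.
pieces = document.xpath("//text()")

# Optional extra (an assumption, not in the original answer):
# drop whitespace-only nodes and trim the rest before joining.
separated = "\n".join(t.strip() for t in pieces if t.strip())

print(joined)
print(separated)
```

Because //text() hands back a list, it is also the natural place to filter or post-process individual text nodes before joining them, which text_content() does not allow.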