parsing - python [lxml] - cleaning out html tags

Question

Welcome To Ask or Share your Answers For Others

parsing - python [lxml] - cleaning out html tags

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

parsing - python [lxml] - cleaning out html tags

from lxml.html.clean import clean_html, Cleaner
    def clean(text):
        try:        
            cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_structure=True, links=True, style=True,
                      remove_tags = ['a', 'li', 'td'])
            print (len(cleaner.clean_html(text))- len(text))
            return cleaner.clean_html(text) 
        except:
            print 'Error in clean_html'
            print sys.exc_info()
            return text

I put together the above (ugly) code as my initial forays into python land. I'm trying to use lxml cleaner to clean out a couple of html pages, so in the end i am just left with the text and nothing else - but try as i might, the above doesnt appear to work as such, i'm still left with a substial amount of markup (and it doesnt appear to be broken html), and particularly links, which aren't getting removed, despite the args i use in remove_tags and links=True

any idea whats going on, perhaps im barking up the wrong tree with lxml ? i thought this was the way to go with html parsing in python?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-17T03:05:44+0000

solution from David concatenates the text with no separator:

   import lxml.html
   document = lxml.html.document_fromstring(html_string)
   # internally does: etree.XPath("string()")(document)
   print document.text_content()

but this one helped me - concatenation the way I needed:

   from lxml import etree
   print "
".join(etree.XPath("//text()")(document))

Categories

parsing - python [lxml] - cleaning out html tags

parsing - python [lxml] - cleaning out html tags

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags