python - BeautifulSoup get_text does not strip all tags and JavaScript

Question

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am trying to use BeautifulSoup to get text from web pages.

Below is a script I've written to do so. It takes two arguments: the first is the input HTML or XML file, the second is the output file.

import sys
from bs4 import BeautifulSoup

def stripTags(s):
    # No parser is named here, so BeautifulSoup picks the best one installed.
    return BeautifulSoup(s).get_text()

def stripTagsFromFile(inFile, outFile):
    # Read raw bytes so BeautifulSoup can detect the document's encoding.
    with open(inFile, "rb") as src:
        text = stripTags(src.read())
    with open(outFile, "w", encoding="utf-8") as dst:
        dst.write(text)

def main(argv):
    if len(argv) != 3:
        print("Usage:", argv[0], "input.html output.txt")
        return 1
    stripTagsFromFile(argv[1], argv[2])
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv))

Unfortunately, for many web pages, for example http://www.greatjobsinteaching.co.uk/career/134112/Education-Manager-Location, I get something like this (I'm showing only the first few lines):

html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
    Education Manager  Job In London With  Caleeda | Great Jobs In Teaching

var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-15255540-21']);
_gaq.push(['_trackPageview']);
_gaq.push(['_trackPageLoadTime']);

Is there anything wrong with my script? I tried passing 'xml' as the second argument to the BeautifulSoup constructor, as well as 'html5lib' and 'lxml', but it didn't help. Is there an alternative to BeautifulSoup that would work better for this task? All I want is to extract the text that a browser would render for this web page.

Any help will be much appreciated.

1 Answer

深蓝 · Answer 1 · 2021-10-23T21:27:33+0000

nltk's clean_html() is quite good at this!

Assuming that you already have your HTML stored in a variable html, like

html = urllib.urlopen(address).read()

then just use

import nltk
clean_text = nltk.clean_html(html)

UPDATE

Support for clean_html() and clean_url() has been dropped in NLTK 3: calling them now raises NotImplementedError with a message pointing to BeautifulSoup's get_text() instead. Please use BeautifulSoup for now... it's very unfortunate.

An example on how to achieve this is on this page:

BeatifulSoup4 get_text still has javascript
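
One common way to do this with BeautifulSoup is to remove the script and style elements from the tree before calling get_text(). Below is a minimal sketch of that idea, assuming the bs4 package and Python's built-in "html.parser"; the names visible_text and sample are made up here for illustration and do not come from either library.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the text of `html` with script and style content removed.

    `visible_text` is a hypothetical helper name, not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose contents are never shown as page text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Trim each line and discard the ones that end up empty.
    lines = (line.strip() for line in soup.get_text().splitlines())
    return "\n".join(line for line in lines if line)


# A tiny stand-in for a real page (assumed markup, for illustration only).
sample = """<html><head><title>Education Manager</title>
<script>var _gaq = _gaq || [];</script></head>
<body><p>Apply now</p></body></html>"""

if __name__ == "__main__":
    print(visible_text(sample))
```

This only removes script and style elements. Elements hidden with CSS or filled in by JavaScript at load time are not handled, so the result approximates what a browser shows rather than reproducing it exactly.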
