I am learning to use both the re module and the urllib module in Python and am attempting to write a simple web scraper. Here's the code I've written to scrape just the title of websites:
#!/usr/bin/python
import urllib
import re
urls=["http://google.com","https://facebook.com","http://reddit.com"]
i=0
these_regex="<title>(.+?)</title>"
pattern=re.compile(these_regex)
while i<len(urls):
    htmlfile=urllib.urlopen(urls[i])
    htmltext=htmlfile.read()
    titles=re.findall(pattern,htmltext)
    print titles
    i+=1
This gives the correct output for Google and Reddit but not for Facebook - like so:
['Google']
[]
['reddit: the front page of the internet']
This is because, as I found, the title tag on Facebook's page is written as follows: <title id="pageTitle">. To accommodate the additional id= attribute, I modified the these_regex variable as follows: these_regex="<title.+?>(.+?)</title>". But this gives the following output:
[]
['Welcome to Facebook \xe2\x80\x94 Log in, sign up or learn more']
[]
How would I combine both patterns so that the regex matches the title tag whether or not it carries additional attributes?
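For reference, here is a minimal sketch of the kind of combined pattern I am after. It is not a definitive solution. The second attempt fails on a plain <title> because .+? needs at least one character before the closing >, so the sketch uses [^>]* instead, which also accepts zero characters. The names title_pattern and extract_titles, and the hard-coded sample HTML strings standing in for the downloaded pages, are made up for illustration. A regex like this is only an approximation and will not cope with every page the way a real HTML parser would.

```python
import re

# \b keeps the pattern from matching tags that merely start with "title".
# [^>]* accepts zero or more attribute characters, so both <title> and
# <title id="pageTitle"> match. re.DOTALL lets a title span several lines.
title_pattern = re.compile(r"<title\b[^>]*>(.*?)</title>",
                           re.IGNORECASE | re.DOTALL)

# Hypothetical sample pages standing in for the HTML returned by urlopen().
samples = [
    "<html><head><title>Google</title></head></html>",
    '<html><head><title id="pageTitle">Welcome to Facebook</title></head></html>',
    "<html><head><title>reddit: the front page of the internet</title></head></html>",
]


def extract_titles(htmltext):
    """Return every title found in htmltext, stripped of outer whitespace."""
    return [title.strip() for title in title_pattern.findall(htmltext)]


for htmltext in samples:
    print(extract_titles(htmltext))
```

With these sample strings, each page yields a one-element list, including the one whose title tag has an id attribute.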