Regex pattern in Python for parsing HTML title tags

Question

asked Oct 17, 2021 in Technique by 深蓝 (71.8m points)

I am learning to use both the re module and the urllib module in Python and am attempting to write a simple web scraper. Here's the code I've written to scrape just the titles of websites:

#!/usr/bin/python

import urllib
import re

urls=["http://google.com","https://facebook.com","http://reddit.com"]

i=0

these_regex="<title>(.+?)</title>"
pattern=re.compile(these_regex)

while(i<len(urls)):
        htmlfile=urllib.urlopen(urls[i])
        htmltext=htmlfile.read()
        titles=re.findall(pattern,htmltext)
        print titles
        i+=1

This gives the correct output for Google and Reddit but not for Facebook - like so:

['Google']
[]
['reddit: the front page of the internet']

This is because, as I found, the title tag on Facebook's page looks like this: <title id="pageTitle">. To accommodate the additional id=, I modified the these_regex variable as follows: these_regex="<title.+?>(.+?)</title>". But this gives the following output:

[]
['Welcome to Facebook \xe2\x80\x94 Log in, sign up or learn more']
[]

How would I combine both so that the pattern handles any additional attributes inside the title tag?

1 Answer

深蓝 · Answer 1 · 2021-10-17T01:11:04+0000

It is recommended that you use Beautiful Soup or another HTML parser to parse HTML, but if you really want a regex, the following pattern does the job.

The regex code:

<title.*?>(.+?)</title>

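How it works:

.*? lazily matches zero or more characters between <title and >, so the pattern accepts both a bare <title> and a tag carrying attributes such as <title id="pageTitle">. The earlier .+? required at least one character there, so on a bare <title> it consumed the tag's own >, which is why the plain tags stopped matching. The group (.+?) then captures the title text up to the first </title>.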
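A minimal offline sketch of the three patterns, written for Python 3. The strings in SAMPLE_PAGES are hand-written stand-ins for the three pages (assumed markup, not fetched from the sites), and the names STRICT, ONE_OR_MORE, ZERO_OR_MORE and find_titles are made up for this illustration. Only the results for the zero-or-more pattern are printed.

```python
import re

# Hand-written stand-ins for the three pages (assumed markup, not fetched).
SAMPLE_PAGES = [
    "<html><head><title>Google</title></head></html>",
    (
        '<html><head><title id="pageTitle">'
        "Welcome to Facebook - Log In, Sign Up or Learn More"
        "</title></head></html>"
    ),
    "<html><head><title>reddit: the front page of the internet</title></head></html>",
]

# Original pattern: only matches a bare <title> tag.
STRICT = re.compile(r"<title>(.+?)</title>")
# Second attempt: .+? needs at least one character before '>',
# so a bare <title> no longer matches.
ONE_OR_MORE = re.compile(r"<title.+?>(.+?)</title>")
# Pattern from the answer: .*? also allows zero characters,
# so attributes become optional.
ZERO_OR_MORE = re.compile(r"<title.*?>(.+?)</title>")


def find_titles(pattern, pages):
    """Return the list of captured titles for each page, in order."""
    return [pattern.findall(page) for page in pages]


for titles in find_titles(ZERO_OR_MORE, SAMPLE_PAGES):
    print(titles)
```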

Produces:

['Google']
['Welcome to Facebook - Log In, Sign Up or Learn More']
['reddit: the front page of the internet']
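For the parser route recommended at the top of this answer, here is a rough sketch. Beautiful Soup is a third-party package, so this swaps in the standard library's html.parser module instead; it is not the Beautiful Soup API. The names TitleParser and extract_title are invented for this example, and the sketch assumes the page has already been decoded to a str. Unlike the regex, it does not care about attributes on the tag or about a title that spans several lines (by default . in a regex does not match a newline).

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes such as id= are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it.
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        return "".join(self._chunks).strip()


def extract_title(html):
    """Return the text of the first <title> element, or '' if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


print(extract_title('<head><title id="pageTitle">Welcome to Facebook</title></head>'))
```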
