Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - BeautifulSoup and AJAX table problem

I am making a script that scrapes the games from the Team Liquid database of international StarCraft 2 games (http://www.teamliquid.net/tlpd/sc2-international/games).

However, I have come across a problem. My script loops through all the pages, but I think the Team Liquid site uses some kind of AJAX to update the table, and when I use BeautifulSoup I can't get the right data.

So I loop through these pages:

http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-1-1-DESC

http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-2-1-DESC

http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-3-1-DESC

http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-4-1-DESC etc...

When you open these yourself you see different pages, but my script keeps getting the same first page every time. I think this is because, when you open one of the other pages, a loading indicator appears briefly while the table is updated to show the games for that page. So my guess is that BeautifulSoup is too fast and needs to wait until the table has finished loading and updating.

So my question is: how can I make sure it gets the updated table?

I currently use this code to get the contents of the table, after which I write the contents to a .csv file:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(url).read().lower()
bs = BeautifulSoup(html, "html.parser")
# Find the table by its id, then collect all of its rows.
table = bs.find("table", id="tblt_table")
rows = table.find_all("tr")

1 Answer


When you scrape a site that uses AJAX, it's best to look at what the JavaScript code actually does. In many cases it simply retrieves XML or HTML, which can be even easier to scrape than the non-AJAX content. It just requires reading some source code.

In your case, the site retrieves the HTML code for the table control by itself (instead of refreshing the whole page) from a special URL and dynamically replaces it in the browser DOM. Looking at http://www.teamliquid.net/tlpd/tabulator/ajax.js, you'd see this URL is formatted like this:

http://www.teamliquid.net/tlpd/tabulator/update.php?tabulator_id=1811&tabulator_page=1&tabulator_order_col=1&tabulator_order_desc=1&tabulator_Search&tabulator_search=

So all you need to do is scrape this URL directly with BeautifulSoup and increment the tabulator_page parameter each time you want the next page.
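
As a rough sketch of that approach, something like the following could work. The helper names (build_page_url, fetch_rows, scrape) are made up for illustration, the tabulator_id of 1811 is simply copied from the example URL above and may differ for the table you want, and the bare tabulator_Search flag from that URL is left out on the assumption that the server does not need it. It also assumes the response is an HTML fragment containing ordinary tr/td elements, which I have not verified.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://www.teamliquid.net/tlpd/tabulator/update.php"


def build_page_url(page, tabulator_id=1811):
    """Return the update.php URL for one page of the table.

    tabulator_id defaults to the value in the example URL (an assumption).
    """
    params = [
        ("tabulator_id", tabulator_id),
        ("tabulator_page", page),
        ("tabulator_order_col", 1),
        ("tabulator_order_desc", 1),
        ("tabulator_search", ""),
    ]
    return BASE_URL + "?" + urlencode(params)


def fetch_rows(page):
    """Download one page fragment and return its <tr> elements."""
    # Imported here so build_page_url can be used without bs4 installed.
    from bs4 import BeautifulSoup

    html = urlopen(build_page_url(page)).read()
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("tr")


def scrape(max_pages):
    """Collect the cell text of every row on pages 1..max_pages."""
    all_rows = []
    for page in range(1, max_pages + 1):
        for tr in fetch_rows(page):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:  # skip header rows that contain only <th> cells
                all_rows.append(cells)
    return all_rows
```

Calling scrape(4) would then request pages 1 to 4 in turn, and the resulting list of rows can be passed to csv.writer. The key point is that the page number is now part of the query string, so each request really asks the server for a different page.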

