Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


dom - html div nesting? using google fetchurl

I'm trying to grab a table from the following webpage

http://www.bloomberg.com/markets/companies/country/hong-kong/

I have some sample code which was kindly provided by Phil Bozak here:

grabbing table from html using Google script

which grabs the table for this website:

http://www.airchina.com.cn/www/en/html/index/ir/traffic/

As you can see from Phil's code, there are a lot of getElement() calls in it. Looking at the HTML for the Air China website, the table appears to be nested four levels deep. Is that why there is a string of .getElement() calls?

Now I look at the source code for the Bloomberg page, and it is loaded with "div" elements...

The question is: can someone show me how to grab the table from the Bloomberg page?

A brief explanation of the theory would also be useful. Thanks a bunch.


1 Answer


Let's flip your question upside down, and start with the theory. Methodology might be a better word for it.

You want to get at something specific in a structured page. To do that, you either need a way to zap right to the element (which can be done if it's labeled in a unique way that we can access), OR you need to navigate the structure more-or-less manually. You already know how to look at the source of a page, so you're familiar with this step. Here's a screenshot of Firefox Inspector, highlighting the element we're interested in.

Screenshot - Firefox Inspector

We can see the hierarchy of elements that leads to the table: html, body, div, div, div.ticker, table.ticker_data. We can also see the source:

<table class="ticker_data">

Neat! It's labeled! Unfortunately, that class info gets dropped when we process the HTML in our script. Bummer. If it was id="ticker_data" instead, we could use the getElementByVal() utility from this answer to reach it, and give ourselves some immunity from future restructuring of the page. Put a pin in that - we'll come back to it.

It can help to visualize this in the debugger. Here's a utility script for that - run it in debug mode, and you'll have your HTML document laid out to explore:

/**
 * Debug-run this in the editor to be able to explore the structure of web pages.
 *
 * Set target to the page you're interested in.
 */
function pageExplorer() {
  var target = "http://www.bloomberg.com/markets/companies/country/hong-kong/";
  var pageTxt = UrlFetchApp.fetch(target).getContentText();
  var pageDoc = Xml.parse(pageTxt,true);
  debugger;  // Pause in debugger - explore pageDoc
}

This is what our page looks like in the debugger:

Screenshot - debugger

You might be wondering what the numbered elements are, since you don't see them in the source. When there are multiples of an element type at the same level in an XML document, the parser presents them as an array, numbered 0..n. Thus, when we see 0 under a div in the debugger, that's telling us that there are multiple <div> tags in the HTML source at that level, and we can access them as an array, for example .div[0].
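To make the array behaviour concrete, here is a minimal sketch in plain JavaScript. It is only a stand-in: it does not call the Apps Script Xml service, and groupChildren() and the sample ids are hypothetical names made up for illustration. It mimics the rule described above, where a single child stays a plain object and repeated tags at the same level become an array.

```javascript
// Plain-JavaScript stand-in for a parsed page; this is NOT the Apps Script Xml service.
// groupChildren() is a hypothetical helper: it collects child elements by tag name,
// keeping a lone child as an object and turning repeated tags into an array.
function groupChildren(children) {
  const grouped = Object.create(null);
  for (const child of children) {
    const name = child.tag;
    if (!(name in grouped)) {
      grouped[name] = child;                   // first of its kind: plain object
    } else if (Array.isArray(grouped[name])) {
      grouped[name].push(child);               // third or later: extend the array
    } else {
      grouped[name] = [grouped[name], child];  // second: promote to an array
    }
  }
  return grouped;
}

// Made-up children of a <body>: two <div> tags and one <table>.
const body = groupChildren([
  { tag: "div", id: "header" },
  { tag: "div", id: "content" },
  { tag: "table", id: "only-table" },
]);

console.log(Array.isArray(body.div));   // true: reach them as body.div[0], body.div[1]
console.log(body.div[1].id);            // "content"
console.log(Array.isArray(body.table)); // false: a lone <table> is just body.table
```

This is why some steps in a path carry an index and others don't: the index only appears where the source has sibling tags of the same type.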

Ok, theory behind us, let's go ahead and see how we can access the table by brute-force.

Knowing the hierarchy, including the div arrays shown in the debugger, we could do this, à la Phil's previous answer. I'll do some weird indenting to illustrate the document structure:

...
var target = "http://www.bloomberg.com/markets/companies/country/hong-kong/";
var pageTxt = UrlFetchApp.fetch(target).getContentText();
var pageDoc = Xml.parse(pageTxt,true);
var table = pageDoc.getElement()
             .getElement("body")
               .getElements("div")[0]      // 0-th div under body, shown in debugger
                 .getElements("div")[5]    // 5-th div under there
                   .getElement("div")      // another div
                     .getElement("table"); // finally, our table

As a much more compact alternative to all those .getElement() calls, we can navigate using dot notation.

var table = pageDoc.getElement().body.div[0].div[5].div.table;

And that's that.
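One drawback of a long chain is that when the page changes, it fails without telling you which step went missing. As a rough sketch (plain JavaScript over a made-up object tree, not the Xml service; walkPath() and samplePage are hypothetical names), you could walk the same kind of path one step at a time and report where it breaks:

```javascript
// Hypothetical helper: follow a list of property names and array indexes,
// throwing a descriptive error at the first step that does not exist.
function walkPath(root, path) {
  let node = root;
  path.forEach(function (step, i) {
    const next = (node === null || node === undefined) ? undefined : node[step];
    if (next === undefined) {
      throw new Error("Nothing at step " + i + " ('" + step + "') of path " + path.join("."));
    }
    node = next;
  });
  return node;
}

// Made-up stand-in for a parsed document, much smaller than the real page.
const samplePage = {
  body: { div: [ { div: [ {}, { div: { table: { rows: 3 } } } ] } ] }
};

const foundTable = walkPath(samplePage, ["body", "div", 0, "div", 1, "div", "table"]);
console.log(foundTable.rows); // 3

try {
  // The stand-in has no div[5] at this level, so this fails at step 4.
  walkPath(samplePage, ["body", "div", 0, "div", 5, "div", "table"]);
} catch (e) {
  console.log(e.message);
}
```

The same idea should carry over to the real document, but the error handling there depends on how the parser reports missing elements, which this sketch does not model.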

Let's go back to that pinned idea. In the debugger, we can see that there are various attributes attached to elements. In particular, there's an "id" on that div[5] that contains the div that contains the table. Remember, in the source we saw "class" attributes, but note that they don't make it this far.

Screenshot - debugger 2

Still, the fact that a kindly programmer put this "id" in place means we can do this, with getDivById() from that earlier question:

var contentDiv = getDivById( pageDoc.getElement().body, 'content' );
var table = contentDiv.div.table;

If they move things around, we might still be able to find that table, without changing our code.
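To illustrate why an id lookup survives restructuring, here is a small sketch. It is not the getDivById() from the earlier question (its source isn't reproduced here); findById() is a hypothetical depth-first search over a plain-object stand-in, and both sample layouts are made up.

```javascript
// Hypothetical depth-first search: return the first object whose "id" matches,
// looking through nested objects and arrays; return null if nothing matches.
function findById(node, id) {
  if (node === null || typeof node !== "object") return null;
  if (Array.isArray(node)) {
    for (const item of node) {
      const hit = findById(item, id);
      if (hit) return hit;
    }
    return null;
  }
  if (node.id === id) return node;
  for (const key of Object.keys(node)) {
    const hit = findById(node[key], id);
    if (hit) return hit;
  }
  return null;
}

// Two made-up layouts: the 'content' div sits at different positions.
const layoutBefore = {
  body: { div: [ { div: [ { id: "nav" }, { id: "content", div: { table: { rows: 3 } } } ] } ] }
};
const layoutAfter = {
  body: { div: [ { id: "banner" }, { div: [ { id: "content", div: { table: { rows: 3 } } } ] } ] }
};

// The same lookup works on both, even though the index path changed.
console.log(findById(layoutBefore.body, "content").div.table.rows); // 3
console.log(findById(layoutAfter.body, "content").div.table.rows);  // 3
console.log(findById(layoutAfter.body, "missing"));                 // null
```

Note that the last two hops (.div.table) are still positional, so this only protects against changes above the labelled div.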

You already know what to do once you have the table element, so we're done here!

