I like this question. And because of that, I'll give a very thorough answer. For this, I'll use my favorite Requests library along with BeautifulSoup4. Porting over to Mechanize if you really want to use that is up to you. Requests will save you tons of headaches though.
First off, you're probably looking for a POST request. However, POST requests are often not needed if a search function brings you right away to the page you're looking for. So let's inspect it, shall we?
When I land on the base URL, http://www.dailyfinance.com/
, I can do a simple check via Firebug or Chrome's inspect tool that when I put in CSCO or AAPL on the search bar and enable the "jump", there's a 301 Moved Permanently
status code. What does this mean?
In simple terms, I was transferred somewhere. The URL for this GET request is the following:
http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input=CSCO
Now, we test if it works with AAPL by using a simple URL manipulation.
import requests as rq
apl_tick = "AAPL"
url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input="
r = rq.get(url + apl_tick)
print r.url
The above gives the following result:
http://www.dailyfinance.com/quote/nasdaq/apple/aapl
[Finished in 2.3s]
See how the URL of the response changed? Let's take the URL manipulation one step further by looking for the /financial-ratios
page by appending the below to the above code:
new_url = r.url + "/financial-ratios"
p = rq.get(new_url)
print p.url
When ran, this gives is the following result:
http://www.dailyfinance.com/quote/nasdaq/apple/aapl
http://www.dailyfinance.com/quote/nasdaq/apple/aapl/financial-ratios
[Finished in 6.0s]
Now we're on the right track. I will now try to parse the data using BeautifulSoup. My complete code is as follows:
from bs4 import BeautifulSoup as bsoup
import requests as rq
apl_tick = "AAPL"
url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input="
r = rq.get(url + apl_tick)
new_url = r.url + "/financial-ratios"
p = rq.get(new_url)
soup = bsoup(p.content)
div = soup.find("div", id="clear").table
rows = table.find_all("tr")
for row in rows:
print row
I then try running this code, only to encounter an error with the following traceback:
File "C:Users
anashiDesktopest.py", line 13, in <module>
div = soup.find("div", id="clear").table
AttributeError: 'NoneType' object has no attribute 'table'
Of note is the line 'NoneType' object...
. This means our target div
does not exist! Egads, but why am I seeing the following?!
There can only be one explanation: the table is loaded dynamically! Rats. Let's see if we can find another source for the table. I study the page and see that there are scrollbars at the bottom. This might mean that the table was loaded inside a frame or was loaded straight from another source entirely and placed into a div
in the page.
I refresh the page and watch the GET requests again. Bingo, I found something that seems a bit promising:
A third-party source URL, and look, it's easily manipulable using the ticker symbol! Let's try loading it into a new tab. Here's what we get:
WOW! We now have the very exact source of our data. The last hurdle though is will it work when we try to pull the CSCO data using this string (remember we went CSCO -> AAPL and now back to CSCO again, so you're not confused). Let's clean up the string and ditch the role of www.dailyfinance.com
here completely. Our new url is as follows:
http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US=AAPL
Let's try using that in our final scraper!
from bs4 import BeautifulSoup as bsoup
import requests as rq
csco_tick = "CSCO"
url = "http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US="
new_url = url + csco_tick
r = rq.get(new_url)
soup = bsoup(r.content)
table = soup.find("div", id="clear").table
rows = table.find_all("tr")
for row in rows:
print row.get_text()
And our raw results for CSCO's financial ratios data is as follows:
Company
Industry
Valuation Ratios
P/E Ratio (TTM)
15.40
14.80
P/E High - Last 5 Yrs
24.00
28.90
P/E Low - Last 5 Yrs
8.40
12.10
Beta
1.37
1.50
Price to Sales (TTM)
2.51
2.59
Price to Book (MRQ)
2.14
2.17
Price to Tangible Book (MRQ)
4.25
3.83
Price to Cash Flow (TTM)
11.40
11.60
Price to Free Cash Flow (TTM)
28.20
60.20
Dividends
Dividend Yield (%)
3.30
2.50
Dividend Yield - 5 Yr Avg (%)
N.A.
1.20
Dividend 5 Yr Growth Rate (%)
N.A.
144.07
Payout Ratio (TTM)
45.00
32.00
Sales (MRQ) vs Qtr 1 Yr Ago (%)
-7.80
-3.70
Sales (TTM) vs TTM 1 Yr Ago (%)
5.50
5.60
Growth Rates (%)
Sales - 5 Yr Growth Rate (%)
5.51
5.12
EPS (MRQ) vs Qtr 1 Yr Ago (%)
-54.50
-51.90
EPS (TTM) vs TTM 1 Yr Ago (%)
-54.50
-51.90
EPS - 5 Yr Growth Rate (%)
8.91
9.04
Capital Spending - 5 Yr Growth Rate (%)
20.30
20.94
Financial Strength
Quick Ratio (MRQ)
2.40
2.70
Current Ratio (MRQ)
2.60
2.90
LT Debt to Equity (MRQ)
0.22
0.20
Total Debt to Equity (MRQ)
0.31
0.25
Interest Coverage (TTM)
18.90
19.10
Profitability Ratios (%)
Gross Margin (TTM)
63.20
62.50
Gross Margin - 5 Yr Avg
66.30
64.00
EBITD Margin (TTM)
26.20
25.00
EBITD - 5 Yr Avg
28.82
0.00
Pre-Tax Margin (TTM)
21.10
20.00
Pre-Tax Margin - 5 Yr Avg
21.60
18.80
Management Effectiveness (%)
Net Profit Margin (TTM)
17.10
17.65
Net Profit Margin - 5 Yr Avg
17.90
15.40
Return on Assets (TTM)
8.30
8.90
Return on Assets - 5 Yr Avg
8.90
8.00
Return on Investment (TTM)
11.90
12.30
Return on Investment - 5 Yr Avg
12.50
10.90
Efficiency
Revenue/Employee (TTM)
637,890.00
556,027.00
Net Income/Employee (TTM)
108,902.00
98,118.00
Receivable Turnover (TTM)
5.70
5.80
Inventory Turnover (TTM)
11.30
9.70
Asset Turnover (TTM)
0.50
0.50
[Finished in 2.0s]
Cleaning up the data is up to you.
One good lesson to learn from this scrape is not all data are contained in one page alone. It's pretty nice to see it coming from another static site. If it was produced via JavaScript or AJAX calls or the like, we would likely have some difficulties with our approach.
Hopefully you learned something from this. Let us know if this helps and good luck.