Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

html - Scrape Yahoo Finance Income Statement with Python

I'm trying to scrape data from income statements on Yahoo Finance using Python. Specifically, let's say I want Apple's most recent Net Income figure.

The data is structured in a bunch of nested HTML-tables. I am using the requests module to access it and retrieve the HTML.

I am using BeautifulSoup 4 to sift through the HTML-structure, but I can't figure out how to get the figure.

Here is a screenshot of the analysis with Firefox.

My code so far:

from bs4 import BeautifulSoup
import requests

myurl = "https://finance.yahoo.com/q/is?s=AAPL&annual"
html = requests.get(myurl).content
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

I tried using

all_strong = soup.find_all("strong")

And then get the 17th element, which happens to be the one containing the figure I want, but this seems far from elegant. Something like this:

all_strong[16].parent.next_sibling
...

Of course, the goal is to use BeautifulSoup to search for the Name of the figure I need (in this case "Net Income") and then grab the figures themselves in the same row of the HTML-table.

I would really appreciate any ideas on how to solve this, keeping in mind that I would like to apply the solution to retrieve a bunch of other data from other Yahoo Finance pages.

SOLUTION / EXPANSION:

The solution by @wilbur below worked, and I expanded upon it to get the values for any figure available on any of the financials pages (i.e., Income Statement, Balance Sheet, and Cash Flow Statement) for any listed company. My function is as follows:

import re
import sys


def periodic_figure_values(soup, yahoo_figure):

    values = []
    pattern = re.compile(yahoo_figure)

    title = soup.find("strong", text=pattern)    # works for the figures printed in bold
    if title:
        row = title.parent.parent
    else:
        title = soup.find("td", text=pattern)    # works for any other available figure
        if title:
            row = title.parent
        else:
            sys.exit("Invalid figure '" + yahoo_figure + "' passed.")

    cells = row.find_all("td")[1:]    # exclude the <td> with figure name
    for cell in cells:
        if cell.text.strip() != yahoo_figure:    # needed because some figures are indented
            # strip thousands separators; "(123)" is turned into "-123"
            str_value = cell.text.strip().replace(",", "").replace("(", "-").replace(")", "")
            if str_value == "-":    # a lone dash is treated as zero
                str_value = "0"
            value = int(str_value) * 1000    # the page values are multiplied by 1000
            values.append(value)

    return values

The yahoo_figure variable is a string. Obviously this has to be the exact same figure name as is used on Yahoo Finance. To pass the soup variable, I use the following function first:

import sys

import requests
from bs4 import BeautifulSoup


def financials_soup(ticker_symbol, statement="is", quarterly=False):

    # "is" = income statement, "bs" = balance sheet, "cf" = cash flow statement
    if statement not in ("is", "bs", "cf"):
        sys.exit("Invalid financial statement code '" + statement + "' passed.")

    url = "https://finance.yahoo.com/q/" + statement + "?s=" + ticker_symbol
    if not quarterly:
        url += "&annual"
    return BeautifulSoup(requests.get(url).text, "html.parser")

Sample usage -- I want to get the income tax expenses of Apple Inc. from the last available income statements:

print(periodic_figure_values(financials_soup("AAPL", "is"), "Income Tax Expense"))

Output: [19121000000, 13973000000, 13118000000]
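
The per-cell clean-up inside periodic_figure_values could also be pulled out into a small helper so it can be checked without fetching a page. This is only a sketch: the name parse_statement_cell and the multiplier parameter are made up for illustration, and it assumes cells look like '19,121,000', '(1,285,000)' or '-'.

```python
def parse_statement_cell(text, multiplier=1000):
    """Convert one table cell to an int (hypothetical helper, not part of bs4)."""
    cleaned = text.strip().replace(",", "")
    if cleaned == "-":
        # A lone dash is treated as zero, as in periodic_figure_values.
        return 0
    # Parentheses are read as a negative number, e.g. "(1,285,000)".
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    value = int(cleaned) * multiplier
    return -value if negative else value


print(parse_statement_cell("19,121,000"))
print(parse_statement_cell("(1,285,000)"))
print(parse_statement_cell("-"))
```

Keeping the multiplier as a parameter makes it easy to switch off the scaling if a page turns out to report raw values.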

You could also get the date of the end of the period from the soup and create a dictionary where the dates are the keys and the figures are the values, but this would make this post too long. So far this seems to work for me, but I am always thankful for constructive criticism.
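
A rough sketch of that dictionary idea, assuming the period-end labels have already been scraped into a list in the same order as the values. The function name figures_by_period is hypothetical, and the period labels below are placeholders rather than real dates from the page.

```python
def figures_by_period(period_ends, values):
    """Pair each period-end label with the figure reported for that period."""
    if len(period_ends) != len(values):
        raise ValueError("expected exactly one value per period")
    return dict(zip(period_ends, values))


# Placeholder labels; the real ones would come from the table header row.
period_ends = ["period 1", "period 2", "period 3"]
# Values taken from the sample output of periodic_figure_values.
values = [19121000000, 13973000000, 13118000000]

by_period = figures_by_period(period_ends, values)
print(by_period["period 1"])
```

The length check is there so that a header row and a value row that drift out of step fail loudly instead of being paired silently.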


1 Answer


This is made a little more difficult because "Net Income" is enclosed in a <strong> tag, so bear with me, but I think this works:

import re, requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/q/is?s=AAPL&annual'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
pattern = re.compile('Net Income')

title = soup.find('strong', text=pattern)
row = title.parent.parent # yes, yes, I know it's not the prettiest
cells = row.find_all('td')[1:] #exclude the <td> with 'Net Income'

values = [ c.text.strip() for c in cells ]

values, in this case, will contain the three table cells in that "Net Income" row. They can easily be converted to ints; I just left them as strings because I liked that they kept the ','.

In [10]: values
Out[10]: [u'53,394,000', u'39,510,000', u'37,037,000']
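
If ints are preferred, one way to convert them (a minimal sketch that assumes every cell holds only digits and commas) is:

```python
values = ['53,394,000', '39,510,000', '37,037,000']

# Drop the thousands separators, then convert each string to an int.
as_ints = [int(v.replace(',', '')) for v in values]
print(as_ints)  # [53394000, 39510000, 37037000]
```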

When I tested it on Alphabet (GOOG), it didn't work, I believe because Yahoo doesn't display an Income Statement for them (https://finance.yahoo.com/q/is?s=GOOG&annual). When I checked Facebook (FB), the values were returned correctly (https://finance.yahoo.com/q/is?s=FB&annual).
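
When the row is missing, soup.find returns None and title.parent raises an AttributeError, so a guard may be worth adding. Below is a minimal sketch that runs offline against a made-up HTML snippet. The markup is a simplified stand-in (the real page structure is assumed, not reproduced), row_values is a hypothetical name, and the string= argument assumes Beautiful Soup 4.4.0 or newer.

```python
import re
from bs4 import BeautifulSoup

# Made-up, heavily simplified stand-in for the statement table.
SAMPLE_HTML = """
<table>
  <tr><td>Some Other Row</td><td>1,000</td><td>2,000</td></tr>
  <tr><td><strong>Net Income</strong></td><td>53,394,000</td><td>39,510,000</td></tr>
</table>
"""


def row_values(soup, label):
    """Return the cell texts of the row titled `label`, or None if it is absent."""
    title = soup.find("strong", string=re.compile(re.escape(label)))
    if title is None:
        return None  # the page has no such figure
    row = title.find_parent("tr")
    if row is None:
        return None
    # Skip the first <td>, which holds the figure name.
    return [cell.get_text(strip=True) for cell in row.find_all("td")[1:]]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(row_values(soup, "Net Income"))
print(row_values(soup, "Gross Profit"))
```

Returning None leaves it to the caller to decide whether a missing statement is an error or something to skip.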

If you wanted to create a more dynamic script, you could use string formatting to format the url with whatever stock symbol you want, like this:

ticker_symbol = 'AAPL' # or 'FB' or any other ticker symbol
url = 'https://finance.yahoo.com/q/is?s={}&annual'.format(ticker_symbol)
