Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
189 views
in Technique[技术] by (71.8m points)

python - What's wrong with iShares SP 500 ETF's Excel file?

I am trying to parse iShares SP 500 ETF's Excel file, which looks like this:

<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<ss:Styles>
<ss:Style ss:ID="Default">
<ss:Alignment ss:Horizontal="Left"/>
</ss:Style>
...

It seems to be an old XLS Excel type file, but it is an XML file, yet xml.etree.ElementTree is complaining a lot.

I have tried:

import xml.etree.ElementTree as ET
tree = ET.parse(file_name)

and with encoding:

import xml.etree.ElementTree as ET
tree = ET.parse(file_name, parser=ET.XMLParser(recover=True))
import xml.etree.ElementTree as ET
tree = ET.parse(file_name, parser=ET.XMLParser(encoding='utf-8'))

error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 1
rb = xlrd.open_workbook(file_name, encoding_override='utf-8')
print(rb)

error:

xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'xefxbbxbfxefxbbxbf<?'

but none seems to work at all... could anyone guide me in the right direction?

question from:https://stackoverflow.com/questions/65895726/whats-wrong-with-ishares-sp-500-etfs-excel-file

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Your "XML" file is not well-formed:

  1. It has two BOM characters before the XML declaration.

    To remove (at least) one, see Convert UTF-8 with BOM to UTF-8 with no BOM in Python

  2. Remove the & or change it to &amp; on line 43.

    <ss:Data ss:Type="String">iShares Core S&P 500 ETF</ss:Data>
                                            ^
    
  3. Ditto for the & on line 117,188.

Without the above repairs, your file cannot be parsed as XML (because technically it isn't).

After the above repairs, you'll be able to parse it as XML, and Excel will be able to open it:

enter image description here

See also


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

57.0k users

...