You have a corrupted data file. If that character really is meant to be a U+00AD SOFT HYPHEN, then you are missing a 0xC2 byte:
>>> '\u00ad'.encode('utf8')
b'\xc2\xad'
Of all the possible UTF-8 encodings that end in 0xAD, a soft hyphen does make the most sense. However, it is indicative of a data set that may have other bytes missing. You just happened to hit one that matters.
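To see why that lone byte trips the decoder, here is a quick check: 0xAD is only valid in UTF-8 as a continuation byte, so on its own it is rejected, while a single-byte codec such as Latin-1 happily maps it to U+00AD:
>>> b'\xad'.decode('utf8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 0: invalid start byte
>>> b'\xad'.decode('latin-1')
'\xad'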
I'd go back to the source of this dataset and verify that the file was not corrupted when downloaded. Otherwise, using errors='replace' is a viable work-around, provided no delimiters (tabs, newlines, etc.) are missing.
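As a sketch of that work-around, every undecodable byte is then substituted with the U+FFFD replacement character instead of raising:
with open('/tmp/2017q1/tag.txt', encoding='utf8', errors='replace') as f:
    data = f.read()  # the bad 0xad byte becomes '\ufffd' in the result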
Another possibility is that the SEC is really using a different encoding for the file; in Windows Codepage 1252 and Latin-1, for example, 0xAD is the correct encoding of a soft hyphen. And indeed, when I download the same dataset directly (warning, large ZIP file linked) and open tag.txt, I can't decode the data as UTF-8:
>>> open('/tmp/2017q1/tag.txt', encoding='utf8').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte
>>> from pprint import pprint
>>> f = open('/tmp/2017q1/tag.txt', 'rb')
>>> f.seek(3583550)
3583550
>>> pprint(f.read(100))
(b'1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING AND FINANCING A'
 b'CTIVITIES:\t\nProceedsFromSaleOfIn')
There are two such non-ASCII characters in the file:
>>> f.seek(0)
0
>>> pprint([l for l in f if any(b > 127 for b in l)])
[b'SupplementalDisclosureOfNoncashInvestingAndFinancingActivitiesAbstract\t0'
 b'001654954-17-000551\t1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH I'
 b'NVESTING AND FINANCING ACTIVITIES:\t\n',
 b'HotelKranichhheMember\t0001558370-17-001446\t1\t0\tmember\tD\t\tHotel Krani'
 b'chhhe [Member]\tRepresents information pertaining to Hotel Kranichh\xf6h'
 b'e.\n']
Hotel Kranichh\xf6he decoded as Latin-1 is Hotel Kranichhöhe.
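A quick check confirms that decode:
>>> b'Hotel Kranichh\xf6he'.decode('latin-1')
'Hotel Kranichhöhe'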
There are also several 0x1C / 0x1D pairs in the file:
>>> f.seek(0)
0
>>> quotes = [l for l in f if any(b in {0x1C, 0x1D} for b in l)]
>>> quotes[0].split(b'\t')[-1][50:130]
b'Temporary Payroll Tax Cut Continuation Act of 2011 (\x1cTCCA\x1d) recognized during th'
>>> quotes[1].split(b'\t')[-1][50:130]
b'ributory defined benefit pension plan (the \x1cAetna Pension Plan\x1d) to allow certai'
I'm betting those are really U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK characters; note the 1C and 1D parts. It almost feels as if their encoder took UTF-16 and stripped out all the high bytes, rather than encoding to UTF-8 properly!
There is no codec shipping with Python that would encode '\u201C\u201D' to b'\x1C\x1D', making it all the more likely that the SEC has botched their encoding process somewhere. In fact, there are also 0x13 and 0x14 bytes that are probably en and em dashes (U+2013 and U+2014), as well as 0x19 bytes that are almost certainly single quotes (U+2019). All that is missing to complete the picture is a 0x18 byte to represent U+2018.
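A quick sketch backs up that theory: for all six of those punctuation characters, the mangled byte is exactly the low byte of the character's UTF-16-LE encoding:
>>> '\u2013\u2014\u2018\u2019\u201c\u201d'.encode('utf-16-le')[::2]
b'\x13\x14\x18\x19\x1c\x1d'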
If we assume that the encoding is broken, we can attempt to repair it. The following code would read the file and fix the quote and dash issues, assuming that the rest of the data does not use any characters outside of Latin-1 apart from those control codes:
_map = {
    # dashes
    0x13: '\u2013', 0x14: '\u2014',
    # single quotes
    0x18: '\u2018', 0x19: '\u2019',
    # double quotes
    0x1c: '\u201c', 0x1d: '\u201d',
}

def repair(line, _map=_map):
    """Repair mis-encoded SEC data. Assumes line was decoded as Latin-1"""
    return line.translate(_map)
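As a quick sanity check, applying it to one of the mangled sequences above (after decoding as Latin-1) brings the quotes back:
>>> repair('(\x1cTCCA\x1d)')
'(“TCCA”)'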
then apply that to the lines you read (inside your generator function):

with open(filename, 'r', encoding='latin-1') as f:
    repaired = map(repair, f)
    fields = next(repaired).strip().split('\t')
    for line in repaired:
        yield process_tag_record(fields, line)
Separately, addressing your posted code, you are making Python work harder than it needs to. Don't use codecs.open(); that's legacy code that has known issues and is slower than the newer Python 3 I/O layer. Just use open(). Do not use f.readlines(); you don't need to read the whole file into a list here. Just iterate over the file directly:
def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        fields = next(f).strip().split('\t')
        for line in f:
            yield process_tag_record(fields, line)
If process_tag_record also splits on tabs, use a csv.reader() object and avoid splitting each row manually:
import csv

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        reader = csv.reader(f, delimiter='\t')
        fields = next(reader)
        for row in reader:
            yield process_tag_record(fields, row)
If process_tag_record combines the fields list with the values in row to form a dictionary, just use csv.DictReader() instead:
def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        reader = csv.DictReader(f, delimiter='\t')
        # first row is used as keys for the dictionary, no need to read fields manually
        yield from reader
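Each row DictReader yields is then a dictionary keyed on the file's header names. A hedged usage sketch (the tag and tlabel column names are an assumption based on the tag.txt header row, which isn't shown above):
for row in tags('/tmp/2017q1/tag.txt'):
    print(row['tag'], row['tlabel'])  # assumed column names from the header row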