The byte at 15344 is 0x96. Presumably at position 15343 there is either a single-byte encoding of a character, or the last byte of a multiple-byte encoding, making 15344 the start of a character. 0x96 is in binary 10010110, and any byte matching the pattern 10XXXXXX (0x80 to 0xBF) can only be a second or subsequent byte in a UTF-8 encoding.
Hence the stream is either not UTF-8 or else is corrupted.
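The bit-pattern argument above can be checked directly. This is a minimal sketch; the helper name is_continuation_byte is made up for illustration and is not part of any library.

```python
def is_continuation_byte(b: int) -> bool:
    """Return True if b matches 10xxxxxx, i.e. lies in 0x80..0xBF."""
    # Mask off the top two bits and compare them with the pattern 10.
    return (b & 0b11000000) == 0b10000000

# 0x96 written out as eight binary digits.
bits = format(0x96, "08b")
print(bits, is_continuation_byte(0x96))

# A lone continuation byte cannot start a character, so strict UTF-8
# decoding rejects it.
try:
    b"\x96".decode("utf-8")
except UnicodeDecodeError as exc:
    print("UTF-8 decoding failed:", exc.reason)
```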
Examining the URI you link to, we find the header:
Content-Type: text/html
Since there is no encoding stated, we should use the default for HTTP, which is ISO-8859-1 (aka "Latin 1").
Examining the content we find the line:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
This is a fallback mechanism for people who are, for some reason, unable to set their HTTP headers correctly. This time we are explicitly told the character encoding is ISO-8859-1.
As such, there's no reason to expect reading it as UTF-8 to work.
For extra fun, though: in ISO-8859-1, 0x96 encodes U+0096, the control character "START OF GUARDED AREA", which makes no sense in running text, so ISO-8859-1 isn't correct either. It seems the people who created the page made an error similar to yours.
From context, it would seem that they actually used Windows-1252, as in that encoding 0x96 encodes U+2013 (EN DASH, which looks like –).
So, to parse this particular page you want to decode in Windows-1252.
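To see the difference between the two single-byte encodings, here is a small sketch. The sample bytes are made up for illustration and are not taken from the page in question.

```python
import unicodedata

# Hypothetical sample: a year range with byte 0x96 between the numbers.
raw = b"1990\x962000"

# ISO-8859-1 maps every byte to the code point with the same number,
# so 0x96 becomes the invisible control character U+0096.
as_latin1 = raw.decode("iso-8859-1")

# Windows-1252 (Python codec name "cp1252") maps 0x96 to U+2013.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))
print(as_cp1252, unicodedata.name(as_cp1252[4]))
```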
More generally, you want to examine the headers when picking a character encoding. That would perhaps have been incorrect in this case (or perhaps not: more than a few "ISO-8859-1" codecs are actually Windows-1252), but you'll be correct more often. You still need something to catch failures like this one by reading with a fallback. The decode method takes a second parameter called errors. The default is 'strict', but you can also pass 'ignore', 'replace', 'xmlcharrefreplace' (not appropriate) or 'backslashreplace' (not appropriate), and you can register your own fallback handler with codecs.register_error().
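As a rough sketch of how the errors parameter behaves, the following compares 'strict', 'ignore' and 'replace' with a custom handler. The handler name cp1252_fallback and the sample bytes are assumptions made for this example; only codecs.register_error() and the errors argument of decode come from the standard library.

```python
import codecs

def cp1252_fallback(exc):
    """Hypothetical handler: reinterpret undecodable bytes as Windows-1252."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which to resume.
    return bad.decode("cp1252", errors="replace"), exc.end

codecs.register_error("cp1252_fallback", cp1252_fallback)

# Made-up sample: valid UTF-8 ("caf\u00e9") followed by a stray 0x96 byte.
raw = b"caf\xc3\xa9 1990\x962000"

try:
    raw.decode("utf-8")  # errors defaults to 'strict'
except UnicodeDecodeError as exc:
    print("strict:", exc.reason)

ignored = raw.decode("utf-8", errors="ignore")     # drops the bad byte
replaced = raw.decode("utf-8", errors="replace")   # inserts U+FFFD
recovered = raw.decode("utf-8", errors="cp1252_fallback")

print(repr(ignored))
print(repr(replaced))
print(repr(recovered))
```

A handler like this only makes sense when the data is mostly valid UTF-8 with a few stray Windows-1252 bytes. For a page that is Windows-1252 throughout, decoding the whole thing as cp1252 is the simpler choice.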