Updated; the confusion here comes down to two points:
- the root object is Relation, not Document (in fact, only Relation and RelationMentionRef are even used)
- the pb file is actually multiple objects, each varint-delimited, i.e. prefixed by their length expressed as a varint
As such, Relation.parseDelimitedFrom should work. Processing it manually, I get:
test-multiple.pb, 96678 Relation objects parsed
testNegative.pb, 94917 Relation objects parsed
testPositive.pb, 1950 Relation objects parsed
trainNegative.pb, 63596 Relation objects parsed
trainPositive.pb, 4700 Relation objects parsed
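For illustration, the manual processing described above can be sketched in a few lines of Python. This is a minimal sketch of varint-delimited framing only, not the code used to produce the counts above: it does not decode the Relation payloads, the helper names (read_varint, iter_delimited, count_messages) are made up for this example, and the demo stream is synthetic.

```python
import io

def read_varint(stream):
    """Read one base-128 varint; return None on a clean EOF before its first byte."""
    result = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            if shift == 0:
                return None  # clean end of stream, no more messages
            raise EOFError("stream ended in the middle of a varint")
        result |= (byte[0] & 0x7F) << shift  # low 7 bits carry the payload
        if not (byte[0] & 0x80):             # high bit clear: last byte of the varint
            return result
        shift += 7

def iter_delimited(stream):
    """Yield the raw bytes of each length-prefixed message in turn."""
    while True:
        length = read_varint(stream)
        if length is None:
            return
        payload = stream.read(length)
        if len(payload) != length:
            raise EOFError("stream ended in the middle of a message")
        yield payload

def count_messages(stream):
    """Count the length-prefixed messages without decoding them."""
    return sum(1 for _ in iter_delimited(stream))

# Synthetic stream (an assumption for the demo): payloads of 3, 0 and 200 bytes.
# 200 does not fit in 7 bits, so its length prefix is the two-byte varint c8 01.
demo = io.BytesIO(b"\x03abc" + b"\x00" + b"\xc8\x01" + b"x" * 200)
print(count_messages(demo))  # 3
```

On a real file, each yielded payload would be handed to the generated Relation parser; opening the file with open(path, "rb") in place of the BytesIO is the only change needed for counting.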
Old; outdated; exploratory:
I extracted your 4 documents and ran them through a little test rig:
ProcessFile("testNegative.pb");
ProcessFile("testPositive.pb");
ProcessFile("trainNegative.pb");
ProcessFile("trainPositive.pb");
where ProcessFile first dumps the first 10 bytes as hex, and then tries to process the file via a ProtoReader. Here are the results:
Processing: testNegative.pb
dc 16 0a 26 2f 67 75 69 64 2f
> Document
Unexpected end-group in source data; this usually means the source data is corrupt
Yep; agreed; DC 16 is a two-byte varint that decodes as wire-type 4 (end-group), field 363; your document does not define field 363, and even if it did: it is meaningless to start with an end-group.
Processing: testPositive.pb
d5 0f 0a 26 2f 67 75 69 64 2f
> Document
250: Fixed32, Unexpected field
14: Fixed32, Unexpected field
6: String, Unexpected field
6: Variant, Unexpected field
Unexpected end-group in source data; this usually means the source data is corrupt
Here we can't see the offending data in the hex dump, but again: the initial fields look nothing like your data, and the reader readily confirms that the data is corrupt.
Processing: trainNegative.pb
d1 09 0a 26 2f 67 75 69 64 2f
> Document
154: Fixed64, Unexpected field
7: Fixed64, Unexpected field
6: Variant, Unexpected field
6: Variant, Unexpected field
Unexpected end-group in source data; this usually means the source data is corrupt
Same as above.
Processing: trainPositive.pb
cf 75 0a 26 2f 67 75 69 64 2f
> Document
1881: 7, Unexpected field
Invalid wire-type; this usually means you have over-written a file without truncating or setting the length; see http://stackoverflow.com/q/2152978/23354
CF 75 is a two-byte varint with wire-type 7 (which is not defined in the specification).
Your data is well and truly garbage. Sorry.
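To make the arithmetic behind those reader messages concrete, here is a small sketch that decodes the leading varint of each file as a protobuf field header (field number in the upper bits, wire-type in the low 3 bits). The function names are made up for this example; the hex prefixes are the ones from the dumps above. Note that decoding both bytes of dc 16 gives field 363 with wire-type 4.

```python
# Wire-type numbers defined by the protobuf encoding; 6 and 7 are undefined.
WIRE_TYPES = {
    0: "varint",
    1: "fixed64",
    2: "length-delimited",
    3: "start-group",
    4: "end-group",
    5: "fixed32",
}

def decode_varint(data):
    """Decode a varint from the start of data; return (value, bytes_consumed)."""
    value = 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * i)  # low 7 bits, least significant group first
        if not (byte & 0x80):              # high bit clear: final byte
            return value, i + 1
    raise ValueError("truncated varint")

def decode_tag(data):
    """Interpret the leading varint as a field header: (field_number, wire_type)."""
    value, _ = decode_varint(data)
    return value >> 3, value & 0x07

# First two bytes of each file, copied from the hex dumps above.
PREFIXES = {
    "testNegative.pb": "dc 16",
    "testPositive.pb": "d5 0f",
    "trainNegative.pb": "d1 09",
    "trainPositive.pb": "cf 75",
}

for name, hex_bytes in PREFIXES.items():
    field, wire = decode_tag(bytes.fromhex(hex_bytes))
    print(f"{name}: field {field}, wire-type {wire} ({WIRE_TYPES.get(wire, 'undefined')})")
# testNegative.pb: field 363, wire-type 4 (end-group)
# testPositive.pb: field 250, wire-type 5 (fixed32)
# trainNegative.pb: field 154, wire-type 1 (fixed64)
# trainPositive.pb: field 1881, wire-type 7 (undefined)
```

The field numbers 250, 154 and 1881 match what the reader reported as its first unexpected field in each case.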
And with the bonus round of test-multiple.pb from comments (after gz decompression):
Processing: test-multiple.pb
dc 16 0a 26 2f 67 75 69 64 2f
> Document
Unexpected end-group in source data; this usually means the source data is corrupt
This starts identically to testNegative.pb, and hence fails for exactly the same reason.
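Read in light of the update at the top, the same 10 bytes make sense once the leading varint is treated as a length prefix rather than a field header. The sketch below re-reads the dump that way; the function name is made up for this example, and it only looks at the 10 dumped bytes, so it says nothing about the rest of the message.

```python
def decode_varint_at(data, pos=0):
    """Decode a varint starting at data[pos]; return (value, next_position).

    Raises IndexError if the data ends before the varint does.
    """
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not (byte & 0x80):  # high bit clear: final byte of the varint
            return value, pos
        shift += 7

# The 10-byte dump shared by testNegative.pb and test-multiple.pb.
dump = bytes.fromhex("dc 16 0a 26 2f 67 75 69 64 2f")

message_length, pos = decode_varint_at(dump)      # outer length prefix
tag, pos = decode_varint_at(dump, pos)            # first field header inside the message
field_number, wire_type = tag >> 3, tag & 0x07
string_length, pos = decode_varint_at(dump, pos)  # length of that field's payload
visible = dump[pos:].decode("ascii")              # the part of the payload we can see

print(message_length)            # 2908
print(field_number, wire_type)   # 1 2
print(string_length)             # 38
print(visible)                   # /guid/
```

So the first message is 2908 bytes long and opens with field 1, a length-delimited value of 38 bytes that begins with "/guid/"; the other three files share the same 0a 26 2f 67 75 69 64 2f pattern after their own two-byte length prefixes.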