Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
295 views
in Technique[技术] by (71.8m points)

xml parsing - What is the structure of wikipedia dumps?

I need the list of Hungarian words for a project and the only possible source I found is wikipedia XML dumps. They are really big, I guess I could parse them with a read stream and a SAX parser, but it would be nice to know more about the structure so I could test the code on a small example before running it on the big files. Is there a description somewhere about what structure they use and what the different XML gzip files contain? https://dumps.wikimedia.org/enwiki/latest/ https://dumps.wikimedia.org/huwiki/latest/

question from:https://stackoverflow.com/questions/65929652/what-is-the-structure-of-wikipedia-dumps

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

The format is documented here: https://www.mediawiki.org/wiki/Help:Export It looks like this:

  <mediawiki xml:lang="en">
    <page>
      <title>Page title</title>
      <restrictions>edit=sysop:move=sysop</restrictions>
      <revision>
        <timestamp>2001-01-15T13:15:00Z</timestamp>
        <contributor><username>Foobar</username></contributor>
        <comment>I have just one thing to say!</comment>
        <text>A bunch of [[Special:MyLanguage/text|text]] here.</text>
        <minor />
      </revision>
      <revision>
        <timestamp>2001-01-15T13:10:27Z</timestamp>
        <contributor><ip>10.0.0.2</ip></contributor>
        <comment>new!</comment>
        <text>An earlier [[Special:MyLanguage/revision|revision]].</text>
      </revision>
    </page>
    
    <page>
      <title>Talk:Page title</title>
      <revision>
        <timestamp>2001-01-15T14:03:00Z</timestamp>
        <contributor><ip>10.0.0.2</ip></contributor>
        <comment>hey</comment>
        <text>WHYD YOU LOCK PAGE??!!! i was editing that jerk</text>
      </revision>
    </page>
  </mediawiki>

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...