Writing your own wiki analysis

Why you should consider basing your code on WikiTrust

There are two main difficulties in writing a Wikipedia analysis from scratch:
  • The English Wikipedia, which is the Wikipedia of most interest to researchers, is huge. Tools that work on the English Wikipedia must be able to parse and process this amount of information efficiently and robustly. This is not trivial! The Wikipedia is a sort of Library of Babel (as in the famous story by Borges), where everything that can be written has been written at least once. Here are some examples:
    • You think of putting consecutive word triples in a hash table, just like that? Think again! Somewhere lies buried a revision that contains the word 'devil' (or some such; my memory is now fuzzy) millions of times. All these entries end up in the same hash bucket, of course - and that is disastrous for processing time. 
    • You think of parsing the text a little bit? Good luck! Wiki markup is not like a programming language, where there is right and wrong. In wiki markup, if the wiki engine renders it ok, then it is ok: any misformatted markup is fine, as long as it looks ok. The official description is not what is actually used! Wiki markup is the wild west. One is truly thankful to the people in CS who developed parsing and compiler theory, after looking at what happens when these are disregarded.
  • Tracking text in the Wikipedia is far from obvious. You cannot simply compare each revision with the preceding one, and so forth. Text can be deleted in going from rev0 to rev1, stay deleted in going from rev1 to rev2, and finally be re-introduced in rev3. You cannot consider the text reintroduced in rev3 as new: if you do, then people who undo spam appear to introduce text. We measured this on the Wikipedia, and found very many instances where big blocks of text remain deleted for a few revisions before being reintroduced. Also, text often moves around across revisions, so that blocks of text change relative position: you cannot use text comparison algorithms that cannot deal with block moves (such as Unix diff and wdiff). And since there is a lot of text in the Wikipedia, the text tracking algorithms need to be fast.
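To make the deletion-and-reintroduction problem concrete, here is a deliberately simplified sketch in Python (WikiTrust itself is written in OCaml and uses a far more sophisticated, block-based algorithm that also handles moves and positions; this toy version works word by word and ignores ordering):

```python
# Illustrative sketch only, not WikiTrust's algorithm: the key idea is to
# match new text against ALL text seen so far -- live and deleted -- so that
# reverting spam is not credited as new authorship.

def attribute_words(revisions):
    """revisions: list of (author, list_of_words) in chronological order.
    Returns {word: author who first introduced it}."""
    origin = {}  # word -> author of first introduction (live or dead)
    for author, words in revisions:
        for w in words:
            if w not in origin:
                origin[w] = author   # genuinely new text
            # else: w was seen before (possibly deleted in between);
            # its reintroduction keeps the original author.
    return origin

revs = [
    ("alice",   ["the", "quick", "fox"]),  # rev0: alice writes the page
    ("spammer", ["buy", "pills"]),         # rev1: spammer blanks the page
    ("bob",     ["the", "quick", "fox"]),  # rev2: bob reverts the spam
]
print(attribute_words(revs))
# the restored words stay attributed to alice, not to bob
```

A naive revision-to-revision diff would credit bob with all the text in rev2; matching against the deleted pool avoids that.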
We spent some time getting the text tracking and edit analysis algorithms right and fast while developing WikiTrust. The result is an infrastructure that can track text and compare revisions across huge wikis in a way that is fast, robust, and parallelizable, and that deals with all kinds of text block moves, deletions, reinsertions, and more.

How to write your analysis on top of WikiTrust

To write a new analysis in WikiTrust, you need to:
  1.  Define a subclass of the Page class, in a file called, for instance, example_analysis.ml.  This class must have the following methods:
    • add_revision: this method is passed a new revision, with the main elements (user id, username, text, timestamp, and more) already extracted from the XML.  The method can do whatever it wants, but typically it builds a revision object and then processes it.  A revision object makes the pre-parsed text available, among other things.  You can also use methods for comparing and tracking the text.
    • print_id_title: this trivial method prints the page id and page title of the page.
    • eval: this method is called once no more revisions are present, and does any last processing necessary. 
  2. Add a few lines of code to page_factory.ml to ensure that, if you use the command-line option -example_analysis, Page_factory creates pages of the new subclass you just defined in example_analysis.ml. 
  3. Add a few lines to the Makefile to ensure your analysis is compiled and linked with the rest of WikiTrust.
  4. Compile your code:
    • make all
  5. Run it:
    • ./evalwiki -d <destination_dir> -example_analysis <source_file.xml.gz>
  6. You will find the results in the destination directory selected above.
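The skeleton of such an analysis, transliterated into Python purely for illustration (WikiTrust is written in OCaml; the method names follow the description above, while the stub Page base class and the argument lists here are assumptions, not WikiTrust's actual signatures):

```python
class Page:
    """Hypothetical stand-in for WikiTrust's Page class (OCaml in reality)."""
    def __init__(self, page_id, title, out):
        self.page_id, self.title, self.out = page_id, title, out

class ExampleAnalysis(Page):
    """Counts revisions per page -- the simplest possible analysis."""
    def __init__(self, page_id, title, out):
        super().__init__(page_id, title, out)
        self.n_revisions = 0

    def add_revision(self, user_id, username, timestamp, text):
        # Called once per revision; metadata already extracted from the XML.
        self.n_revisions += 1

    def print_id_title(self):
        print(self.page_id, self.title, file=self.out)

    def eval(self):
        # Called when no more revisions are present: emit the results.
        print(f"{self.page_id}: {self.n_revisions} revisions", file=self.out)

import io
buf = io.StringIO()  # stands in for the open .out file
p = ExampleAnalysis(42, "Example page", buf)
p.add_revision(7, "alice", "2004-01-01T00:00:00Z", "hello world")
p.add_revision(8, "bob", "2004-01-02T00:00:00Z", "hello there")
p.eval()
print(buf.getvalue(), end="")  # 42: 2 revisions
```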

Example: measuring the word-days contribution of users

As an example, suppose you want to compute the word-days of all users.   The word-days is a contribution measure which captures how much text the user added, and for how long the text remained part of the most recent revision.  For instance, if a user inserts 10 words in revision R1, 6 of those words are then deleted in revision R2, and the remaining 4 words are deleted in revision R3, then the word-days of the user are:

6 * [t(R2) - t(R1)] + 4 * [t(R3) - t(R1)]
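In Python, for illustration (the timestamps below are made up), the computation for this example works out as:

```python
from datetime import datetime

t1 = datetime(2004, 1, 1)   # t(R1): the 10 words are inserted
t2 = datetime(2004, 1, 11)  # t(R2): 6 of them are deleted
t3 = datetime(2004, 1, 31)  # t(R3): the remaining 4 are deleted

def days(delta):
    """Length of a timedelta, in days."""
    return delta.total_seconds() / 86400

word_days = 6 * days(t2 - t1) + 4 * days(t3 - t1)
print(word_days)  # 6*10 + 4*30 = 180.0
```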

Implementing this analysis in WikiTrust is simple: WikiTrust already provides all the ingredients you need, including XML text parsing, metadata parsing, text comparison, and text author tracking.  Indeed, the code to compute this analysis was introduced in commit aa370290 to our repository.   If you look at that commit, you will see exactly which changes were needed to create this new analysis.  You can run the new analysis with the command:

./evalwiki -d <destination_dir> -eval_contrib <source_file.xml.gz>

Some explanation of the code is as follows.

WikiTrust parses pages from the compressed dump file.  For every page, it creates an object of a class Page, which has two main methods:
  • add_revision is used to add a new revision;
  • eval is called when all the page revisions have been added.
The object of class Page typically creates an object of class Revision for each revision it receives.  The act of creating a revision object also causes all the text of the revision to be parsed, so that you can easily access the list of words and the list of syntactic units.  See also the list of useful functions for more information.

The class Page is given an open file where it can write whatever output it wishes.  The file has the same name as the wiki dump file passed as input, except that the extension is .out.