Wikipedia on XO!

A few weeks ago my coworker Scott made a post calling for Peruvian folk heroes to help OLPC out with infrastructure for our Peru deployment, and included a list of projects we’d like to work on. The post began like this:

Peru has ordered over 260,000 OLPC XO-1 laptops. These machines will be running Sugar on GNU/Linux. Forty thousand of these are already in warehouses in Peru, with Sugar builds 656 or 703 installed. That means over a quarter of a million kids will use Sugar/GNU/Linux in the next few months – and you can directly influence their lives! Your software, documentation, support expertise, ideas and insights can improve the education of a vast number of kids.

Wanted: Peruvian Folk Heroes. Will you become one?

From my perspective, the post was a big success. Later that day, Wade Brainerd — a game programmer and OLPC supporter — saw that I’d listed a Spanish Wikipedia snapshot on the list of project ideas and sent a mail asking for more details on the project.

The details: Patrick Collison recently released a project called wikipedia-iphone, a GPL’d program that takes a compressed Wikipedia dump (specifically, a single XML file compressed with bzip2) and builds an index from each article title to the bzip2 block that stores it. To retrieve an article, it decompresses only the block that holds that article’s text. It’s an awesome idea, and it means you can carry around the complete text of a language’s Wikipedia without using too much storage (480 MB compressed for the full text of the Spanish Wikipedia; 3.5 GB compressed for the full text of English).
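To make the indexing idea concrete, here is a minimal Python sketch. It is not the wikipedia-iphone code: instead of locating block boundaries inside one bzip2 stream, it compresses small groups of articles as independent bzip2 chunks and records where each chunk starts. The function names, the record format and the `chunk_size` parameter are all assumptions made for illustration.

```python
import bz2
import io


def build_archive(articles, chunk_size=2):
    """Compress articles in small groups and index them by title.

    Returns (archive_bytes, index), where index maps each title to the
    (offset, length) of the compressed chunk that contains it.
    """
    archive = io.BytesIO()
    index = {}
    titles = sorted(articles)
    for start in range(0, len(titles), chunk_size):
        group = titles[start:start + chunk_size]
        # Hypothetical record format: "title\x00text", records joined by "\x01".
        raw = "\x01".join(f"{t}\x00{articles[t]}" for t in group).encode("utf-8")
        compressed = bz2.compress(raw)
        offset = archive.tell()
        archive.write(compressed)
        for t in group:
            index[t] = (offset, len(compressed))
    return archive.getvalue(), index


def fetch_article(archive_bytes, index, title):
    """Decompress only the chunk that holds the requested article."""
    offset, length = index[title]
    raw = bz2.decompress(archive_bytes[offset:offset + length]).decode("utf-8")
    for record in raw.split("\x01"):
        name, _, text = record.partition("\x00")
        if name == title:
            return text
    raise KeyError(title)
```

The trade-off is the chunk size: larger chunks compress better, but each lookup has to decompress more text to reach one article.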

480 MB is still too large to put on the XO, though, and we have deployments in Uruguay, Peru and Mexico that don’t always have good connectivity. If we could get an archive under 100 MB, I thought, it would probably be small enough for countries to consider preloading it on all of their XOs by default, providing offline access to the most popular Wikipedia articles for kids who have XOs and their families.

Once Wade and I got started, plenty of help followed. Mel Chua wrote up a wiki page with the state of wikipedia-iphone and what would need to be done, while Wade worked on porting the wikipedia-iphone code from Ruby/Mongrel/Inline-C to Python/BaseHTTPServer/SWIG. Once that was done, Wade started trying to find a renderer from wikitext to HTML that would work well for us. (He chose mwlib.)
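For a rough picture of what serving articles from a small local web server looks like, here is a sketch using Python 3’s `http.server` module (the successor to the Python 2 `BaseHTTPServer` module). It is not the activity’s code: the `ARTICLES` dictionary, the `/wiki/<title>` path layout and the `render_page` helper are assumptions, and the wikitext rendering step is replaced by plain HTML escaping.

```python
from html import escape
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import unquote

# Hypothetical stand-in for the compressed article store.
ARTICLES = {"Lima": "Lima es la capital del Perú."}


def render_page(path, articles):
    """Return (status, html) for a request path such as '/wiki/Lima'."""
    title = unquote(path.rsplit("/", 1)[-1])
    text = articles.get(title)
    if text is None:
        return 404, f"<h1>{escape(title)}</h1><p>Article not stored locally.</p>"
    # A real build would convert wikitext to HTML here; this only escapes it.
    return 200, f"<h1>{escape(title)}</h1><p>{escape(text)}</p>"


class WikiHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, html = render_page(self.path, ARTICLES)
        body = html.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve articles on localhost; the port is an arbitrary choice."""
    HTTPServer(("127.0.0.1", port), WikiHandler).serve_forever()
```

Calling `serve()` and then opening `http://127.0.0.1:8000/wiki/Lima` in a browser would show the stored article.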

Madeleine, Ben Schwartz and I looked at which ranking metrics do a decent job of picking the articles a user is most likely to want. We store that likely subset on the XO, and if someone wants an article that falls outside it, we can link out to the local school server or the global net, if either is available. Three weeks and over 150 Git commits later, the finished activity is 98 MB, has around 30,000 articles and 3,000 images, and is available here.
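The selection step can be sketched as a greedy pick under a size budget. This is only an illustration: the post does not say which ranking metric was chosen, so the `score` values below are a placeholder for whatever metric is used, and the function names and the fallback labels are assumptions.

```python
def select_articles(candidates, budget_bytes):
    """Pick the highest-scoring articles that fit within a size budget.

    candidates is a list of (title, score, size_bytes) tuples.
    Returns (chosen_titles, bytes_used).
    """
    chosen = []
    used = 0
    for title, score, size in sorted(candidates, key=lambda c: c[1], reverse=True):
        # Skip anything that would overflow the budget; smaller,
        # lower-ranked articles may still fit afterwards.
        if used + size <= budget_bytes:
            chosen.append(title)
            used += size
    return chosen, used


def locate(title, local_titles, school_server_up=False, internet_up=False):
    """Decide where an article should be fetched from."""
    if title in local_titles:
        return "local"
    if school_server_up:
        return "school-server"
    if internet_up:
        return "internet"
    return "unavailable"
```

With a budget of 110 units and candidates of size 40, 50, 20 and 30 ranked in that order, the first three are kept and the fourth is left for the school server or the net.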

It was a blast to work on this with a group of volunteers, and to get to know them better in the process. One thing I love about working at OLPC is that a community of passionate and immensely competent developers is a stone’s throw away, provided you can clearly articulate what’s needed and why it’s important.

Working on this project has been bringing me back to Eben Moglen’s keynote at the 2006 Plone Conference (video, transcript). Eben describes the history of attempts to ameliorate human social inequality, and how traditional property being rivalrous and zero-sum has often led those attempts to fail due to the friction and violence that stems from trying to redistribute property from people who have it to people who don’t. He argues that we are past that now; that software is non-rivalrous, and so it’s no longer about “wealth redistribution” because we no longer need to take anything away from anyone; that the utopia of being able to share education and knowledge and the power that comes with them has never been so close. Being part of a team working on distributing hundreds of thousands of copies of Wikipedia to poor kids makes me inclined to agree with him about how close we are.

Eben goes on to describe the moral basis for free software — “If you could make as many loaves of bread as it took to feed the world, by baking one loaf and pressing a button, how could you justify charging more for bread than the poorest people could afford to pay?”

If we are indeed solving epic problems, we should be thankful to the people whose shoulders we’re standing on: the Wikipedians for enabling this transfer of knowledge in the first place; the wikipedia-iphone author for choosing to share his code; the mwlib authors who are reimplementing the MediaWiki PHP parser in Python; the determined volunteer OLPC developers who rallied around the idea and followed it through; and finally the creators of this cute green machine that allows us to facilitate a conversation for children to figure out how the world works, why, and how they can shape it for the better.


  1. This sounds awesome.

    Have you considered using another compression format? 7zip and rzip are both open formats that apparently achieve much higher compression, although in a quick test I just did on a 2 MB .txt book they were actually a bit bigger, though fairly close. I don’t have a local copy of Wikipedia to test on a large scale, where they might work better. I also don’t know how well they perform resource-wise or whether they allow individual file selection, but if they do work you might find you can shrink the archive further (or add more content).

    I also remember hearing in a Jimmy Wales talk that there was a fork of Spanish Wikipedia due to Wikipedia censorship/advertising possibilities, although looking it up on Wikipedia now, it seems the competitor has become less popular.

    Wouldn’t mind one of these in English ☺

  2. Hi!

    > Have you considered using another compression format?

    Yes, somewhat described at

    > I also remember hearing on a Jimmy Wales talk that there was a fork of Spanish Wikipedia

    Yeah, I took a look at Enciclopedia Libre, but couldn’t see how to download one of their dumps. They do have substantially fewer articles than the main Wikipedia now.

    > Wouldn’t mind one of these in English.

    That’s coming soon. 🙂 We went for Spanish since all the largest XO deployments are in Spanish-speaking countries, but we tried to keep everything generalized so that other countries can run a few steps to get their own builds.

  3. Hmmmm

    One little thought, I have read that you want to port Wikipedia to the “Activity” System of Sugar…

    With a target group of 6-year-olds [XO:1/2], you should take a moment to think again.

    If you had a Child’s Wikipedia then everything would be perfect. WP is way too much for the youngest (and even for higher classes)…

    In my humble point of view…

    Yours sincerely, FLOSS-ADVOCATE and cs-tech-student………Andreas_P

  4. I agree with you, Andreas. Of course, we don’t have a Child’s Wikipedia, and those of us working on OLPC are too busy doing that to create one. 🙂 If one existed, though, we would certainly distribute it.

    In the meantime, the best we can do for an encyclopedia is distribute the full Wikipedia. I do agree that it’s suboptimal, but it’s much better than nothing.

