Tuesday, November 10, 2009

Wikipedia Graph Generator

I've been working on this project for the past month or so, but now that it pretty much works I've lost interest. For those interested, this sort of graph is an example of the generated output: it shows pages as vertices and inter-page links as directed edges. The scripts can be found here: wikidown-20091110.zip
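
To make that structure concrete, here's a toy illustration (not code from the scripts, and the titles are invented):

    # Pages are vertices; a link from page u to page v is a directed edge.
    graph = {
        "Graph theory": {"Vertex (graph theory)", "Glossary of graph theory"},
        "Vertex (graph theory)": {"Graph theory"},
        "Glossary of graph theory": {"Graph theory", "Vertex (graph theory)"},
    }

    for source, targets in graph.items():
        for target in targets:
            print("%s -> %s" % (source, target))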

Since the data set is quite large (the compressed PostgreSQL database dump is over 550 MB), it'd be much easier for anyone who wants to check this out to generate the data themselves. The process is as follows:

Steps


  1. Run psql -U postgres wikidown2 < schema_create2
  2. Download this file from Wikipedia: enwiki-latest-pages-articles.xml.bz2 (5.3 GB)
  3. Run wikixml2csv.py enwiki-latest-pages-articles.xml.bz2 pages.lst links.lst. I originally used CSV here but later switched to an ASCII group separator; the 'csv' in the names is vestigial. A boiled-down sketch of this conversion appears after the list.
  4. Run csvlistfilter.sh links.lst links-sorted.lst.
  5. Run csvpagefilter.py pages.lst pages-presorted.lst.
  6. Run sort -k1,1nr -u pages-presorted.lst > pages-sorted.lst. (Note that sort needs a redirect or -o to write its output; naming a second file would just treat it as another input.)
  7. Unfortunately, due to a mismatch between the behavior of Python's hash algorithm and Postgres's tree-based indexing, a duplicate title will slip through. The only real solution is to keep retrying the following steps and editing pages.lst accordingly. Remember that whenever you delete a row in pages.lst, you must also delete any links in links.lst that point to it. A quick duplicate-title check is sketched after this list.
  8. Run bzip2 links.lst and bzip2 pages.lst.
  9. Run csv2psql.py pages.lst.bz2 links.lst.bz2. This step is considerably faster if both files sit on a tmpfs (RAM-backed) mount, provided your machine has enough RAM; tmpfs is not available on Windows. The core of the bulk load is sketched after this list.
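
For step 3, here's a boiled-down sketch of the conversion; the real script handles more cases, and the exact record layout (id from Python's built-in hash(), then the title, separated by the group separator) is a simplification consistent with the hash note in step 7:

    import bz2
    import re
    import sys
    import xml.etree.ElementTree as ET

    GS = "\x1d"  # ASCII group separator, used in place of commas
    LINK_RE = re.compile(r"\[\[([^\]|#]+)")  # target of a [[wikilink]]

    def convert(dump_path, pages_path, links_path):
        with bz2.open(dump_path, "rb") as dump, \
             open(pages_path, "w", encoding="utf-8") as pages, \
             open(links_path, "w", encoding="utf-8") as links:
            title = None
            for _, elem in ET.iterparse(dump):
                tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace
                if tag == "title":
                    title = elem.text
                elif tag == "text" and title and elem.text:
                    pages.write("%d%s%s\n" % (hash(title), GS, title))
                    for target in LINK_RE.findall(elem.text):
                        links.write("%s%s%s\n" % (title, GS, target))
                elif tag == "page":
                    elem.clear()  # release parsed content on a multi-GB stream
                    title = None

    if __name__ == "__main__":
        convert(sys.argv[1], sys.argv[2], sys.argv[3])

Streaming with iterparse and clearing each finished page element matters here: the decompressed dump is far too large to parse into memory in one go.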
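
For step 7, a quick way to spot the offending titles; this is a sketch rather than one of the scripts, and it assumes the id-then-title record layout above:

    import collections
    import sys

    GS = "\x1d"

    def duplicate_titles(pages_path):
        counts = collections.Counter()
        with open(pages_path, encoding="utf-8") as pages:
            for line in pages:
                # Assumes each record is: id <GS> title
                counts[line.rstrip("\n").split(GS, 1)[1]] += 1
        return [title for title, n in counts.items() if n > 1]

    if __name__ == "__main__":
        for title in duplicate_titles(sys.argv[1]):
            print(title)

If it's the hashed ids rather than the titles that collide, the same scan on the first field instead of the second will find them.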
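
For step 9, the core of the load is a bulk COPY straight from the compressed files; this is a simplified sketch, and the table names are stand-ins for whatever schema_create2 actually defines:

    import bz2
    import sys
    import psycopg2

    def load(pages_bz2, links_bz2):
        conn = psycopg2.connect(dbname="wikidown2", user="postgres")
        with conn, conn.cursor() as cur:
            # COPY streams rows in bulk, far faster than per-row INSERTs.
            with bz2.open(pages_bz2, "rt", encoding="utf-8") as f:
                cur.copy_expert("COPY pages FROM STDIN WITH DELIMITER E'\\x1d'", f)
            with bz2.open(links_bz2, "rt", encoding="utf-8") as f:
                cur.copy_expert("COPY links FROM STDIN WITH DELIMITER E'\\x1d'", f)

    if __name__ == "__main__":
        load(sys.argv[1], sys.argv[2])

This is also why the duplicate in step 7 has to be fixed first: a single unique-constraint violation aborts the whole COPY.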

Once you've run those steps, you can run subgraph.py to extract and render subgraphs like the example linked above.
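
To give a sense of what such an extraction can look like, here's a rough sketch of one way to do it; this is not subgraph.py itself, and the links table's source/target column names are stand-ins:

    import collections
    import psycopg2

    def neighborhood(seed_title, depth=2):
        """BFS outward from one page, collecting the directed edges reached."""
        conn = psycopg2.connect(dbname="wikidown2", user="postgres")
        edges = []
        frontier = collections.deque([(seed_title, 0)])
        seen = {seed_title}
        with conn, conn.cursor() as cur:
            while frontier:
                title, d = frontier.popleft()
                if d == depth:
                    continue
                cur.execute("SELECT target FROM links WHERE source = %s", (title,))
                for (target,) in cur.fetchall():
                    edges.append((title, target))
                    if target not in seen:
                        seen.add(target)
                        frontier.append((target, d + 1))
        return edges

    if __name__ == "__main__":
        # Emit Graphviz DOT, one way to get a picture like the example graph.
        print("digraph wiki {")
        for source, target in neighborhood("Graph theory"):
            print('  "%s" -> "%s";' % (source, target))
        print("}")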
