WARC is often thought of as a useful preservation format for websites and Web content, but it can also be a useful tool in your toolbox for Web maintenance work.

At work we are in the process of migrating a custom site developed over 10 years ago to give it a new home on the Web. The content has proven useful and popular enough over time that it was worth the investment to upgrade and modernize the site.

You can see in the Internet Archive that the Early Americas Digital Archive has been online at least since 2003. You can also see that it hasn’t changed at all since then. It may not seem like it, but that’s a long time for a dynamic site to be available. It speaks to the care and attention of a lot of MITH staff over the years that it’s still running, and that it is even possible to conceive of migrating it to a new location using a content management system that didn’t even exist when the website was born.

As a result of the move the URLs for the authors and documents in the archive will be changing significantly, and there are lots of links to the archive on the Web. Some of these links can even be found in books, so it’s not just a matter of Google updating their indexes when they encounter a permanent redirect. Nevertheless, we do want to create permanent redirects from the old location to the new location so these links don’t break.

If you are the creator of a website, itemizing the types of URLs that need to change may not be a hard thing to do. But for me, arriving on the scene a decade later, it was non-trivial to digest the custom PHP code, look at the database, and the filesystem and determine the full set of URLs that might need to be redirected.

So instead of code spelunking I decided to crawl the website, and then look at the URLs that are present in the crawled data. There are lots of ways to do this, but it occurred to me that one way would be to use wget to crawl the website and generate a WARC file that I could then analyze.

The first step is to crawl the site. wget is a venerable tool with tons of command line options. Thanks to the work of Jason Scott and Archive Team a --warc-file command line option was added a few years ago that serializes the results of the crawl as a single WARC file.

In my case I also wanted to create a mirror copy of the website for access purposes. The mirrored content is an easy way to see what the website looked like without needing to load the WARC file into a player of some kind like Wayback…but more on that below.

So here was my wget command:

wget --warc-file eada --mirror --page-requisites --adjust-extension --convert-links --wait 1 --execute robots=off --no-parent http://mith.umd.edu/eada/

The EADA website isn’t huge, but this ran for about an hour because I decided to be nice to the server with a one second pause between requests. When it was done I had a single WARC file that represented the complete results of the crawl: eada.warc.gz.

With this in hand I could then use Anand Chitipothu and Noufal Ibrahim’s warc Python module to read in the WARC file looking for HTTP responses for HTML pages. The program simply emits the URLs as it goes, and thus builds a complete set of webpage URLs for the EADA website.

import warc

from StringIO import StringIO
from httplib import HTTPResponse

class FakeSocket():
    def __init__(self, response_str):
        self._file = StringIO(response_str)
    def makefile(self, *args, **kwargs):
        return self._file

for record in warc.open("eada.warc.gz"):
    if record.type == "response":
        resp = HTTPResponse(FakeSocket(record.payload.read()))
        if resp.getheader("content-type") == "text/html":
            print record['WARC-Target-URI']

As you can probably see the hokiest part of this snippet is parsing the HTTP response embedded in the WARC data. Python’s httplib wanted the HTTP response to look like a socket connection instead of a string. If you know of a more elegant way of going from a HTTP response string to a HTTP Response object I’d love to hear from you.

I sorted the output and came up with a nice list of URLs for the website. Here is a brief snippet:


The URLs with docs= in them are particularly important because they identify documents in the archive, and seem to form the majority of inbound links. So there still remains work to map the old URLs to the new ones, but now we at least now what they are.

I mentioned earlier that the mirror copy is an easy way to view the crawled content without needing to start a Wayback or equivalent web archive player. But one other useful thing you can on your workstation is download Ilya Kreymer’s WebArchivePlayer for Mac or Windows, start it up, at which point it asks you to select a WARC file to view, which it then lets you view in your browser.

In case you don’t believe me, here’s a demo of this little bit of magic:

As I’ve with other web preservation work I’ve been doing at MITH I then took the WARC file, the mirrored content, the server side code and database export and put them in a bag which I then copied up to MITH’s S3 storage. Will the WARC file stand the test of time? I’m not sure. But the WARC file was useful to me here today. So there’s reason to hope.