This is a draft of a talk I’m giving at SIGCIS on October 29, 2017. It’s part of a larger article that I will hopefully publish shortly or drop in a pre-print repository.


As the World Wide Web has become a prominent, if not the predominant, form of global communications and publishing over the last 25 years, we have seen the emergence of web archiving as an increasingly important activity. The web is an immensely large and constantly changing information landscape that fundamentally resists the idea of “archiving it all” (Masanès, 2006). The web is also a site for constant breakdown in the form of broken links, failed business models, unsustainable infrastructure, obsolescence and general neglect. Web archiving projects work in varying measures to stem this tide of loss, to save what is deemed worth saving before it is 404 Not Found. In many ways you can think of web archiving as a form of repair or maintenance work that is conducted by archivists in collaboration with each other, as well as with tools and infrastructures (Graham & Thrift, 2007; Jackson, 2014).

In this presentation I will describe some research I’ve been doing into how web archives are assembled and why I think this matters for historians of technology. What follows is essentially what Brügger (2012b) calls a web historiography, where the focus is on the web as a particular technology of history rather than a particular history of web technology. The web and, by extension, web archives provide a singular view of life and culture since the web’s inception 25 years ago. Understanding how and why web archives are assembled is an important task for the scholars who are attempting to use them (Maemura, Becker, & Milligan, 2016). As we will see, it is the network of relationships and connections that a web archive is involved with that makes it an archive.

By web archives I specifically mean archives of web content, not necessarily archives that are on the web. Brügger distinguishes between three types of content that can be found on the web:

  • digitized: content that has been converted to digital format by some means (image scanning, transcription, etc) and then placed on the web.
  • born-digital: content that is created digital (word processor files, blog posts, social media, digital photographs, etc) and can be naturally found on the web.
  • reborn-digital: digitized or born-digital content that has been collected and preserved from the web, and then re-presented as part of a web archive.

It is this third category of reborn-digital content that I’m concerned with here. A prime example is the Internet Archive, which I imagine some of you have used as a source of material in your own research. There are now thousands of organizations around the world collecting web content for a variety of archival purposes.

The question of what web content ends up in an archive, and how, is of historiographical significance, because history is necessarily shaped by the evidence of the past that survives into the present. Since it is physically impossible to archive everything, archives have always contained gaps or silences. Trouillot (1995) provides a framework for thinking about the moments in which these silences enter the archive:

Silences enter the process of historical production at four crucial moments: the moment of fact creation (the making of sources); the moment of fact assembly (the making of archives); the moment of fact retrieval (the making of narratives); and the moment of retrospective significance (the making of history in the final instance).

Given the significance of the making of archives to the making of history, and the abundance of material on the web, how do archivists decide what to save?

Archivists have traditionally used the term appraisal to describe the process of determining the value of records, in order to justify their inclusion in the archive. While notions of value and the methods for measuring it differ, the activity of appraisal is central to the work of the archivist. To further specify this moment in which content becomes archival, Ketelaar (2001) introduced the neologism archivalization as

the conscious or unconscious choice (determined by social and cultural factors) to consider something worth archiving. Archivalization precedes archiving. The searchlight of archivalization has to sweep the world for something to light up in the archival sense, before we can proceed to register, to record, to inscribe it, in short before we archive it.

In order to better understand this process of lighting up web content in web archives I conducted 30 interviews with web archivists, software developers, researchers and activists to discover how they decide to preserve web content. Inspired by the work of Suchman (1995), Star (1999) and Kelty (2008) these were ethnographic interviews that aimed to develop a thick description of how practitioners enact appraisal in their particular work environments.

In the first pass at analysis I coded the jottings and field notes generated. These provided a detailed picture of the sociotechnical environment in which appraisal work is being performed (Summers & Punzalan, 2017). However, questions still remained about the particular psychological or social context for the decision making process around moments of archivalization in web archives.

On a second pass I performed a critical discourse analysis on the interview transcripts themselves. I selected critical discourse analysis (CDA) because it offers a theoretical framework for analyzing the way in which participants’ use of language reflects identity formation, figured worlds and communities of practice, while also speaking to the larger sociocultural context that web archiving work is taking place within.

A Discourse is a socially accepted association among ways of using language, of thinking, feeling, believing, valuing, and of acting that can be used to identify oneself as a member of a socially meaningful group or ‘social network’, or to signal (that one is playing) a socially meaningful ‘role’. (J. Gee, 2015, p. 143)

CDA provides a theoretical framework for empirically studying the way that form and function operate in language, and how this analysis can provide insight into social practices. One of CDA’s key proponents is James Gee, whose 7 building tasks provided me with a guide for analyzing my interview transcripts to gain insight into practices of appraisal in web archives (J. P. Gee, 2014). The 7 building tasks include:

  • Significance: how is language used to foreground and background certain things?
  • Activities: how is language being used to enact particular activities?
  • Identity: how is language being used to position specific identities and make them recognizable?
  • Relationships: what relationships are signaled in the use of language?
  • Politics: how are notions of value and norms established in the use of language?
  • Connections: how is language used to connect and disconnect ideas, activities, objects?
  • Sign systems and knowledge: how does language position (privilege or disprivilege) particular sign systems, or ways of knowing and believing?

There’s not enough time for me to get into all the details of my findings here, but I would like to share a brief look at what this analysis looks like as a way of introducing my key findings. All the names used in the transcriptions are pseudonyms in order to allow the participants to be themselves as much as possible.

Line  Speaker  Utterance
41    Jim      Well Alex helped me get in contact with the employees /
42             Alex was already on the ground with it.
43    Ed       Oh okay //
44    Jim      and Alex /
45             KNEW /
46             that it was going to be a lot of data /
47             and was like /
48             ok so [be a little more] /
49    Ed             [ahhhh]
50    Jim      careful with this

Here I am interviewing Jim, who works at a non-profit web archiving organization. I selected this snippet because it highlights how discourse reflects the relationships that are involved in the appraisal process. Just before this snippet Jim is talking about how he wasn’t sure whether a particular video streaming site could be archived because of the amount of data involved. He sought the advice of his immediate supervisor Ariana, who then brought in Alex, who is the Director of the archive. It turned out that the Director had a connection with a staff person who was working at the video streaming company, who could provide key information about the amount of data that needed to be archived. Here Jim is using the hierarchical, chain-of-command relationships to lend weight and formality to what is actually a much richer set of circular relationships within the organization. The relationships also extended outside the archive and into the organization that had created the video content.

We see this pattern reflected in another interview with Jack, an archivist at a large university who has been working to document the activities of the fracking industry within his state.

Line  Speaker  Utterance
1     Jack     I really see like one of / my next curatorial responsibilities being /
2              not really more crawling or more selecting /
3              but using the connections I’ve made here /
4              to get more contact and more dialogue going with /
5              with the actual communities I’ve been documenting //
6              And I’m a little nervous about how it’s gonna go /
7              because I went ahead and crawled a bunch of stuff /
8              without really doing that in advance //

Here Jack is explicitly describing “connections” or relationships as an essential part of his job as an archivist. Just before this snippet he had finished describing how he got the idea to document fracking from a web archivist at another institution, who was already engaged in documenting fracking in his state. Jack’s interest in documenting environmental issues had developed while working with a mentor at a previous university. Jack wanted to collaborate with this archivist to better document fracking activity as it extends across geopolitical boundaries. He sought approval from the Associate Dean of the Library, who was very supportive of the idea. However, as this snippet illustrates, Jack sees these professional relationships as necessary but not sufficient for doing the work. He sees dialogue with the communities being documented, in this case activist communities, as an important dimension of the work of web archiving.

In addition to focusing on relationships, I drew on Gee’s Making Strange Tool, a discourse analysis technique for foregrounding what might otherwise slip into the background:

In any communication, listeners/readers should try to act as if they were outsiders.

The pairing of crawling and selecting on line 2 is one that Jack uses several times in the interview. Crawling refers to the behavior of software used to collect content from the web. This software was originally referred to as a web spider because of the way it automatically and recursively follows links in web content for some period of time. But web spiders need to be told by a person where to begin crawling, which is the process of selection.
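For readers who have not seen a crawler up close, the division of labor between selecting and crawling can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the software my participants use: the tiny in-memory `TOY_WEB`, the `crawl` function and the `max_hops` limit are all hypothetical names, and the sketch walks links breadth-first from a queue rather than fetching real pages.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs it links to.
# A real crawler would fetch pages over HTTP and parse links out of the HTML.
TOY_WEB = {
    "https://example.org/": ["https://example.org/a", "https://example.org/b"],
    "https://example.org/a": ["https://example.org/c"],
    "https://example.org/b": ["https://example.org/"],
    "https://example.org/c": ["https://example.org/d"],
    "https://example.org/d": [],
}


def crawl(seeds, max_hops, web=TOY_WEB):
    """Follow links breadth-first from the seeds, up to max_hops links away."""
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    collected = []
    while queue:
        url, hops = queue.popleft()
        collected.append(url)
        if hops == max_hops:
            continue  # scope limit reached: do not follow links from here
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, hops + 1))
    return collected


# Selection: a person chooses the seed. Crawling: the software does the rest.
pages = crawl(["https://example.org/"], max_hops=2)
```

In this toy run the page three links away from the seed is never collected. The person's choice of seed and scope, not the software, decides what has a chance of entering the archive.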

If you are thinking that selection and appraisal sound similar, that’s because they are practically synonyms for each other. Both terms are concerned with identifying material that is of enduring value for preservation in an archive. Appraisal speaks to the theory, method or framework that is used for performing the activity of selection.

In physical archives, boxes of paper manuscripts, files, diskettes or hard drives change hands. A retiring researcher donates their personal papers or workstation to an archive. Or a particular business unit transfers a set of material to an archive according to a previously agreed upon record retention program. In either case a relationship between the record creator or owner and the archive is established. This relationship is intrinsic to the appraisal process.

But in web archives this material transaction is not necessary or it is transformed almost beyond recognition. The architecture and infrastructure of the web, as well as the underlying Internet, allow content to be instantly retrieved across vast distances. You only need to know the URL for the resource and to instruct your web client (be it a browser or a crawler) to retrieve it. When it is all working. As noted by Brügger (2012a) the reliability of archived copies of web content is not a given. Features of the HTTP protocol, such as cookies (Barth, 2011) and caching (Fielding, Nottingham, & Reschke, 2014) combined with the rendering capabilities of the client software mean that the idea of a single idealized, canonical representation of a web resource retreats from view. This seeming immateriality of web content is an illusion generated by the very real assemblage of physical networks, computing machinery, storage devices, electrical grids and cooling units that must operate in concert to deliver access.
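A small simulation can show why a canonical representation is so elusive. The sketch below is a toy model and makes no claim to implement the cookie or caching specifications: the `serve` function, the `CachingClient` class and the integer `clock` are hypothetical stand-ins for an origin server, a web client and the passage of time.

```python
def serve(path, cookies, clock):
    """Hypothetical origin server: the body depends on a cookie and on time."""
    greeting = "Welcome back" if cookies.get("session") else "Please sign in"
    return f"{greeting} | {path} | rendered at t={clock}"


class CachingClient:
    """Toy client that reuses a stored response while it is still fresh."""

    def __init__(self, cookies=None, max_age=60):
        self.cookies = cookies or {}
        self.max_age = max_age
        self.cache = {}  # path -> (time stored, body)

    def get(self, path, clock):
        if path in self.cache:
            stored_at, body = self.cache[path]
            if clock - stored_at < self.max_age:
                return body  # fresh cached copy, the server is not contacted
        body = serve(path, self.cookies, clock)
        self.cache[path] = (clock, body)
        return body


crawler = CachingClient()                            # no cookies, like a new crawler
browser = CachingClient(cookies={"session": "abc"})  # a signed-in reader

crawler_copy = crawler.get("/video/1", clock=0)
browser_copy = browser.get("/video/1", clock=0)
cached_copy = crawler.get("/video/1", clock=30)  # still fresh, served from cache
later_copy = crawler.get("/video/1", clock=90)   # cache expired, fetched again
```

The same URL yields one body for the crawler and another for the signed-in browser, and the crawler's own copy changes once its cache expires. Which of these is "the" resource to be archived is exactly what retreats from view.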




[Figure: Berners-Lee (1990)]


As we saw with Jack, there is no need to enter into a conversation with a website owner to start archiving web content. When the content is on the web an archivist can start the archiving software, give it a URL, configure the crawling behavior (how far, how long, how much, etc.) and let it do its work. The decision of what to crawl is detached from the relationships that have traditionally guided appraisal. But like a phantom limb, Jack still felt the significance of these connections between the archive and the content creators for doing archival work. He wanted to establish them, even if they were not technically necessary. The links of relationships between people have effectively been replaced by hypertext links that provide discoverability and access.
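The three questions an archivist answers when configuring a crawl (how far, how long, how much) can be written down as a handful of settings. The sketch below is hypothetical: the `CrawlScope` fields and the `should_stop` helper are illustrative names of my own and do not correspond to the options of any particular archiving tool.

```python
from dataclasses import dataclass


@dataclass
class CrawlScope:
    """Hypothetical crawl settings; the field names are illustrative only."""
    seed: str
    max_hops: int        # how far: links followed away from the seed
    max_seconds: float   # how long: wall-clock budget for the crawl
    max_bytes: int       # how much: total data to collect


def should_stop(scope, hops, elapsed_seconds, bytes_collected):
    """Return the first limit that has been reached, or None to keep going."""
    if hops > scope.max_hops:
        return "how far"
    if elapsed_seconds >= scope.max_seconds:
        return "how long"
    if bytes_collected >= scope.max_bytes:
        return "how much"
    return None


scope = CrawlScope(seed="https://example.org/", max_hops=2,
                   max_seconds=3600, max_bytes=5_000_000_000)
```

Nothing in these settings records who was consulted or why the seed was chosen. The appraisal decision is compressed into a URL and a few numbers.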

In many ways what this analysis seems to point to is an evolving practice of web archiving where traditional concepts of appraisal are being unbracketed from one context and reapplied in another. Focusing on the objects, be they paper files, boxes, or representations of HTTP transactions, is less at issue than the practices that involve those objects, and their network of interactions. This shift in attention recalls the work of ethnographer and philosopher Annemarie Mol, whose work studying the treatment of atherosclerosis highlights the importance of practice:

It is possible to refrain from understanding objects as the central focus of different people’s perspectives. It is possible to understand them instead as things manipulated in practices. If we do this–if instead of bracketing the practices in which objects are handled we foreground them–this has far reaching effects. Reality multiplies. (Mol, 2002, p. 5)

The web archive is situated among these multiple record realities, which involve the creators of records, the preservers of records and the users of records.

But to return to the question I started with: what does all this tell us about how content is appraised for web archives, and about the historiography of the web? I think these brief examples highlight just how important it is to maintain the manifold of relationships between record creators and the archive. Appraisal, as it is embodied in the practices of archivists and encoded into software tools, is a social enterprise that shapes the historical record. Just as the infrastructure of the web enables communication across great geographic distances, it simultaneously moves to obscure the relationship between the archive and the archived. Further research is needed to discover practices that help bridge this gap and make it legible, while allowing for new conceptions of appraisal to develop and be translated.

If you’re a scholar who uses archives of web content I encourage you to reach out to the archivists you know, and to work with them to help build these practices and ensure that they are collecting the things you value. If you work as part of an organization and want to ensure that your web content is being collected and archived, try reaching out to an archivist to let them know of your interest. And of course if you are an archivist, and you are stymied by thinking about archiving web content, there are good reasons for that. The web is a big place, and it’s hard to know what to collect. Focusing on the relationships you have with the communities you document can help make it more manageable and meaningful.

References

Barth, A. (2011). HTTP state management mechanism (No. 6265). Internet Engineering Task Force. Retrieved from https://tools.ietf.org/html/rfc6265

Berners-Lee, T. (1990). Information management: A proposal. CERN. Retrieved from https://www.w3.org/History/1989/proposal.html

Brügger, N. (2012a). Web historiography and internet studies: Challenges and perspectives. New Media & Society.

Brügger, N. (2012b). When the present web is later the past: Web historiography, digital history, and internet studies. Historical Social Research/Historische Sozialforschung, 102–117.

Fielding, R., Nottingham, M., & Reschke, J. (2014). Hypertext transfer protocol (HTTP/1.1): Caching (No. 7234). Internet Engineering Task Force. Retrieved from https://tools.ietf.org/html/rfc7234

Gee, J. (2015). Social linguistics and literacies: Ideology in discourses (5th ed.). Routledge.

Gee, J. P. (2014). How to do discourse analysis: A toolkit. Routledge.

Graham, S., & Thrift, N. (2007). Out of order: Understanding repair and maintenance. Theory, Culture & Society, 24(3), 1–25.

Jackson, S. J. (2014). Rethinking repair. In P. Boczkowski & K. Foot (Eds.), Media technologies: Essays on communication, materiality and society. MIT Press. Retrieved from http://sjackson.infosci.cornell.edu/RethinkingRepairPROOFS(reduced)Aug2013.pdf

Kelty, C. M. (2008). Two bits: The cultural significance of free software. Duke University Press. Retrieved from http://twobits.net/

Ketelaar, E. (2001). Tacit narratives: The meanings of archives. Archival Science, 1(2), 131–141.

Maemura, E., Becker, C., & Milligan, I. (2016). Understanding computational web archives research methods using research objects. In IEEE big data: Computational archival science. IEEE.

Masanès, J. (2006). Web archiving methods and approaches: A comparative study. Library Trends, 54(1), 72–90.

Mol, A. (2002). The body multiple: Ontology in medical practice. Duke University Press.

Star, S. L. (1999). The ethnography of infrastructure. American Behavioral Scientist, 43(3), 377–391.

Suchman, L. (1995). Making work visible. Communications of the ACM, 38(9), 56–64.

Summers, E., & Punzalan, R. (2017). Bots, seeds and people: Web archives as infrastructure. In Proceedings of the 2017 ACM conference on computer supported cooperative work and social computing (pp. 821–834). New York, NY, USA: ACM. http://doi.org/10.1145/2998181.2998345

Trouillot, M.-R. (1995). Silencing the past: Power and the production of history. Beacon Press.