Behind the Scenes: Adventures in Digitization

Editor's Note: We asked one of our digitization specialists in North Carolina to share something about the daily practice of scanning historical documents. This is her response:

The cyberpunk regalia I put on every morning in my mind as I get ready for work.

I am one of those children who grew up in the ‘90s with a computer-geek brother, myself dabbling with his Frankensteinian cast-offs. I can actually remember a time when there was no ubiquitous internet--although I also remember X-Files usenet groups. I coded Geocities web pages from scratch (dedicated to Harry Potter novels, of course) and my brother navigated me, by way of a text-based user interface, to some mysterious server that had “Hakuna Matata” in every language imaginable. I had a rockin’ MySpace page and remember the stink when Facebook let high schoolers join. High schoolers! The end is nigh! I am assuredly part of the digital generation--a digital native, if you will--but in some ways I feel like a hipster grandma shouting at the privileged children to get off her lawn. My life exists as a bridge between an era of information scarcity and our current data explosion. So, in that way, it’s as if I’ve been groomed for digitization my entire life. Who better to sit at the juncture between the physical archive of brick and mortar and the archive of instantly accessible bits and bytes than someone who was raised across that transition?

There’s something truly magical to me about making a document in one place available to hundreds of people half a world away--and there is plenty of material to capture! The Library of Congress alone boasts over 158 million items in its various collections, with over 540 miles of shelf space--that’s 100 miles longer than the entire length of Tennessee. (Interestingly, there’s a common trope of using “Libraries of Congress” as a unit of data measurement.) By contrast, the LoC’s web archive as of March 2014 had collected about 525 terabytes of data, adding 5TB a month. But at some point everything will be digitized. Available. Right when you want it. Thinking of the books, papers, and ephemera of the world as somehow exhaustible in terms of digitization is a little mind-blowing if you’ve ever walked into the claustrophobic stacks of an overstuffed manuscript archive.

An illustration of Moore's Law: A 500GB hard drive from 2007 and a 1TB hard drive from 2010. In 2014, there are 500GB hard drives available that are roughly the size of a deck of cards.

A 1997 estimate of all “traditional” data in the world clocks in at “a few thousand petabytes.” (For reference, a petabyte is 1,048,576 gigabytes. So around 2,098 500GB hard drives for each petabyte. With some generous math we get 1,049,000 500GB hard drives for all “pre-digital era” data. If a 500GB hard drive is roughly the size of a deck of playing cards, that many would cover a football field 116,084 times, resulting in a stack that’s 6.8 miles high.) One 2013 article claims that “90% of the world’s data was generated over 2011-2012.” By 2007, “all but 6% of the world’s data had been preserved digitally.” Every day, 2.5 quintillion bytes of data is generated (2560 petabytes--so 5,368,710 of the 500GB hard drives per day). A lot of that, of course, is a record of every single “Like” on Facebook, but the point is: the information is there. It’s recorded. We’ve culturally conditioned ourselves to expect information to be available at our slightest whim. I know I’m not the only person who runs to Wikipedia at the slightest lapse of memory. (We’re all cyborgs. CYBORGS.) By contrast, the human brain has been conservatively estimated to have a storage capacity of 100 million megabytes--around 196 of our friendly 500GB hard drives. Yet another estimate suggests it might hold 2.5 petabytes (5,243 of our 500GB friends). (To see what any of that might have looked like in the 1950s, watch the wonderful Captain America: The Winter Soldier.)
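
For anyone who wants to check the hard-drive math above, here is a minimal Python sketch. It assumes the same conventions the figures rely on: a petabyte counted as 1,048,576 gigabytes, 1,024 megabytes to the gigabyte, and rounding up to whole drives. The names `drives_needed`, `GB_PER_PB`, and `DRIVE_GB` are made up for illustration and don't come from any library.

```python
import math

GB_PER_PB = 1024 ** 2   # 1,048,576 gigabytes per petabyte (binary convention)
DRIVE_GB = 500          # capacity of one of our "friendly" hard drives


def drives_needed(gigabytes):
    """Return how many whole 500GB drives it takes to hold this much data."""
    return math.ceil(gigabytes / DRIVE_GB)


# One petabyte
per_petabyte = drives_needed(GB_PER_PB)

# One day of new data, taking 2.5 quintillion bytes as 2560 petabytes
per_day = drives_needed(2560 * GB_PER_PB)

# The two brain estimates: 100 million megabytes and 2.5 petabytes
brain_low = drives_needed(100_000_000 / 1024)
brain_high = drives_needed(2.5 * GB_PER_PB)

print(per_petabyte, per_day, brain_low, brain_high)
```

Under those assumptions the script reproduces the 2,098, 5,368,710, 196, and 5,243 drive counts quoted above. Counting in strict powers of ten instead (1 petabyte = 1,000,000 gigabytes) would shift every number a little.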

I'd always presumed anyone could do this kind of work, but I’ve come to realize that, in fact, it takes a rare kind of person. Someone who is very good at high-volume monotony. Someone who can pay attention for a long period of time through intense boredom. Someone who can turn their mind off and on at will to make sure everything is done right, but also to keep themselves from going crazy. It’s a bit like assembly line work except you’re the only person doing it. In that respect, it also helps if you enjoy being on your own 99% of the time.

What I actually do at work. (Photo by Mark Davidson.)

There are various types of digitization technology--flatbed scanners (what you probably think of when you think of a scanner), sheetfed scanners (put a stack of paper on the tray and it zips it through the machine, voila!), PhaseOne cameras, BetterLight scanning backs, specialized book scanners, plain old digital cameras, even strange homemade rigs for unique projects. All of them have their strengths and weaknesses. Flatbed scanners, for example, can capture extremely high resolution images, but can take on the order of 5-20 minutes for each image depending on just how intensely you pump up the PPI.

The rig I work with for Luna Imaging is a digital camera levelled and mounted over a flat surface. The image capture is so fast that I’ve often hypnotized myself into cycles of perpetual motion, feeling like Charlie Chaplin in the famous scene from Modern Times where he gets consumed by a giant machine. The balance between precision and psychotic break is very finely tuned in this line of work. It sounds all noble and adventurous in my head, but the reality of it is a bit more like turning your body into a robot while keeping your mind active.

There are also really cool aspects of the job. Flipping through every single page of an entire archival collection opens a little window into the past (time travel!). My current project is mostly judges’ papers from racial desegregation cases starting in the 1950s, with some of them following controversies all the way into the 1980s. A lot of the material is boring court briefings. Some of it is absolutely riveting. Here’s a short list of funny, chilling, interesting things I’ve come across.

● You could mail a letter in 1970 for six cents.

● Before anonymous internet comments, people liked to tear up newspapers, annotate them with rude things and scary drawings (like nooses and minstrel faces), and mail them to judges to express their displeasure.

● I found a December 1970 issue of Playboy in one of these boxes--presumably because it had an article in it about desegregation. You never know.

● One particularly angry citizen liked to decorate every corner of his letters and envelopes with crayon renditions of the Confederate flag. Just to hammer home his point.

● When your racist opinions are unfounded, be sure to draw your own illustrations and submit them as proof.

● Sending death threats to a judge through the mail is frighteningly common with respect to desegregation cases.

All in all, digitization is a fascinating undertaking and there’s plenty of material to keep us busy. My science fiction notions aside, what could open more possibilities than making material available for research to people anywhere in the world? To me, that accessibility--regardless of time or location--is a pretty sci-fi notion in itself, and that’s part of what keeps me excited about digitization.