Digital Collections – Keeping Current

In the area of digital collections, things change fast. I’m working on a list of resources that I have found to be useful, including organizations (many of which I belong to), places to go for more information (email lists, websites, blogs, and so on), and other links that I can expand as time goes on. I’d welcome additional thoughts from others in the community. How do you keep up with this dynamic field?

Personal Digital Archiving

On Friday, October 18, 2013, Lauren Goodley from Texas State University and I did a presentation about Personal Digital Archiving for the Texas Library Association District 3 Annual Meeting. More accurately, Lauren did a presentation about Personal Digital Archiving, based on materials from the Library of Congress’ DPOE program, and I followed up with some resources about hosting Personal Digital Archiving Day events. The handouts and other resources are available at https://www.dcplumer.com/resources/handouts/personal-digital-archiving-resources/.

Pulling this presentation together was interesting. Lauren had done a similar presentation for me before, as part of the Connecting to Collections Caring for Digital Materials webinar series, but the focus there was more on institutional digital archives. We were hampered in our plans to put together material on Personal Digital Archiving by the fact that the federal government shutdown of 2013 meant that most of the Library of Congress resources were not available. I spent a lot of time tracking down copies of their handouts and videos to include in the resources list. Naturally, given the amount of work I had to do to find the alternate versions, the website was back up and running before our presentation. I’ve included links for both the versions on the Library of Congress website and the alternate versions, just in case.

A few things of particular interest that I discovered while pulling all of this together:

I hope that more organizations will publish versions of their resources in other languages; Digital Preservation Europe is one of the better resources, but their funding ended in 2009, and some of their materials are already a bit dated. The complete list of DPE briefing papers is available at http://www.digitalpreservationeurope.eu/publications/briefs/.

I also signed up for an “If This, Then That” (IFTTT) account for the first time. At first, it did make me a little uneasy to authorize IFTTT to use the various social media channels on which I have accounts (Twitter, Facebook, LinkedIn, Pinterest, Tumblr — they support many more). However, the ability to save tweets, posts, pins, and what have you to Dropbox or Google Drive or other accounts, where I can then create a personal (i.e., non-cloud) backup, is very nice. Only time will tell how secure the IFTTT service is, and you do have to wonder who else may be accessing your data. I do recommend that you read their privacy policy (https://ifttt.com/privacy) before you decide to test out any of their recipes (see http://www.marcus-povey.co.uk/2013/08/01/reconsidering-ifttt-in-the-light-of-snowden/ for more musings on this).
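Once IFTTT is dropping copies of your posts into a Dropbox-synced folder, the non-cloud backup step can be as simple as a dated copy to a local drive. Here is a minimal sketch in Python; the folder names are placeholders for wherever your own Dropbox and backup directories live, not anything IFTTT itself dictates:

```python
# A minimal sketch: copy whatever IFTTT has saved into a Dropbox-synced
# folder to a dated local archive. Both paths are placeholders.
import shutil
from datetime import date
from pathlib import Path

SOURCE = Path.home() / "Dropbox" / "IFTTT"   # where the recipes save to
DEST = Path.home() / "Backups" / f"ifttt-{date.today().isoformat()}"

shutil.copytree(SOURCE, DEST)  # fails if today's archive already exists
print(f"Archived {sum(1 for _ in DEST.rglob('*'))} items to {DEST}")
```

Running something like this on a schedule gives you point-in-time snapshots that don’t depend on any cloud service staying in business.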

Crowdsourced Georeferencing

I’m always intrigued by the power of “the crowd” and curious about what sorts of projects get widespread crowd support. Maps seem to be high on the list — see projects like OpenStreetMap and Google’s North Korea mapping project. For historic maps, the Georeferencer Project from the British Library seems to be getting a lot of support: “The last time the British Library undertook such a project the response from the public was remarkable, with 708 maps completed in less than one week” (“Crowdsourcing the Past”, via InfoDocket).

For the curious, georeferencing is necessary because not only is the world not flat, it’s not perfectly round. Most maps and images of terrestrial features are flat, however, and figuring out how to make them round again is an interesting problem. My brother did a neat project over a year ago in which he wrote some code to allow people to create three-dimensional planets using images processed for cylindrical projection for NOAA’s Science on a Sphere project. Ross’ project took those projections and converted them to printable cut-and-fold images that you can assemble into icosahedra (20-sided objects), close enough to spheres that you get the idea. In other words, you go from:

Moon (cylindrical)

To:

Moon (icosahedral projection)

Ross already had corrected cylindrical projections to work with. Had he had only individual images of small areas of the planets instead, the task would have been much more difficult, as the images would have needed to be georeferenced with respect to the surface of the planet.
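What makes the cylindrical starting point so tractable is that an equirectangular image maps latitude and longitude to pixels linearly. Here is a rough Python sketch of that lookup (the function name and image size are mine for illustration, not from Ross’ actual code):

```python
# A minimal sketch of the core lookup when re-projecting from an
# equirectangular (cylindrical) planet image: latitude and longitude
# map linearly to pixel coordinates. Image dimensions are illustrative.
def equirect_pixel(lat_deg, lon_deg, width=2048, height=1024):
    """Return (x, y) pixel coordinates for a latitude in [-90, 90]
    and a longitude in [-180, 180)."""
    x = (lon_deg + 180.0) / 360.0 * (width - 1)
    y = (90.0 - lat_deg) / 180.0 * (height - 1)
    return int(round(x)), int(round(y))

print(equirect_pixel(0, 0))      # near the image center
print(equirect_pixel(90, -180))  # north pole at the date line: (0, 0)
```

With a lookup that simple, drawing each icosahedron face is a matter of sampling the right pixels; georeferencing is what you need when no such neat mapping exists.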

For maps of Earth, georeferencing generally requires the use of known latitudes and longitudes for reference points. These reference points are often listed in gazetteers, such as GNIS. Modern maps tend to include information about the coordinate system and projection used, which makes georeferencing them more straightforward: input one reference point (often listed in the map legend) and the projection, and you’re good to go. For older maps, you need to find multiple reference points to correct for the type of projection used, which you may or may not know (and even if you know it in theory, it might not be correct in the actual map you’re working with). Note: This overview is highly simplified; if you want to know more, I have listed some resources below.
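To make the “multiple reference points” idea concrete, here is a rough sketch of fitting a simple affine pixel-to-world transform from control points by least squares. The coordinates below are invented for illustration; real workflows (GDAL, QGIS, or Georeferencer itself) also handle projections and higher-order warps:

```python
# A sketch of affine georeferencing from ground control points.
# The sample coordinates are invented; real tools handle much more.
import numpy as np

# Pixel positions of known features on the scanned map...
pixel = np.array([[120, 85], [940, 110], [150, 760], [900, 720]], dtype=float)
# ...and their known longitude/latitude (e.g., looked up in GNIS).
world = np.array([[-97.95, 30.25], [-97.60, 30.24],
                  [-97.94, 29.98], [-97.62, 30.00]])

# Solve world = [x, y, 1] @ coeffs in the least-squares sense:
# a 6-parameter affine transform (2 columns of 3 coefficients each).
design = np.hstack([pixel, np.ones((len(pixel), 1))])
coeffs, *_ = np.linalg.lstsq(design, world, rcond=None)

def pixel_to_world(x, y):
    """Estimate the lon/lat for a pixel coordinate on the scan."""
    return np.array([x, y, 1.0]) @ coeffs

print(pixel_to_world(500, 400))  # rough lon/lat near the map's center
```

With a modern map, one reference point plus the stated projection pins this down; an older map needs the extra points both to fit the transform and to expose where the projection (or the draftsmanship) drifts.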

Digitization projects usually use either GIS or Google Earth KML for georeferencing. If what you mostly want is a pretty display for end users, Google Earth integration is generally faster and cheaper. If you want a system for researchers, then GIS tends to be preferred (though GIS data can be converted to and extracted from KML with various tools). The British Library’s project uses the Georeferencer product from Klokan Technologies, also available in an open-source version. Users compare two maps, clicking on reference features first in a map that has already been georeferenced and then in the historic digitized map. When enough common features have been identified, the map is saved and a 3D projection is available in Google Earth.
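For the Google Earth side, the simplest possible integration is a KML GroundOverlay that pins an image to a bounding box. This hand-rolled example uses a placeholder image name and placeholder coordinates; a map warped with anything beyond a simple rectangular fit would need tiling (e.g., a SuperOverlay) instead:

```python
# A minimal sketch of a KML GroundOverlay for a georeferenced map scan.
# The image file and LatLonBox values are placeholders.
kml = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <GroundOverlay>
    <name>Historic map overlay (example)</name>
    <Icon><href>historic_map.png</href></Icon>
    <LatLonBox>
      <north>30.25</north><south>29.98</south>
      <east>-97.60</east><west>-97.95</west>
    </LatLonBox>
  </GroundOverlay>
</kml>
"""
with open("overlay.kml", "w") as f:
    f.write(kml)
```

Opening overlay.kml in Google Earth drapes the image over the terrain at those coordinates, which is roughly the effect the Georeferencer’s Google Earth view produces from the user-supplied control points.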

Resources


Vatican Library Digitization Project

Christ with the Four Evangelists, MS Pal. lat. 5, 48v

The Vatican Apostolic Library has uploaded the first 256 manuscripts from its ambitious digitization project, which is expected to cover 80,000 manuscripts in total (from Rome Reports via InfoDocket).

Aside from the “using technology developed by NASA” line in the report, there isn’t a lot of information available about how this project is being done. A 2010 press release suggests that they’re using either a Metis Systems scanner or a 50-megapixel Hasselblad camera for the actual imaging and that the NASA technology in question is the FITS image format (which is used by astronomers but by few others). For the online presentation, it looks like they’re using DWork, the Heidelberg Digitization Workflow system (which is probably why most of the rollover text is in German). I like the “scrolling view”, which reminds me of browsing through a reel of microfilm, although I could not figure out how to get out of this view when I tried. Right-clicking on a page image in scrolling view brings up some metadata, though it’s fairly limited (and in German, once again).
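Out of curiosity, here is what poking at a FITS file looks like with astropy, the Python library most astronomers use (the filename here is a placeholder, not one of the Vatican’s files). The format pairs the raw image array with simple keyword/value header metadata, which is presumably part of its appeal for archival imaging:

```python
# A sketch of inspecting a FITS file with astropy; the filename is a
# placeholder. FITS bundles image data with keyword/value header cards.
from astropy.io import fits

with fits.open("manuscript_scan.fits") as hdul:
    hdul.info()              # list the header/data units (HDUs)
    header = hdul[0].header  # metadata as FITS keyword cards
    image = hdul[0].data     # the image itself, as a numpy array
    print(header.get("DATE"), None if image is None else image.shape)
```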

I couldn’t resist using an image; here’s one of Christ with the four evangelists, from Novum Testamentum; Liber Psalmorum, MS Pal. lat. 5, 48v.


ETAOIN SRHLDCU

Peter Norvig has published a paper called “English Letter Frequency Counts: Mayzner Revisited” in which he uses the Google English language corpus to update results from Mark Mayzner’s research into word and letter frequencies back in the 1960s. The title of this post is the second subtitle of the paper and gives the order in which letters most often appear in English. According to Norvig, “Note there is a standard order of frequency used by typesetters, ETAOIN SHRDLU, that is slightly violated here: L, R, and C have all moved up one rank, giving us the less mnemonic ETAOIN SRHLDCU.”
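For a toy illustration of how such a ranking is produced (over a few words rather than the Google corpus):

```python
# Count letters in a tiny sample and print them most-frequent-first.
# Norvig's ranking comes from the same idea run over Google's corpus.
from collections import Counter

text = "The quick brown fox jumps over the lazy dog, twice over."
counts = Counter(c for c in text.upper() if c.isalpha())
print("".join(letter for letter, _ in counts.most_common()))
```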

In his paper, Norvig updates Mayzner’s work. In a comment, he notes that “The major difference [between his work and Mayzner’s] is that when he [Mayzner] reports a count of 0, you can’t tell if that means 1 in 100 thousand, or 1 in 100 billion. With my results, you can differentiate these cases pretty well. The ngrams with high counts (like the top 50 bigrams) remain fairly consistent.”

To correct for bad OCR (although he doesn’t say that explicitly), Norvig notes that “I discarded any entry that used a character other than the 26 letters A-Z. I also discarded any word with fewer than 100,000 mentions.” Apparently, a few non-English words slipped in anyway, but this is a good rule of thumb to remember when using the Google English language corpus.
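The rule of thumb is easy to apply yourself. Here is a sketch against a few invented word/count rows in the style of the corpus data, keeping only pure A–Z words with at least 100,000 mentions:

```python
# A sketch of Norvig's cleaning rule over invented word-count rows:
# drop anything with a character outside A-Z, and anything rare.
import re

ROWS = [("the", 23135851162), ("café", 150000), ("zyzzyva", 4210),
        ("don't", 9000000), ("data", 380000000)]

def keep(word, count):
    return re.fullmatch(r"[A-Za-z]+", word) and count >= 100_000

print([(w, c) for w, c in ROWS if keep(w, c)])
# [('the', 23135851162), ('data', 380000000)]
```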

There are a few things that I’d love to have the time to dig into a bit more. For example, see this chart, showing the frequency at which letters occur at particular positions in words:

Letter frequencies by position within words, from Norvig, Peter. 2012. “English Letter Frequency Counts: Mayzner Revisited.” Accessed 1/9/2013 at http://norvig.com/mayzner.html

Looking at first letters, “x”, “q”, and “z” have non-zero probabilities of occurring in that position, as one would expect. The relative frequencies are not quite what I’d expect, though, and a deeper dive into the data would be interesting. Is this a bad OCR effect that the data cleaning didn’t eliminate? Or is something else going on?

All in all, an interesting paper, and a fun exercise. For the librarians out there, Mayzner’s original research paper was published as:

Mayzner, M. S., & Tresselt, M. E. “Tables of single-letter and digram frequency counts for various word-length and letter-position combinations.” Psychonomic Monograph Supplements 1 (1965): 13–32.
