Google’s N-Gram Viewer Update

Over on the Language Log blog, I saw a post by Ben Zimmer about “A New Chapter for Google NGrams,” which inspired me to check out the updates (also documented in a post by Jon Orwant on the Google Research blog).

Since my husband has been spending some spare time memorizing and documenting Racine’s “La mort d’Hippolyte” from Phèdre, I thought I’d look at the results for that. For starters, I used the phrases “La mort d’Hippolyte, Hippolytus, Phaedra,” hoping to get a sense of the relative mentions of the works by Racine, Euripides, and Seneca in the French corpus from 1800-2000.

Search in Google’s NGram Viewer for “La mort d’Hippolyte, Hippolytus, Phaedra,” date range 1800-2000, corpus French, on 2012-10-18.

Although it’s not clear from the thumbnail, there is an interesting spike for the phrase “La mort d’Hippolyte” in the years 1824-1830. To see this more clearly, I restricted the search to 1800-1840.

Search in Google’s NGram Viewer for “La mort d’Hippolyte, Hippolytus, Phaedra,” date range 1800-1840, corpus French, on 2012-10-18.

I then looked at the Google Books results for “La mort d’Hippolyte” in French for 1820-1830. Most of the results concerned a poisoning case that must have been famous in its time: the murders of the brothers Hippolyte and Auguste Ballet by the physician Edme-Samuel Castaing in 1822 and 1823, respectively. There’s an English Wikipedia page about Castaing, which notes that this is thought to be the first instance of murder by morphine.

The new NGram viewer has advanced search options, documented at http://books.google.com/ngrams/info. I’m experimenting with some of these to try to refine my results. In this example, I tried searching for Phedre, Hippolyte, and the combination of Phedre + Hippolyte, which gives me a clearer impression of the number of times Hippolyte is mentioned relative to Phedre (since the total numbers are very small).

Search in Google’s NGram Viewer for “Phedre,Hippolyte,(Hippolyte + Phedre)” date range 1800-1840, corpus French, on 2012-10-18.

The effect is smaller than I’d hoped, and it doesn’t really give me a way to home in on searches where Phedre is mentioned in combination with Hippolyte; the plus operator is just a mathematical combination of the two results, which could be useful for synonyms and alternate phrasings. I will continue to experiment.
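As far as I can tell, the “+” composition simply sums the per-year relative frequencies of the two ngrams. A minimal sketch of that arithmetic (the frequency values below are invented for illustration, not real corpus data):

```python
# Sketch of the Ngram Viewer's "+" composition: it sums the per-year
# relative frequencies of two ngram series, year by year.
# The frequency values below are made up for illustration.

def compose_sum(series_a, series_b):
    """Return a year -> frequency dict summing two ngram series."""
    years = set(series_a) | set(series_b)
    return {y: series_a.get(y, 0.0) + series_b.get(y, 0.0) for y in years}

phedre = {1822: 1.2e-7, 1823: 1.5e-7, 1824: 2.0e-7}
hippolyte = {1823: 0.5e-7, 1824: 0.8e-7, 1825: 0.3e-7}

combined = compose_sum(phedre, hippolyte)
# combined[1824] is approximately 2.8e-7 -- the plotted
# "(Hippolyte + Phedre)" value for that year
```

This is why the combined line can’t identify co-occurrence: it only adds two independent counts together.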

Omeka Plugins: Solr, BagIt, FedoraCommons

Omeka keeps getting better and better. The Scholars’ Lab at the University of Virginia Libraries has released three new plugins for Omeka, which is a project of the Roy Rosenzweig Center for History and New Media at George Mason University. Thanks to Charles Bailey at Digital Koans for the news (which got updated while I was writing this post!).

The first announcement is about the SolrSearch plugin, which replaces Omeka’s default search with Solr. It’s been out in beta for a while, so I’m happy to see this one at release 1.0. According to the announcement, SolrSearch indexes Simple Pages and Exhibits. I’ll have to try installing it to see whether it also searches Collections and/or Items (based on the available documentation, it appears that it does) and to see how it integrates text and images in the results.
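Under the hood, searching a Solr index comes down to an HTTP request against Solr’s select handler. A hypothetical sketch of what such a request might look like — the host, core name (“omeka”), and default parameters here are my assumptions, not taken from the plugin’s code:

```python
# Hypothetical sketch of a search against a Solr index at the HTTP level.
# The base URL and core name ("omeka") are assumptions for illustration.
from urllib.parse import urlencode

def build_solr_query(base_url, core, text, rows=10):
    """Build a URL for Solr's /select search handler."""
    params = urlencode({
        "q": text,      # the user's search terms
        "wt": "json",   # ask Solr for a JSON response
        "rows": rows,   # page size
    })
    return f"{base_url}/solr/{core}/select?{params}"

url = build_solr_query("http://localhost:8983", "omeka", "Hippolyte")
# -> http://localhost:8983/solr/omeka/select?q=Hippolyte&wt=json&rows=10
```

The plugin presumably handles indexing and query construction itself; this just shows the shape of the underlying Solr API.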

The second announcement is about a BagIt plugin, which “allows users to (a) generate and export Bags containing files on the site and (b) import Bags and make their contents available through the Dropbox interface.” I’ll store that one for future reference.
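For those unfamiliar with BagIt: a “bag” is just a directory with a declaration file, a data/ payload directory, and a checksum manifest for fixity checking. A bare-bones, stdlib-only sketch of that structure (the real plugin does much more; this only shows what a bag looks like on disk):

```python
# Minimal sketch of the BagIt packaging structure: a bag declaration,
# a data/ payload directory, and a sha256 manifest for fixity checking.
import hashlib
import os

def make_bag(bag_dir, files):
    """Create a bare-bones bag. `files` maps filenames to bytes contents."""
    data_dir = os.path.join(bag_dir, "data")
    os.makedirs(data_dir, exist_ok=True)

    # Required bag declaration
    with open(os.path.join(bag_dir, "bagit.txt"), "w", encoding="utf-8") as f:
        f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")

    # Payload files plus a checksum manifest
    manifest_lines = []
    for name, content in files.items():
        with open(os.path.join(data_dir, name), "wb") as f:
            f.write(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")

    with open(os.path.join(bag_dir, "manifest-sha256.txt"), "w",
              encoding="utf-8") as f:
        f.write("\n".join(manifest_lines) + "\n")
```

Because the manifest records a checksum for every payload file, a receiver can verify that nothing was corrupted or lost in transfer — which is the whole point of exchanging content as bags.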

The third announcement is about a FedoraConnector plugin, which allows Omeka to pull in information from a FedoraCommons repository. This will provide a way for people to keep their content controlled: use FedoraCommons to provide the repository, complete with preservation workflows, and present the content via Omeka in a more user-friendly format. Very nice.
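Fedora 3.x exposes its objects over a REST API, so pulling content into a front end amounts to fetching datastream URLs. A hypothetical sketch — the host, PID, and datastream ID here are invented for illustration, and I haven’t checked how the plugin itself constructs its requests:

```python
# Hypothetical sketch of fetching content from a Fedora 3.x repository:
# each object (identified by a PID) exposes named datastreams over REST.
# The host, PID ("demo:5"), and datastream ID ("DC") are invented.

def datastream_url(base_url, pid, dsid):
    """URL for the content of one datastream of a Fedora 3.x object."""
    return f"{base_url}/objects/{pid}/datastreams/{dsid}/content"

url = datastream_url("http://localhost:8080/fedora", "demo:5", "DC")
# -> http://localhost:8080/fedora/objects/demo:5/datastreams/DC/content
```

This division of labor is what makes the pairing attractive: Fedora worries about identifiers, versions, and preservation, while Omeka just dereferences URLs for display.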

Best Practices for TEI in Libraries (v3)

The TEI Special Interest Group on Libraries has released version three of the Best Practices for TEI in Libraries: A Guide for Mass Digitization, Automated Workflows, and Promotion of Interoperability with XML Using the TEI. The introduction specifies five levels of practice:

There are many different library text digitization projects, serving a variety of purposes. With this in mind, these Best Practices are meant to be as inclusive as possible by specifying five encoding levels. These levels are meant to allow for a range of practice, from wholly automated text creation and encoding, to encoding that requires expert content knowledge, analysis, and editing. The encoding levels are not strictly cumulative: while higher levels tend to build upon lower levels by including more elements, higher levels are not supersets because some elements used at lower levels are not used at higher levels—often because more specific elements replace generic elements.

One of my disappointments with my time working on the Texas Heritage Online program was that so few libraries in Texas used TEI at any level. Most of the text digitization was at Level 1, at best (which is to say that there was no markup at all, just searchable text behind page images).

Digital Humanities Text Analysis Tools

Lisa Spiro has a great post on her blog showcasing various Digital Humanities resources and tools. Among the resources I hadn’t been aware of are the Text Analysis Developer’s Alliance (TADA) and TaPoR (Text Analysis Portal for Research). Note that the portal is being redesigned and the URL may change in the future. Since I have a feeling that my next job will involve a lot of text analysis, this was a fantastic find. I also really like the DiRT list of text analysis tools (again, this site is due to be redesigned and URLs may change). Thank you, Lisa!

Media Preservation (and Digitization)

This was an interesting announcement posted to the Archives & Archivists list. I’ve been interested in digitization and preservation (two separate things, though often conflated) of audio-visual materials for a couple of years now, though I haven’t had many opportunities to practice it.

Indiana University Bloomington announces the release of a detailed report entitled “Meeting the Challenge of Media Preservation: Strategies and Solutions.” This 128-page report is available for download at http://www.indiana.edu/~medpres/

“Meeting the Challenge” is the result of a year of research and planning by a campus-wide task force charged with addressing the problems identified in the earlier IU Bloomington media preservation survey report published in 2009. “Meeting the Challenge” explores a range of topics related to the preservation and conservation of audio, video, and film, including: guiding preservation principles, facility planning, prioritization, digitization methodologies, strategies for film, principles for access, technological infrastructure needs, and engagement with campus units and priorities. Although developed specifically for the Bloomington campus, the findings and analyses in “Meeting the Challenge” may be useful to universities and other organizations with media holdings.

While conversion to a digital format is a very good way to preserve the intellectual content of audio-visual material, and is thus a “preservation strategy,” I am still concerned about losing the artefactual value of the original items. The report addresses physical storage for film materials, acknowledging that preservation issues there can largely be addressed through appropriate storage, but for the majority of the media types it considers only digitization. I understand why: digitization meets the preservation goal, in large part, and also makes access easier for most users. But I’d still prefer an approach that includes both strategies for all materials.
