“Data Envy” at MLA 2016 – Diane K. Jakacki

This is the transcript of the short paper I gave as part of the “Digital Scholarship in Action: Research” panel at MLA 2016 in January. The attendant PowerPoint is stored and indexed on the MLA Commons Open Repository Exchange, and is available here: https://commons.mla.org/deposits/item/mla:667/

“Data Envy: Or, maintaining one’s self-confidence as a digital humanist at a time when everyone seems to be talking about … Big Data”

SELF-CONSCIOUS: Perhaps I’m being overly self-conscious, but lately I’ve felt increasingly out of the loop in terms of DH discourse – namely because I don’t do big data. Or at least I don’t think I do. And I observe that discussions about DH invariably involve topic modeling and pattern recognition and linked data and large-scale data visualization and “bags of words”.

TOO FEW NOTES: I work with lots of words, but I don’t think I have a bag-full. In fact, David Hoover once told me I can’t do proper text analysis – even with all of Shakespeare’s plays. To flip Emperor Joseph II on his ear: too few notes.

SMALL DATA? My text mining involves spelunking into what I think is semantic inline markup and precise linked data – primarily in my edition of Henry VIII for the Internet Shakespeare Editions and associated content in other ISE editions as well as the Map of Early Modern London.

CONFLICTED DATA: What I want to discuss here is the work we’re beginning to do with the Records of Early English Drama – in fact, this week we’re having an intensive work session to figure out what we really mean when we talk about turning REED into a DH resource.

WHAT IS REED? REED is in its fourth decade, and is a remarkable and unparalleled resource for theatre and performance historians. Rather than focusing on play texts it has from its inception focused on archival documents that somehow relate to performance. It has published 23 collections in some 34 volumes – over 17,000 pages of references dating from 1100-1650. That doesn’t take into account thousands of incremental pages of editorial, bibliographical, and linguistic content produced by dozens of theatre scholars. These collections are organized by county or by city or by category: there is a Dorset collection, a York collection, an Ecclesiastical London collection, etc.

WHAT IS REED ONLINE? Since 2004 REED has experimented with ways in which discrete aspects of its record set might be presented online. First with Patrons and Performances, and more recently with Early Modern London Theatres, REED has looked for ways to share bibliographical information with researchers. For years we’ve tried to find ways to extend that online presence, and in effect transport REED from a print to an online resource. Our objective has been to ensure that REED will continue to increase its value to scholars and students who are trying to work more dynamically with information that even with the best indexical practice is untenable across collections. At the same time, REED editors have been experimenting with born-digital projects that extend and augment the printed work that has already been completed.

BIG CORPUS ≠ BIG DATA: With 17,000 pages of records spanning over six hundred years, this would seem to constitute a big data project and I should feel better about myself. I can play with the cool kids. But I think REED is not big data. It’s a big corpus … but if you look at the records with which we work, there are few commonalities beyond their relationship to some form of performance. Records range from payments for performance to costume lists to legal ordinances and lawsuits to contracts … In these two cases (both from the Bristol collection) you see an Ordinance of the Common Council in 1596 imposing a fine on any mayor who allows players to perform in the guildhall – and a payment of twenty shillings made to the Queen’s Men in 1585 NOT to perform.

This fall I was able to hand-scrape 500 records of performance for the decade 1580-90, and the data available to me was exactly this: performance troupe, location, record (not necessarily performance) date, place, payment. I finally got a data dump of the Patrons and Performances Drupal site (not easy) and ended up with 2,600 records covering 1100-1650. That’s a partial dataset (London is not yet included), but I don’t think it is going to get too much bigger.
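
To make the shape of that hand-scraped data concrete, here is a minimal sketch of how one such record might be modelled and filtered by decade. It is not REED’s schema or the Patrons and Performances data model: the class name, field names, the pence conversion, and the second sample row are all assumptions made for illustration. Only the first sample row draws on a record mentioned above (the 1585 Bristol payment of twenty shillings to the Queen’s Men).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PerformanceRecord:
    """One scraped record; all field names are hypothetical."""
    troupe: str
    location: str
    record_year: int                      # year of the record, not necessarily of a performance
    place: Optional[str] = None           # left empty when the source does not give it
    payment_pence: Optional[int] = None   # payments normalised to pence (an assumption)


def shillings_to_pence(shillings: int) -> int:
    """Convert pre-decimal shillings to pence (12 pence to the shilling)."""
    return shillings * 12


def records_between(records: List[PerformanceRecord],
                    first_year: int, last_year: int) -> List[PerformanceRecord]:
    """Return the records whose record year falls in the inclusive range."""
    return [r for r in records if first_year <= r.record_year <= last_year]


SAMPLE = [
    # Based on the Bristol payment described in the talk.
    PerformanceRecord("Queen's Men", "Bristol", 1585,
                      payment_pence=shillings_to_pence(20)),
    # Placeholder row, invented purely to exercise the filter.
    PerformanceRecord("Hypothetical Troupe", "Hypothetical Town", 1602),
]

subset = records_between(SAMPLE, 1580, 1590)
```

Even this toy version forces the kind of decision the rest of the talk is about: whether “date” means the record or the performance, and what to do when a field is simply absent.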

WHAT DO WE DO? How do we resolve the conflict between small data and big data? Perhaps more worrisome: how do we efficiently make the legacy print data (much of which will have to be re-digitized and is not OCR-able) line up with born-digital data?

We have to micro-manage our data – we have to be precise, intentional, and deliberate because the decisions we make now will not be easy to overhaul once we get further into the process.

RETROFITTING / ANTICIPATING: We have to stabilize our data before we can release it into the wild. Stabilizing it means we actually need to go back and deconstruct REED’s editorial process. We have to respect editorial, bibliographical and indexical processes that have served as best practice for 40 years while customizing a schema that will encompass all of the disparate types of data. We actually need to figure out what constitutes *a* record and determine how to associate all of the existing editorial content from both the legacy-print editions and the born-digital projects. We also need to decouple the existing databases (P&P, EMLoT) from their current publication platforms. Drupal is not our friend.

ANTICIPATION: We need to do this so we can “publish forward,” integrating the legacy-print and born-digital content into an open-access online environment. That means (I believe) enabling data customization and open export for a wider array of medieval and early modern scholars (e.g. those interested in law). By doing so we will be able to create moments for digital interlocution – if not fully linking data – across projects and resources.

PROVOCATION:

So that leads to the question I would pose to my fellow panelists and the audience when we get to discussion: is this kind of small data valuable in its own right?
Or, in light of how DH is trying to find ways to integrate and associate and aggregate, is REED only now valuable because of its relationship to larger and more “traditionally” considered big data sources?