REED and the Prospect of Networked Data at CSRS 2016

This is the transcript of a long paper I gave as part of the “Digital Scholarship in Action: Research” panel at CSRS (Canadian Society for Renaissance Studies) in Calgary on May 30, 2016. The attendant PowerPoint is stored and indexed on the MLA Commons Open Repository Exchange, and is available here: http://dx.doi.org/10.17613/M6CK59

“REED and the Prospect of Networked Data”

At the MLA in January I gave a short paper entitled “Data Envy” – a contemplation of my inferiority complex with regard to scholars who have massive corpora to work with – Moretti-sized data. I reflected on the fact that the type of research with which I’m usually involved relies on close reading of texts and maps – and the most I’ve been able to work with is 2,500 records. I’ll get back to that in a moment, but I’d just like to say that I ended that short talk with a provocation – one that I’d like to use as the jumping-off point for this paper: in today’s DH environment, where big data and linked data are increasingly the focus of scholars looking for ways to extend their research questions through more expansive and complementary datasets, what is the role of the individual research project? Is its value now truly in its integration and association and aggregation with other datasets?

As with the MLA paper I’ll focus my comments today on the Records of Early English Drama, although my association with the Internet Shakespeare Editions and the Map of Early Modern London project also informs my participation in collaborative, pan-project initiatives such as the Early Modern Social Network partnership and Linked Early Modern Drama Online. But I would like to extend my reflection to consider intra-operability as well as interoperability. Because while REED might not constitute what in some circles is called “big data”, it is a dense and complicated corpus – or, perhaps I should say, subject-driven corpora.

REED is in its fourth decade, and is a remarkable and unparalleled resource for theatre and performance historians. Rather than focusing on play texts and playwrights, from its inception it has focused on gathering and publishing archival documents that bear some relation to performance. The REED editorial team has published 23 collections in some 34 volumes – by my count over 17,000 pages of references dating from 1100-1650. That doesn’t take into account thousands of incremental pages of editorial, bibliographical, and linguistic content produced by dozens of theatre scholars, or the records in the pipeline awaiting publication. These collections are organized by county or by city or by category: there is a Dorset collection, a York collection, an Ecclesiastical London collection, etc.

Since 2003 REED has explored ways in which discrete aspects of its record set might be presented online. First with Patrons and Performances, more recently with the Early Modern London Theatres database and its associated “How to Track a Bear in Southwark” site, and soon with Staffordshire, the first born-digital collection, REED has looked for ways to share bibliographical information and more with theatre historians. That experimentation has proved prescient, as the expense of purchasing the ever-expanding set of “red soldiers” – as the print volumes are called – is proving beyond the budget of an increasing number of research libraries. Currently the only alternative to purchasing the volumes (at least some of them) is to access them through the Internet Archive, which, while necessary, is not the most efficient way to use REED’s resources.

For years many of us associated with REED have tried to find ways to extend that online presence, and in effect transport REED from a print to a completely online resource. This task should be more … manageable (?) because REED has always been used as the basis for scholarly research – these are not editions in the way that ISE publishes digital editions of play texts. But I hesitate to call the process manageable because of the density and complexity of the records, their associated data and metadata, as well as the caliber and rigor of REED’s editorial process, all of which we are committed to maintaining as we shift from the print to the digital.

As with any long-range, large-scale research project, REED has been intentional with its approach to data gathering and organization. I confess to not having been through the intense experience that REED field editors have – of working through 14th century guildhall receipts, or 15th century parish registers, or 16th century assizes transcripts, negotiating the paleography, tracking the sources, the owners, and the scribes, describing the condition of the materials. But I understand that while there are established standards and processes, the editors’ experience in working through the sources must be reflective of and responsive to the scope and focus of these documents. So as we look at how to manage the transition to a digital resource we need to take these things into account:

Ensure that our interpretation of what constitutes a record remains inclusive of source types rather than presumptuous that one size fits all.
Maintain the connection between archival record, metadata, and the original editorial decisions made about the source material.
Establish data structures that align born-digital with legacy-print records and related material.
Articulate a “data management plan”, for want of a better term, that focuses on the data rather than the delivery – in other words, to resist the temptation to “publish” in a traditional sense.

All of this means that it is time to shift from an experimental to an intentional mode. I think, although I expect that some would disagree with me, that we need to go back to the beginning in order for us to move ahead.

We need to come to an agreement about what constitutes a “record” in a way that accommodates the various sources and types of records.

To some of us these are individual records – a specific item – and while its complexity and length may vary from item to item it refers to one transaction, act, or event.

To others of us this is a record – a set of items compiled from one source, such as a household account ledger, a register book covering a particular period in a parish’s history, the business of producing a series of mystery plays in a particular city, etc. Neither is right or wrong (although I lean toward the more granular item level because I think it allows for more nimble and dynamic uses of the data).

Regardless of which we choose we need to commit to that decision for all “records” and at the same time articulate a very structured metadata schema, as well as a process for relating specific editorial, bibliographical, and linguistic work.

This is of particular importance right now to the legacy print volumes, which include significant editorial materials that should remain available in dynamic ways, both in line with the records and in connection with other “templated” material [such as …]

We need to stabilize all of the records and related materials so that they will be transferable as data repositories and encoding languages evolve. Some of us are also interested in the possibility of remixing – if a scholar wants to access a set of records focusing on the 1580s, or Robin Hood plays, or troupes that performed only in the North and West Ridings, she should be able to access that set.
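The kind of remixing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RecordItem` fields, the `remix` function, and the sample items are all hypothetical placeholders and not REED's actual schema or data. It shows how item-level records that carry their own metadata could be pulled into sets by date range, topic, or place, independent of any delivery system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RecordItem:
    """One item-level record: a single transaction, act, or event.

    Hypothetical schema for illustration only.
    """
    item_id: str      # stable identifier, independent of any CMS
    collection: str   # the county, city, or category collection
    year: int         # a real schema would need date ranges and uncertainty
    place: str
    topics: frozenset = field(default_factory=frozenset)


def remix(items, start_year=None, end_year=None, topic=None, places=None):
    """Return the items that match every criterion that was supplied."""
    selected = []
    for item in items:
        if start_year is not None and item.year < start_year:
            continue
        if end_year is not None and item.year > end_year:
            continue
        if topic is not None and topic not in item.topics:
            continue
        if places is not None and item.place not in places:
            continue
        selected.append(item)
    return selected


# Placeholder items, invented purely for illustration (not real REED data).
ITEMS = [
    RecordItem("item-0001", "Collection A", 1583, "Place X", frozenset({"touring"})),
    RecordItem("item-0002", "Collection A", 1602, "Place X", frozenset({"robin-hood"})),
    RecordItem("item-0003", "Collection B", 1587, "Place Y", frozenset({"robin-hood", "touring"})),
]

eighties = remix(ITEMS, start_year=1580, end_year=1589)
robin_hood = remix(ITEMS, topic="robin-hood")
print([item.item_id for item in eighties])
print([item.item_id for item in robin_hood])
```

The point of the sketch is the design choice rather than the code: if each item carries its own date, place, and topic metadata, any set a scholar asks for is a filter over the pool rather than a new publication.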

We must find ways to store and share this data outside of particular content management systems. I have particular experience with this.

The Patrons and Performances website constitutes a subset of records that document a performance transaction between a patron (through his or her patronized troupe), an event (via contract or one-time payment), at a location (on a particular date or over a series of dates). This information was parsed from print REED volumes, and is designed to drive the user back to a particular page in a particular volume for the full record. About two years ago the website was rebuilt in Drupal, which accommodates search and some ancillary content such as photographs of performance sites. But in this case, for efficiency’s sake, the University of Toronto built one database within one Drupal instance to store not only REED data, but also that associated with CanWest, xx, and xx. So REED data (patron name/ID, location name/ID, date(s), payment, page number in REED volume) is all stored within the same tables as the data of those other projects that are supported by the UT libraries.

I discovered this last fall when I asked my REED colleagues for access to data focused on touring troupes in the 1580s. I was sent a zipped file of flat files sucked out of Drupal. The data required a lot of cleaning and probably an unnecessary amount of time on my part, but I ended up with a joined table of 2,600 touring data entries (not full records) covering 1100-1650, and was able to use that in my DH methods course for some straightforward data and spatial analysis. While frustrating, this was something of a victory because I was able to demonstrate to the REED board how valuable the release of clean raw data could be for pedagogical as well as research purposes. The welding of the Patrons data to a particular content management system, and the excision of what is actually metadata from the record itself, showed them how we are barring rather than offering access to scholars.
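To make concrete what joining flat files into a single table involves, here is a minimal Python sketch using only the standard library. It is not the actual Drupal export or its structure: the file contents, column names (`patron_id`, `location_id`, `volume_page`, and so on), and rows are invented placeholders, and the helper functions are hypothetical. It joins event rows to patron and location rows by ID, sets aside rows whose references cannot be resolved, and then filters for the 1580s.

```python
import csv
import io

# Stand-ins for exported flat files. Columns and rows are invented placeholders.
PATRONS_CSV = """patron_id,patron_name
p1,Patron One
p2,Patron Two
"""
LOCATIONS_CSV = """location_id,location_name
l1,Place X
l2,Place Y
"""
EVENTS_CSV = """event_id,patron_id,location_id,year,volume_page
e1,p1,l1,1583,101
e2,p2,l1,1587,102
e3,p1,l2,1602,103
e4,p9,l2,1585,104
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def index_by(rows, key):
    """Build a lookup from the value in column `key` to its row."""
    return {row[key]: row for row in rows}


def join_events(events, patrons, locations):
    """Join events to patrons and locations; report rows with dangling IDs."""
    patron_by_id = index_by(patrons, "patron_id")
    location_by_id = index_by(locations, "location_id")
    joined, orphans = [], []
    for ev in events:
        patron = patron_by_id.get(ev["patron_id"])
        location = location_by_id.get(ev["location_id"])
        if patron is None or location is None:
            orphans.append(ev["event_id"])  # needs cleaning by hand
            continue
        joined.append({
            "event_id": ev["event_id"],
            "patron_name": patron["patron_name"],
            "location_name": location["location_name"],
            "year": int(ev["year"]),
            "volume_page": ev["volume_page"],
        })
    return joined, orphans


joined, orphans = join_events(
    read_rows(EVENTS_CSV), read_rows(PATRONS_CSV), read_rows(LOCATIONS_CSV)
)
touring_1580s = [row for row in joined if 1580 <= row["year"] <= 1589]
print(len(joined), orphans, [row["event_id"] for row in touring_1580s])
```

Rows like the hypothetical `e4`, which points at a patron ID that does not exist in the export, are the sort of thing that makes cleaning time-consuming when data has to be extracted from a shared system after the fact.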

I discovered this again this spring when we began to discuss the future of EMLoT (Early Modern London Theatres). EMLoT is currently a database of references to pre-1642 London theatre in publications printed between 1642 and __. It doesn’t include the references themselves so much as citations for the references. This is the only REED content that I know of that focuses on evidence outside of REED’s defined scope. But it could be brought in line with the increasing amount of data being compiled about performances and performance spaces in London and vicinity. An enhanced EMLoT (I think we’d be at 3.0 by now) could finally offer a multi-faceted view of London performance – Civic, Ecclesiastical, Royal, Inns of Court, Purpose-built theatre, inns, taverns, etc.

However, EMLoT is currently published in a customized Django-based interface that is out of date, no longer supported, and in need of a new server home.

So we can either spend significant resources bringing EMLoT’s Django code up to date, or we can disassemble it, stabilize that data in line with the other REED records, and focus on REED as a deep data pool into which we can pour standardized record data and metadata and from which we can pull dynamically organized evidence sets.

We have to stabilize the data before we can release it into the wild. Stabilizing it means we actually must go back and deconstruct REED’s record-ing process – making that process more efficient at the same time that we respect editorial, bibliographical, and indexical processes that have served as best practice for 40 years. We have to make the legacy-print records and editorial content line up with born-digital files (many of the early printed texts will need to be re-digitized, because we no longer have pre-print document files and the Internet Archive’s PDFs are not OCR-able).

So what does all of this have to do with early modern social networks and data analysis? We at REED need to get our own data house in order before we can be serious about entering into larger, interlinking collaboratives and initiatives like EMSN or LEMDO. We have to micro-manage our data – we have to be precise, intentional, and deliberate because the decisions we make now will not be easy to overhaul once we are far into the process. We have to ensure that our data is intra-operable and that we know what we have to offer before we can begin to work with our colleagues to harvest prosopographical and toponymic data.

One of the first questions in my mind is how do we manage entities? I think in terms of semantic mark-up, so … who gets a ? Who gets an xml:id? I expect there would be a discussion about priorities: in terms of value for a prosopography, royalty, nobility, and aristocracy score highest on the personography scale. And of course we’re interested in personages related to performance. But what about ecclesiastical hierarchies? Is the parish priest important? What about the legal realm? Judges, almost certainly, but witnesses or claimants in a court case? I have no idea how many people are identified by name in REED’s 17,000 pages of records. Who is going to enter that information into an entity field? How specific should we be? The same goes for places. I assume that urban locations (and lat-longs where possible) have the highest value, but I want to track down the sites that no longer exist. I assume others do, too. I also want to break the county-prescribed spines of the red soldiers and seek out the liminal places between counties. In this case we need to think in terms of gaps in REED that will be filled by others so that our datasets, if not fully interoperable, at least offer moments for warping and weaving.
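One way to picture the entity question is as a registry that hands out one stable identifier per person or place, so that every record mentioning that entity can point at the same value. The Python sketch below is a hypothetical illustration, not REED practice: the `EntityRegistry` class, the `pers` and `place` prefixes, and the example names are all assumptions. It also makes the naive assumption that two mentions with the same normalized name are the same entity, which is precisely the judgment that would in reality need an editor.

```python
import re
import unicodedata


def slugify(name):
    """Reduce a name to lowercase ASCII letters and digits joined by underscores."""
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    slug = re.sub(r"[^A-Za-z0-9]+", "_", ascii_name).strip("_").lower()
    return slug or "unnamed"


class EntityRegistry:
    """Mint one stable identifier per (kind, normalized name) pair.

    Identifiers begin with a letter so they can serve as xml:id values.
    Matching on normalized names alone is a simplifying assumption.
    """

    def __init__(self):
        self._ids = {}      # (kind, slug) -> identifier already issued
        self._used = set()  # every identifier issued so far

    def id_for(self, kind, name):
        key = (kind, slugify(name))
        if key in self._ids:
            return self._ids[key]
        base = f"{kind}_{key[1]}"
        candidate, suffix = base, 2
        while candidate in self._used:  # guard against accidental collisions
            candidate = f"{base}_{suffix}"
            suffix += 1
        self._used.add(candidate)
        self._ids[key] = candidate
        return candidate


registry = EntityRegistry()
first = registry.id_for("pers", "Example Person")     # placeholder name
again = registry.id_for("pers", "  example  PERSON")  # variant spelling, same id
place = registry.id_for("place", "Example Place")     # placeholder name
print(first, again, place)
```

Even a registry this simple forces the questions raised above: someone has to decide which kinds of people and places are registered at all, and who resolves the cases where the same name belongs to two different entities.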

I am convinced that REED needs to follow this path, although I really have no concept of time or resources or who else shares my vision. But I do know that REED’s future lies in its reconstitution and its availability to scholars and students beyond those of us who focus on performance history.