Abstract
The process of assembling corpora for text-mining-based Digital Humanities projects is a crucial and yet frequently overlooked aspect of the research process. Because corpus assembly is often complicated by text availability and cost, as well as by legal restrictions on in-copyright text, DH scholars frequently resort to “found” corpora marketed to libraries by publishing companies or questionably sourced corpora that inhabit legal grey areas. While such corpora have led to methodological developments in the field, there is a general sense that the biases of these corpora and the inability to share their raw data have made them imperfect vehicles for large-scale critical claims in the humanities. Recent developments, however, suggest that this situation may be changing. In the United States, the 2021 text and data-mining exemption to the Digital Millennium Copyright Act (DMCA) has promised to improve the viability of bespoke corpora for text-mining research. In this paper, we put these improvements to the test, reporting on our efforts to source a relatively small corpus of literary theory monographs. Focusing primarily on born-digital works and operating under all of the practical and legal constraints dictated by the exemption to the DMCA, we sought to assemble a corpus of 402 pre-selected theoretical works. We found that, despite the recent legal changes, and even with extensive support from a well-resourced library, it remains overly difficult to assemble a pre-selected corpus of scholarly works, even under ideal financial and institutional conditions. While scholars outside of the United States will face somewhat different legal restrictions on the collection of electronic texts than we did, we found that many of the obstacles we faced were practical rather than regulatory, and that in many cases scanning books was the easiest and most efficient route to digital versions of the texts we sought.
| Original language | English |
|---|---|
| Journal | Digital Humanities Quarterly |
| Volume | 19 |
| Issue number | 3 |
| Publication status | Published - 2025 |
Bibliographical note
Publisher Copyright: © 2025, Alliance of Digital Humanities Organisations. All rights reserved.
Funding
This project is supported by a Public Knowledge grant from the Andrew W. Mellon Foundation. We are grateful to Julia Gershon and Sarah Sophie Schwarzhappel, undergraduate research assistants to the Stanford Literary Lab, for their aid in the collection and verification of some metadata, and to Casey Patterson, who consulted on the selection of Black studies texts. We are also tremendously grateful for the support of the Stanford University Library system and the help of all of its staff, without which it simply would not have been possible to do this project.