One of the most challenging aspects of digital work can be finding an appropriate data set. The data sets listed here come in many different formats, both structured and unstructured, numerical and textual.

Plain text

TEI/XML

Political Science Datasets:

Network Datasets:

GIS Datasets:

Resources from Professor Laura Nelson’s “Analyzing Complex Digitized Data” course

Demonstration Corpora, by Alan Liu, including:

  • U.S. Presidents’ Inaugural Speeches from the American Presidency Project
  • Abraham Lincoln Speeches and Letters assembled by Alan Liu
  • Sunday School Books in 19th Century America from the Michigan State University Libraries Text Collection
  • The Grange Visitor (Michigan newspaper) from the Michigan State University Libraries Text Collection
  • Historic American Cookbooks from the Michigan State University Libraries Text Collection
  • Adult British Fiction – 1880s (by gender)
  • Children’s Fiction – 1880s (by gender)
  • William Wordsworth writings assembled by Alan Liu
  • Book summaries and film summaries from Wikipedia
  • U.S. patents related to the humanities
  • List of sites containing full text books
    • Internet Archive Books – Includes plain-text access to books, issues of magazines, etc.
    • Oxford Text Archive – A large number of texts available in a variety of forms, including plain text; texts are accessed one at a time.

Springboard List of Free Datasets for Data Science

Corpora from Miriam Posner’s crowdsourced document:

Nicolas Iderhoff’s Collection of NLP Datasets – GitHub repository of free and public domain datasets with text data for use in Natural Language Processing. Primarily raw, unstructured texts.

Amanda Rust’s Subject Guide on “Text and Data Mining Library Databases” on the Northeastern University Library webpage.