One of the most challenging aspects of digital work can be finding an appropriate data set. The data sets listed here come in many different formats: structured and unstructured, numerical and textual.

Plain text

TEI/XML

Political Science Datasets:

Network Datasets:

GIS Datasets:

Resources from Professor Laura Nelson’s “Analyzing Complex Digitized Data” course

Demonstration Corpora, by Alan Liu, including:

  • U.S. Presidents’ Inaugural Speeches
  • Abraham Lincoln Speeches and Letters
  • Documenting the American South
    • The Church in the Black Community
    • First-Person Narratives of the American South (African Americans, women, enlisted men, Native Americans, ex-slaves, etc.)
    • North American Slave Narratives
  • Sunday School Books in 19th Century America
  • The Grange Visitor (Michigan newspaper)
  • Historic American Cookbooks
  • Adult British Fiction – 1880s (by gender)
  • Children’s Fiction – 1880s (by gender) (I have formatted some of these data; ask me if you would like them)
  • William Wordsworth writings
  • Book summaries and film summaries from Wikipedia
  • U.S. patents related to the humanities
  • List of sites containing full text books

Springboard List of Free Datasets for Data Science

Corpora from Miriam Posner’s crowdsourced document:

Nicolas Iderhoff’s Collection of NLP Datasets

Amanda Rust’s Subject Guide on “Text and Data Mining Library Databases” (Northeastern University Libraries)