One of the most challenging aspects of digital work can be finding an appropriate dataset. The datasets listed here come in many formats: structured and unstructured, numerical and textual.

Plain text

TEI/XML

Political Science Datasets:

Network Datasets:

GIS Datasets:

Resources from Professor Laura Nelson’s “Analyzing Complex Digitized Data” Course

Demonstration Corpora, by Alan Liu, including:

  • American Presidency Project – U.S. Presidents’ inaugural speeches, State of the Union addresses, campaign platforms, and other presidential text material.
  • Abraham Lincoln Speeches and Letters – Corpus assembled by Alan Liu (see website for .zip file download link and metadata).
  • Sunday School Books in 19th Century America – Michigan State University Library’s collection of Sunday school books published between 1809 and 1887.
  • Grange Visitor – Michigan State University Library’s collection of The Grange Visitor, the official newspaper of the Michigan State Grange, published between 1875 and 1896.
  • Feeding America – Michigan State University Library’s collection of historical American cookbooks spanning the late 18th century to early 20th century.
  • Adult British Fiction – Literature from the 1880s, sorted by author gender (see website for .zip file download link and metadata).
  • Children’s Fiction – Children’s literature from the 1880s, sorted by author gender (see website for .zip file download link and metadata).
  • Writings of William Wordsworth – Writings assembled by Alan Liu (see website for .zip file download link and metadata).
  • Book Summaries and Film Summaries from Wikipedia – Demo text assembled by David Bamman of the UC Berkeley School of Information (see website for .zip file download link and metadata).
  • U.S. Patents Related to the Humanities – Patents mentioning ‘humanities’ or ‘liberal arts’ between 1976 and 2015, located through the U.S. Patent and Trademark Office (see website for .zip file download link and metadata).
  • List of sites containing full text books
    • Internet Archive Books – Includes plain-text access to books, issues of magazines, etc.
    • Oxford Text Archive – A large number of texts available in a variety of forms, including plain text; texts are accessed one at a time.

Springboard List of Free Datasets for Data Science

  • United States Census Data – Statistics provided by the United States Census Bureau.
  • FBI Crime Data – Time series crime data reported by the FBI at national and jurisdictional levels.
  • CDC Cause of Death – Database of causes of death provided by the Centers for Disease Control and Prevention.
  • Medicare Hospital Quality – Database on hospital quality of care for hospitals across the United States.
  • SEER Cancer Incidence – Cancer data that can be sorted by gender, race, year, and other demographics.
  • Bureau of Labor Statistics – Key economic indicators for the United States, including unemployment and inflation, that can be segmented temporally or spatially.
  • Bureau of Economic Analysis – National and regional economic data including GDP and exchange rates.
  • IMF Data – International financial data from the International Monetary Fund.
  • Dow Jones Weekly Returns – Stock price weekly returns from the Dow Jones Industrial Average.
  • Enron Emails – Text data of emails from the fraudulent energy company Enron.
  • Data is Plural – A weekly newsletter of “useful/curious” datasets. This Google doc has the name and location of every dataset listed in the newsletter. The datasets cover everything from global foreign aid to Donkey Kong scores.

Corpora from Miriam Posner’s Crowdsourced Document:

Collection of NLP Datasets – GitHub repository by Nicolas Iderhoff of free and public domain datasets with text data for use in natural language processing. Primarily raw, unstructured texts.

Text and Data Mining – Subject guide by Amanda Rust on “Text and Data Mining Library Databases” from Northeastern University Library.