One of the most challenging aspects of digital work can be finding an appropriate data set. The data sets listed here come in many different formats, both structured and unstructured, numerical and textual.
- Project Gutenberg – A library of over 60,000 ebooks and texts, available as plain text and in other formats.
- Early Caribbean Digital Archive (ECDA) – An open access collection of pre-twentieth-century Caribbean texts, maps, and images.
- Women Writers Online – A full-text collection of early women’s writing in English, including full transcriptions of texts published between 1526 and 1850, focusing on materials that are rare or inaccessible.
- Eighteenth Century Collections Online–Text Creation Partnership (ECCO-TCP) – Searchable SGML/XML-encoded texts from among the 150,000 titles available in Gale’s Eighteenth Century Collections Online (also available as plain text).
- Early English Books Online–Text Creation Partnership (EEBO-TCP) – Nearly 1500 texts related to the themes of travel and navigation in the early modern world (also available as plain text).
- Documenting the American South – Collection of texts, images, and audio files related to Southern history, literature, and culture.
Political Science Datasets:
- Stockholm International Peace Research Institute (SIPRI) – Datasets on the arms-trade industry and military expenditures.
- Uppsala Conflict Data Program (UCDP) – Datasets on armed conflict that cover individual events of organized violence, geocoded down to the level of individual villages, with temporal duration down to single days.
- Correlates of War Project – Armed-conflict and related data on national militaries, disputes, alliances, and territorial change among other datasets.
- Global Terrorism Database (GTD) – Information on more than 190,000 terrorist attacks worldwide including date, location, weapons, target, casualties, and identifiable parties responsible.
- Global Peace Index – Dataset on peace ratings per country based on a variety of indicators including crime, militarism, arms industry, and conflict.
- Harvard’s Caselaw Access Project – Text data on written American caselaw broken down by state and federal jurisdictions.
- World Values Survey – Survey of over 100 countries using a common questionnaire, including time-series data on topics like economic development, democratization, religion, gender equality, and social capital.
- American National Election Studies (ANES) – Time series survey data on American national elections dating back to 1948.
- EM-DAT: The International Disaster Database – Data on the occurrence and effects of over 22,000 disasters in the world from 1900 to today, assembled from various sources including the UN, NGOs, the media, research institutions, and private industry.
- US Congress Bill Status XML Bulk Data – Data on the status of every bill in the United States Congress starting from the 113th Congress (2013-2015).
- LegiScan – Database of US State (and Washington DC) legislation including bill status.
- CountLove – Dataset of protests in the United States since 2017 broken down by state.
- Comparative Constitutions Project – Text database of global government constitutions.
- Also see: Constitute
- Systemic Peace’s global polity datasets – Annual, cross-national dataset of “patterns of authority” regime characteristics of global governments since 1800. Other datasets also available.
- Freedom House – Dataset of annual global reports on political rights and civil liberties.
- Quality of Government – Dataset constructed of over 2,000 variables on global government quality in policy areas like health, environment, social policy, and poverty.
- UN Human Development Index – United Nations’ dataset on country-level measures of health, education, and economics.
- World Bank open data – Multiple open datasets of World Bank statistics.
- SNAP Stanford Large Network Database Collection – Collection of multiple network datasets including from social media like Facebook, LiveJournal, and Twitter.
- UC Irvine Network Data Repository – Collection of network datasets used in previously published scholarly articles.
- Duke Network Analysis Center – Database of network dataset repositories.
- networkrepository.com – Database of network data with interactive data visualization and analytical tools.
- USGS GIS data – Source of mapping data provided by the United States Geological Survey.
- GIS shapefiles database – Repository of mapping data from a variety of sources and agencies.
- Boston Housing Data – Mapping data specific to Boston and its suburbs.
- United States Census Data – Geocoded data from the United States Census Bureau.
Resources from Professor Laura Nelson’s “Analyzing Complex Digitized Data” course
Demonstration Corpora, by Alan Liu, including:
- U.S. Presidents’ Inaugural Speeches from the American Presidency Project
- Abraham Lincoln Speeches and Letters assembled by Alan Liu
- Sunday School Books in 19th Century America from the Michigan State University Libraries Text Collection
- The Grange Visitor (Michigan newspaper) from the Michigan State University Libraries Text Collection
- Historic American Cookbooks from the Michigan State University Libraries Text Collection
- Adult British Fiction – 1880s (by gender)
- Children’s Fiction – 1880s (by gender)
- William Wordsworth writings assembled by Alan Liu
- Book summaries and film summaries from Wikipedia
- U.S. patents related to the humanities
- List of sites containing full text books
- United States Census Data – Statistics provided by the United States Census Bureau.
- FBI Crime Data – Time series crime data reported by the FBI at national and jurisdictional levels.
- CDC Cause of Death – Database of causes of death provided by the Centers for Disease Control and Prevention.
- Medicare Hospital Quality – Database on hospital quality of care for hospitals across the United States.
- SEER Cancer Incidence – Cancer data that can be sorted by gender, race, year, and other demographics.
- Bureau of Labor Statistics – Important economic indicators for the United States including unemployment and inflation that can be segmented temporally or spatially.
- Bureau of Economic Analysis – National and regional economic data including GDP and exchange rates.
- International financial data from the International Monetary Fund.
- Dow Jones Weekly Returns – Stock price weekly returns from the Dow Jones Industrial Average.
- The now-famous Enron Emails – Text data of emails from the fraudulent energy company Enron.
- Data is Plural’s Google Doc – Data is Plural is a weekly newsletter of “useful/curious” datasets. This Google Doc has the name and location of every dataset listed in the newsletter. The datasets cover everything from global foreign aid to Donkey Kong scores.
- HathiTrust – Database of 16 million volumes, mostly in English.
- Chronicling America – Database of 12.8 million pages of American newspapers.
- Perseus Digital Library – Large collection of classical texts, much of it encoded in TEI/XML.
- Old Bailey Online – Collection of 197,745 London criminal trials, 1674-1913.
- Canadian Hansard – Database of debates & journals of the Canadian Senate & House of Commons.
- Australian Hansard – Database of Australian Parliamentary debates, 1901-1980.
- UK Hansard – Database of UK Parliamentary debates.
- Open Islamicate Texts Initiative – 10,000 premodern Islamicate texts; see also its repositories.
- Transkribus Corpus and READ – Efforts to use computer vision to recognize handwriting.
- ToposText – Database of 557 classical texts linked with a gazetteer of the ancient world.
- BYU Corpora – Widely used corpora of American English.
- Wright American Fiction – Database of American adult fiction, 1774–1900.
- UCLA Broadcast NewsScape – Database of 170K hours of captioned news programs; see Red Hen Lab for information on access.
- Media History Digital Library – Database of nearly 2 million pages of media-related books and articles, 1875-1995.
- Christian Classics Ethereal Library – Database of classic Christian texts.
- NYT Annotated Corpus – Database of 1.8 million New York Times articles and New York Times-supplied metadata.
- Europeana Collections – Repository of many datasets from European libraries & archives, from papyri to photographs to newspapers.
- Foreign Records of the US – Database of a nearly complete run of Foreign Relations of the United States; see these tools to obtain full text.
- Internet Archive – Collection of websites, texts, audio, and other media, available for bulk download via wget.
- Twitter Datasets – A catalog of Twitter datasets that are publicly available on the web.
- BitCurator – Effort to develop tools to analyze features of digital texts.
- Movie Quotes Corpus – Database of 220,579 conversational exchanges between 10,292 pairs of movie characters.
- Europe PMC – Repository of life sciences books, articles, and preprints.
- Trove Australia – Database of 565 million documents collected by the National Library of Australia, including a sizable collection of newspapers.
- BNC-Baby – A 4-million-word subcorpus of the 100-million-word British National Corpus, with part-of-speech tagging in XML.
Nicolas Iderhoff’s Collection of NLP Datasets – GitHub repository of free and public domain datasets with text data for use in Natural Language Processing. Primarily raw, unstructured texts.
Amanda Rust’s Subject Guide on “Text and Data Mining Library Databases” on the Northeastern University Library website.