The text analysis resources here cover topics such as installing computer programming languages (like R and Python), running exploratory scripts of word tokenizations and counts, and more advanced approaches like topic modeling and word embedding models.
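As a rough illustration of what an exploratory script for word tokenization and counts might look like, here is a minimal Python sketch that uses only the standard library. The function names `tokenize` and `word_counts` and the sample sentence are hypothetical, chosen for illustration; the resources listed below cover more robust tokenizers.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as tokens.

    This is a deliberately naive tokenizer (an assumption for illustration);
    it ignores digits, hyphenation, and non-ASCII letters.
    """
    return re.findall(r"[a-z']+", text.lower())


def word_counts(text):
    """Return a Counter mapping each token to its frequency in the text."""
    return Counter(tokenize(text))


sample = "The cat sat on the mat. The mat was flat."
counts = word_counts(sample)

# Show the two most frequent tokens.
print(counts.most_common(2))  # [('the', 3), ('mat', 2)]
```

The same pattern extends to a whole file by reading its contents into a string before calling `word_counts`.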
Getting Started
- Where to Start – A guide on how to start text mining. Written by Ted Underwood of the University of Illinois, Urbana-Champaign.
- Text Analysis Introduction – Basic guide to introductory text analysis from “Tooling Up for Digital Humanities.”
- Natural Language Processing with Deep Learning – Material from Stanford University’s “Natural Language Processing with Deep Learning” course.
- Text Mining with R: A Tidy Approach – Textbook by Julia Silge and David Robinson.
- Computational Text Analysis for Social Science – Article on the use of text analysis methods in the social sciences by Brendan O’Connor, David Bamman, and Noah A. Smith of Carnegie Mellon University.
Python
- Python Programming for the Humanities – Interactive tutorial and introduction to Python programming for the humanities by Folgert Karsdorp.
- Python for Informatics – An applied but comprehensive introductory Python textbook by Charles Severance, with sections on text parsing.
- Download and Install Python – Website for downloading and installing Python.
- Download and Install PyCharm – Website for downloading and installing PyCharm, an Integrated Development Environment (IDE) for Python.
- Download and Install IPython – Website for downloading and installing IPython, an interactive shell for Python programming.
- The Hitchhiker’s Guide to Python – A textbook for both novice and expert developers by Kenneth Reitz and Tanya Schlusser.
- Python Text Analysis Tutorial – A list of tutorials on Python and text analysis assembled by Neal Caren of the University of North Carolina, Chapel Hill.
- Codecademy for Python – Online teaching resources for learning Python and other programming languages.
- Google Python Tutorial – Google’s free online Python programming class.
- Machine Learning with scikit-learn – Video tutorials for machine learning in Python with the scikit-learn package.
- Python NLP Course – Yandex Data School’s Natural Language Processing in Python Course.
- Web-Scraping Tutorial – Tutorial on web scraping with Python.
- Python Text Analysis Course – Material on GitHub for Laura K. Nelson’s text analysis course.
- Python for Everyone – Open source Python textbook with exercises.
- Practice Python – Beginner Python exercises that come with discussion topics.
R
- Text Analysis With R for Students of Literature – Matthew Lee Jockers’s introductory text, with a PDF available through the NEU Library.
- Download and Install R – Website for downloading and installing the R programming language.
- Download and Install RStudio – Website for downloading and installing RStudio, an Integrated Development Environment (IDE) for R.
- RSeek – A search tool for finding resources on R programming.
- Simple Data Types in R – Information on basic data types in the R programming language.
- Humanities Data Analysis Class – Material from Ben Schmidt’s graduate seminar on data analysis for the humanities.
- Managing and Manipulating Data in R – Material from a UCLA programming with R course.
- Text as Data R Course – Material on using textual data in R by Chris Bail of Duke University.
- A Light Introduction to Text Analysis in R – An introduction and overview of text analysis tools in R.
- Data Science Specialization Course – A broad course in R covering multiple data science techniques like statistical inference, machine learning, regression models, exploratory analysis, and others.
- Computational Statistics in R Course – Material from Northeastern professor Nick Beauchamp’s Computational Social Science course.
Topic Modeling
- Journal of Digital Humanities’s Special Issue – Special issue of JDH on topic modeling in the humanities, published in 2012.
- Topic Modeling: A Basic Introduction – Introductory article by Megan R. Brett from JDH’s special issue explaining the basic concepts of topic modeling.
- Words Alone – Article on Latent Dirichlet Allocation’s (LDA’s) limitations by Ben Schmidt.
- Topic Modeling Made Just Simple Enough – An introduction to topic modeling written by Ted Underwood of the University of Illinois, Urbana-Champaign.
- Guided Tour – A comprehensive guide to topic modeling with many links by Scott Weingart of Carnegie Mellon University.
- MALLET – Website for downloading and installing MALLET, an open-source, Java-based Latent Dirichlet Allocation (LDA) package.
- Topic Modeling Tutorial – Tutorial by Shawn Graham, Scott Weingart, and Ian Milligan on setting up a command line environment for using MALLET.
- Mallet R Package – Ben Schmidt’s R package wrapping MALLET.
- GUI Tools that use MALLET
- Google’s Topic Modeling Tool – A graphical user interface for doing topic modeling.
- Serendip – A system for visualizing topic models by Eric Alexander and Joe Kohlmann of the University of Wisconsin-Madison.
- Topic Modeling Toolbox – An alternative to MALLET for LDA topic modeling from Stanford University.
Word Embedding Models
- Women Writers Vector Toolkit – Interface for querying terms in word2vec models trained on the Women Writers Project corpus.
- Vector Space Models for the Digital Humanities – Blog post by Ben Schmidt which links to his R package wrapping word2vec (word2vec is written in C).
Other Text Analysis Tools and Resources
- Voyant Tools – A simple, yet powerful, web-based text analysis and visualization tool.
- Lexos – A tool for scrubbing, chunking, and tokenizing text, in addition to performing modest analysis and visualizing clusters.
- How to Create Topic Clouds with Lexos – Blog post by Scott Kleinman on using Lexos for topic modeling word clouds.
- AntConc – A GUI concordancing and text analysis toolkit created by Laurence Anthony.
- CasualConc – A Mac OS X-native toolkit (AntConc’s Mac version is ported from the PC and has some bugs).
- TextPlot – A Python package by David McClure that produces a force-directed network of the words in a text, the nodes of which are clustered using estimated kernel densities.
- (Mental)Maps of Text – Blog post explaining the concept of TextPlot.
- Textplot Refresh – Python 3, PyPi, CLI App – Blog post on downloading and setting up TextPlot.
- Literary MRIs (or, tuning Textplot) – Blog post on TextPlot’s parameters.
- Bookworm – A customizable corpus trend visualization tool.
- Word Tree – A tool that creates word trees from a block of text.
- Applied Text Analysis Course – Material from Justin Grimmer’s course on Applied Text Analysis for Social Scientists.
- Stanford CS224N: NLP with Deep Learning – Video lectures from the Winter 2019 offering of Stanford University’s “Natural Language Processing with Deep Learning” course.