On October 14-15, NULab faculty members Nick Beauchamp and David Smith co-hosted the seventh annual “New Directions in Analyzing Text as Data” conference. The premier cross-disciplinary gathering for discussing developments in text-as-data research, this two-day conference brought over 100 scholars to campus. Previous conferences have taken place at Harvard University, Northwestern University, the London School of Economics, and New York University. This is the second in a series of posts from the NULab community responding to the event. Its author, Stefan McCabe, is a NULab fellow pursuing a Ph.D. in Network Science at Northeastern University.

One of the cooler events at Text As Data was Brendan O’Connor’s presentation on phrasemachine, an R/Python library for noun phrase extraction. How to represent the features of a text corpus is an important decision. The most common approach is to strip word order and represent a document as a “bag of words.” This works well in many cases, but it means that noun phrases get decomposed into their component words. In some contexts that loss is undesirable, and the most common alternative, treating every sequence of adjacent words (an n-gram) as a feature, is unwieldy: it adds a lot of unnecessary, fragmentary phrases to the term-document matrix. The phrasemachine library instead uses part-of-speech tagging to extract only noun phrases, which are the phrases most likely to be of interest for analysis.
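To see why n-grams get unwieldy, here is a minimal sketch in pure Python (no phrasemachine required); the sentence is just an illustrative fragment I chose:

```python
from collections import Counter

text = "congress shall make no law respecting an establishment of religion"
tokens = text.split()

# Bag of words: word order is discarded, so multi-word concepts vanish.
bag = Counter(tokens)

# Bigrams: every adjacent pair becomes a feature, and most are fragmentary
# ("make no", "respecting an") rather than meaningful phrases.
bigrams = Counter(zip(tokens, tokens[1:]))

# Adding just bigrams nearly doubles the feature space for this sentence.
print(len(bag), len(bigrams))
```

Extending to trigrams and beyond compounds the problem, which is what makes a tagger-based noun phrase extractor attractive.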

I wanted to play around with it myself, so I spun up an IPython notebook (unfortunately, the library only supports Python 2):

from phrasemachine import phrasemachine  # noun phrase extraction
import requests                          # fetching the raw text
from bs4 import BeautifulSoup            # stripping any HTML from the response

I like working with Supreme Court opinions, which abound with noun phrases. Using unigrams can be somewhat problematic here because, for example, “first,” “amendment,” and “First Amendment” are all quite different concepts.

I decided to see how phrasemachine would handle the syllabus for one such opinion, Church of the Lukumi Babalu Aye v. City of Hialeah. To make things a little easier, I’ve already cleaned and isolated the first paragraph of the syllabus.

syllabus_page = requests.get('http://pastebin.com/raw/snYqmK63')
syllabus_soup = BeautifulSoup(syllabus_page.content, "html.parser")
syllabus = syllabus_soup.get_text()

Let’s look at all the phrases in this document. Note that the tagger I use, spacy, requires you to download about 1GB of models beforehand; if you don’t want to do that, you can use the more convenient nltk tagger instead.

phrasemachine.get_phrases(syllabus, tagger='spacy')
{'counts': Counter({u'42 u.s.c.': 1,
          u'absolute prohibition': 1,
          u'absolute prohibition on ritual sacrifice': 1,
          u'animal cruelty': 1,
          u'animal cruelty laws': 1,
          u'animal sacrifice': 1,
          u'animals for food': 1,
          u'carotid arteries': 1,
          u'city council': 1,
          u'city residents': 1,
          u'clause of the first': 1,
          u'clause of the first amendment': 1,
          u'compelling governmental interests': 1,
          u'court of appeals': 1,
          u'cruelty laws': 1,
          u'cruelty to animals': 1,
          u'death rites': 1,
          u'district court': 1,
          u'emergency public session': 1,
          u'enactments resolution': 1,
          u'exception to that prohibition': 1,
          u'exception to that prohibition for religious conduct': 1,
          u'exercise clause': 1,
          u'exercise clause of the first': 1,
          u'exercise clause of the first amendment': 1,
          u'first amendment': 1,
          u'florida animal': 1,
          u'florida animal cruelty': 1,
          u'florida animal cruelty laws': 1,
          u'food consumption': 1,
          u'foregoing ordinances': 1,
          u'forms of devotion': 1,
          u'free exercise': 1,
          u'free exercise clause': 1,
          u'free exercise clause of the first': 1,
          u'free exercise clause of the first amendment': 1,
          u'fulfillment of the governmental interest': 1,
          u'governmental interest': 1,
          u'governmental interests': 1,
          u'health risks': 1,
          u'house of worship': 1,
          u'inter alia': 1,
          u'killing of animals': 1,
          u'killing of animals for food': 1,
          u'killings for religious reasons': 1,
          u'land in respondent': 1,
          u'land in respondent city': 1,
          u'manner as ordinance': 1,
          u'narrow restrictions': 1,
          u'numbers of hogs': 1,
          u'other enactments': 1,
          u'other enactments resolution': 1,
          u'other facilities': 1,
          u'other things': 1,
          u'petitioner church': 1,
          u'primary purpose': 1,
          u'primary purpose of food': 1,
          u'primary purpose of food consumption': 1,
          u'principal forms': 1,
          u'principal forms of devotion': 1,
          u'prohibition for religious conduct': 1,
          u'prohibition on ritual sacrifice': 1,
          u'public health': 1,
          u'public health risks': 1,
          u'public morals': 1,
          u'public session': 1,
          u'purpose of food': 1,
          u'purpose of food consumption': 1,
          u'religious conduct': 1,
          u'religious practices': 1,
          u'religious reasons': 1,
          u'respondent city': 1,
          u'result of the santeria': 1,
          u'result of the santeria religion': 1,
          u'ritual sacrifice': 1,
          u'rituals except healing': 1,
          u'sacrifice of animals': 1,
          u'same manner': 1,
          u'same manner as ordinance': 1,
          u'santeria religion': 2,
          u'santeria rituals': 1,
          u'santeria rituals except healing': 1,
          u'secret nature': 1,
          u'small numbers': 1,
          u'small numbers of hogs': 1,
          u'state law': 1,
          u'such practices': 1,
          u'suit under 42 u.s.c.': 1,
          u'type of ritual': 1}),
 'num_tokens': 460}

Using just the defaults, we already see a number of phrases that we might be interested in adding to our term-document matrix (e.g., “Free Exercise Clause”), but also some we could probably do without, like “suit under 42 u.s.c.”
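One way to do that pruning, sketched here on a hand-copied slice of the output above (the digit-based filter rule is my own assumption, not anything phrasemachine provides):

```python
from collections import Counter

# A few phrases copied from the syllabus output above.
counts = Counter({
    'free exercise clause': 1,
    'first amendment': 1,
    'santeria religion': 2,
    'suit under 42 u.s.c.': 1,
})

# Drop phrases containing digits (statute and reporter citations)
# before adding the survivors to the term-document matrix.
keep = {phrase: n for phrase, n in counts.items()
        if not any(ch.isdigit() for ch in phrase)}

print(sorted(keep))
```

A real pipeline would likely add more rules (minimum counts, stopword-only phrases), but the principle is the same: filter the Counter before it touches your feature matrix.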

Now we can generate a list of the 20 most common noun phrases in the full opinion. (Note that as before, I’ve cleaned the opinion ahead of time to speed things up.) Focusing on the top 20 lets us see more easily which noun phrases might enhance our bag of words representation of the opinion.

opinion_page = requests.get('http://pastebin.com/raw/8bZQ9f6h')
opinion_soup = BeautifulSoup(opinion_page.content, "html.parser")
opinion = opinion_soup.get_text()
nps = phrasemachine.get_phrases(opinion, tagger='spacy')
nps['counts'].most_common(20)
[(u'free exercise', 17),
 (u'city council', 17),
 (u'animal sacrifice', 16),
 (u'free exercise clause', 15),
 (u'exercise clause', 15),
 (u'district court', 13),
 (u'first amendment', 12),
 (u'723 f.supp', 11),
 (u'religious practice', 10),
 (u'santeria sacrifice', 10),
 (u'religious conduct', 9),
 (u'governmental interests', 9),
 (u'state law', 8),
 (u'public health', 8),
 (u'v. smith', 7),
 (u'494 u.s.', 7),
 (u'resources of oregon v. smith', 7),
 (u'human resources of oregon v.', 7),
 (u'oregon v. smith', 7),
 (u'employment div', 7)]

We could do without some of the legal citations, but adding most of these noun phrases to our model would be helpful.
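To actually fold these phrases into a bag-of-words model, one common trick (my own sketch, not something phrasemachine does for you) is to rewrite each selected phrase as a single underscore-joined token before tokenizing:

```python
# Phrases taken from the top-20 list above; merge_phrases is a
# hypothetical helper for this illustration.
phrases = ['free exercise clause', 'free exercise', 'first amendment']

def merge_phrases(text, phrases):
    # Replace longer phrases first so 'free exercise clause'
    # is not clobbered by the shorter 'free exercise'.
    for p in sorted(phrases, key=len, reverse=True):
        text = text.replace(p, p.replace(' ', '_'))
    return text

doc = "the free exercise clause of the first amendment"
print(merge_phrases(doc, phrases).split())
```

After this rewrite, an ordinary unigram tokenizer treats “free_exercise_clause” as one feature, which is exactly the distinction the unigrams-only model was missing.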

The phrasemachine library can be found on GitHub; check it out!