Higher-level Natural Language Processing with textacy

01:45 PM - 02:10 PM on July 17, 2016, Room CR7

Burton DeWilde

Audience level: intermediate

Description

Although the Python community is already served by many open-source packages for NLP, few if any offer state-of-the-art performance; a holistic, higher-level feature set; and a user-friendly API. textacy was built to fill the gap. In this talk, I will demonstrate some of textacy's functionality by working through an end-to-end example of topic modeling.

Abstract

Interest in the field of Natural Language Processing (NLP) is high and rising, and the field itself is advancing rapidly. There are a number of Python packages for NLP, including the venerable NLTK and pattern libraries, the vector space model-oriented gensim and sklearn, and the multilingual polyglot. In particular, a recent addition to the ecosystem called spaCy (https://spacy.io/) offers state-of-the-art performance on foundational NLP tasks such as tokenization, part-of-speech tagging, dependency parsing, and named entity recognition. textacy (http://textacy.readthedocs.io/) builds upon the solid foundation provided by spaCy and others in order to perform higher-level NLP tasks, with an emphasis on text analysis of news articles.

A given NLP task typically involves far more than just applying an algorithm or model to one or more text documents, so textacy attempts to cover that wider range of functionality. It efficiently streams data to and from disk in several formats; provides functions for cleaning/normalizing raw text; links document content and metadata for better bookkeeping and richer document representations; transforms documents into bags-of-words, semantic networks, and term lists; extracts structured information such as acronym definitions, subject-verb-object triples, and quotations; applies models to parsed documents, including unsupervised keyterm extraction and LSA, LDA, and NMF topic models; and facilitates visualization and interpretation of analysis results. It also includes easy access to standard and novel corpora, from a stream of plaintext or structured Wikipedia articles to a collection of Congressional speeches by Bernie Sanders and Hillary Clinton.
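
To make a few of these steps concrete, here is a minimal sketch of normalizing raw text, keeping metadata alongside content, and building bags-of-words and a document-term matrix. It deliberately uses only the Python standard library and is not textacy's API: the names `normalize_text`, `bag_of_words`, and the toy `records` are hypothetical and chosen only for illustration.

```python
import re
import unicodedata
from collections import Counter


def normalize_text(raw):
    """Apply Unicode NFKC normalization, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


def bag_of_words(text):
    """Count tokens, where a token is a run of letters, digits, or apostrophes."""
    return Counter(re.findall(r"[a-z0-9']+", text))


# Hypothetical toy records: raw text stored together with its metadata.
records = [
    {"text": "Topic   models find THEMES in text.", "meta": {"id": 1}},
    {"text": "Text cleaning comes before topic models.", "meta": {"id": 2}},
]

# Link each document's metadata to its bag-of-words representation.
docs = [
    {"meta": r["meta"], "bow": bag_of_words(normalize_text(r["text"]))}
    for r in records
]

# Build a shared vocabulary, then one row of term counts per document.
vocab = sorted(set().union(*(d["bow"] for d in docs)))
doc_term_matrix = [[d["bow"][term] for term in vocab] for d in docs]
```

A matrix of this shape (documents by terms) is the usual input to topic models such as LSA, LDA, and NMF; a real pipeline would typically swap the regular-expression tokenizer for a proper parser such as spaCy.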

In this talk, I will demonstrate some of textacy's functionality by working through an end-to-end example of topic modeling.