Summarizing documents

02:15 PM - 03:10 PM on July 16, 2016, Room CR4

Mike Williams

Extractive summarization — finding the salient points in a document or corpus — is one of the most fundamental tasks in natural language processing. I’ll show you three ways to do it. One dates back to an IBM Journal article from 1958. One uses topic modeling, a technology from the 2000s. And one uses neural-network-derived language embeddings and long short-term memory (LSTM) networks — techniques that are only a couple of years old. I’ll explain the algorithms, show code and demos for all three, and discuss the engineering trade-offs.
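To make the oldest of the three approaches concrete: the 1958 method scores each sentence by the frequency of its significant words and extracts the highest-scoring ones. Here is a minimal sketch of that idea in Python; the function name, the tiny stopword list, and the scoring details are illustrative simplifications, not the talk's actual code.

```python
import re
from collections import Counter

# A deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are", "it", "that", "for", "on"}

def luhn_summarize(text, n_sentences=2):
    """Score sentences by the frequency of their significant words
    (a greatly simplified version of the 1958 idea) and keep the top n,
    in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if w not in STOPWORDS)

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        if not tokens:
            return 0.0
        # Average frequency of the sentence's non-stopword tokens.
        return sum(freq[t] for t in tokens if t in freq) / len(tokens)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in top]
```

Frequent content words mark the document's topic, so sentences dense in them tend to be the salient ones — which is why this simple heuristic remains a surprisingly strong baseline.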


This talk focuses on text summarization: taking some text in and returning a shorter document that contains the same information. We’ll cover both single-document and multi-document summarization. I’ll show ways to solve the summarization problem that range from extremely simple algorithms dating back to the 1950s to the latest recurrent neural networks. I’ll use libraries like scikit-learn and Keras. I’ll explain how to choose between these approaches, and I’ll show working prototype products for each.
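As a taste of the middle approach, topic-style extraction can be sketched with scikit-learn alone: embed the sentences with TF-IDF, project onto a single latent topic with truncated SVD (an LSA-style decomposition), and keep the sentences that load most strongly on it. This is a hedged sketch, not the talk's actual prototype; the function name and parameters are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def lsa_summarize(sentences, n_keep=2):
    """Score sentences by their loading on the first latent topic
    of a TF-IDF matrix and keep the top n, in document order."""
    X = TfidfVectorizer().fit_transform(sentences)
    # One latent component ~ the document's dominant topic.
    scores = TruncatedSVD(n_components=1).fit_transform(X).ravel()
    ranked = sorted(range(len(sentences)),
                    key=lambda i: abs(scores[i]), reverse=True)
    keep = sorted(ranked[:n_keep])  # restore original order
    return [sentences[i] for i in keep]
```

The same scaffold extends naturally to multi-document input: pool sentences from all documents before vectorizing, and the decomposition scores them against the shared topic space.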

This talk also uses summarization as a concrete context in which to introduce the latest research on language modeling and neural networks. While summarization is itself a practical and important task, the algorithms we’ll encounter are relevant for text simplification and translation, audio transcription, image captioning and semantic search — and almost any NLP problem involving the meaning of text.