As Stanislas explained in this previous post, we believe there is too much information on the Internet to keep up with. The natural solution is text summarization, and computers can help us there.
There are roughly two kinds of summaries: those made of a few of the most important sentences extracted verbatim from the text, called keyphrases, and those written from scratch without necessarily reusing the text's phrasing, called abstract summaries.
Finding the most relevant content: keyphrase extraction
The big advantage of keyphrase-based summaries is that there is no risk of miscomprehension, since they are composed of unrephrased pieces of the text. Moreover, computers can already produce them with decent results. There are two main ways to do it: supervised and unsupervised extraction.
When you know the context: extraction as supervised learning
One way to perform this task is described by Peter Turney in this paper. He found that using a custom-designed algorithm that incorporates specialized domain knowledge yields very good results: 80% of keyphrases are deemed acceptable by human readers. The problem with this approach is that it requires the computer to know the topic beforehand. It is possible to add a preliminary step to determine what the topic is, but this is difficult, and the results are terrible if it gets the topic wrong.
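To make the supervised framing concrete, here is a toy sketch (not Turney's actual system, which is far more sophisticated): each candidate word is turned into a feature vector, and a classifier trained on labeled documents decides whether it is a keyphrase. The features (frequency and position of first occurrence) are classic ones for this task, but the function names, the hand-rolled perceptron, and the single-word candidates are illustrative simplifications, not from the paper.

```python
# Toy supervised keyphrase extraction: featurize candidate words,
# then train a tiny perceptron on labeled (word, document) pairs.

def features(word, doc):
    """Two classic features: how often the word appears, and how
    early its first occurrence is (earlier tends to mean more important)."""
    words = doc.lower().split()
    freq = words.count(word) / len(words)
    first = words.index(word) / len(words) if word in words else 1.0
    return [1.0, freq, 1.0 - first]  # bias, frequency, earliness

def train(examples, epochs=100, lr=0.1):
    """examples: list of ((word, doc), label) pairs with label in {0, 1}."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for (word, doc), label in examples:
            x = features(word, doc)
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            # Perceptron update: nudge weights toward misclassified examples.
            for i in range(len(w)):
                w[i] += lr * (label - pred) * x[i]
    return w

def is_keyphrase(word, doc, w):
    x = features(word, doc)
    return sum(wi * xi for wi, xi in zip(w, x)) > 0

doc = "summarization is useful because summarization saves time for readers"
w = train([(("summarization", doc), 1), (("readers", doc), 0)])
```

After training, `is_keyphrase("summarization", doc, w)` holds while `is_keyphrase("readers", doc, w)` does not. The weakness described above shows up here too: the features that work for one domain may be useless in another.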
Applicable to any text: unsupervised keyphrase extraction
This approach enables us to extract keyphrases from a text without any prior knowledge, which is a tremendous improvement upon the previous method. There are several algorithms, and one notable current example is Summly. I will explain one of them, named TextRank, to give you a feel for the kind of techniques used in this field.
The TextRank algorithm was developed by Rada Mihalcea and Paul Tarau in this paper. The key idea is that a text unit (a word or phrase, depending on the need) that is close to many other important text units is likely a keyphrase. Drawing a parallel between the “close to” relation and the “referred to by” relation of hyperlinks, TextRank uses Google’s PageRank to determine which text units are the most important. It works well: its main author was even awarded a $100k grant by Google for her work on keyphrase extraction.
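The idea above can be sketched in a few lines of Python. This is a minimal illustration of the TextRank principle, not the authors' reference implementation: the tokenization is deliberately naive, and real versions filter candidates by part of speech before building the graph.

```python
# Minimal TextRank sketch: build a co-occurrence graph over words,
# then run the PageRank iteration to score them.

import re
from collections import defaultdict

def textrank_keywords(text, window=2, damping=0.85, iterations=50):
    words = re.findall(r"[a-z]+", text.lower())
    # Words appearing within `window` positions of each other are
    # "close to" each other, i.e. linked in the graph (undirected).
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    # PageRank iteration: a word is important if it is linked to
    # other important words.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        new = {}
        for w in neighbors:
            rank = sum(score[n] / len(neighbors[n]) for n in neighbors[w])
            new[w] = (1 - damping) + damping * rank
        score = new
    return sorted(score, key=score.get, reverse=True)

keywords = textrank_keywords(
    "text summarization helps readers; summarization extracts key text units"
)
```

On this tiny input, the highest-scoring words are the ones most connected to the rest of the text; a full implementation would then collapse adjacent high-scoring words back into multi-word keyphrases.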
The best: abstract summarization
Even though they can be very useful, keyphrase-based summaries are still rudimentary compared to abstract summaries, which are much nicer to read and guarantee that no important information was “left behind” (i.e. missed by the extraction step). Also, keyphrase extraction algorithms still find it hard to summarize technical articles.
Abstract summarization is at the limit of what computers can do today, mainly because they struggle with the following tasks:
- Coreference resolution. For example, who does ‘he’ refer to in the sentence ‘Paul told John he should play squash’?
- Word sense disambiguation. In ‘I need batteries for my mouse’, are we talking about the animal or the device?
- Sentiment analysis. Understanding the mood of the writer is often key to understanding a text, especially when there is sarcasm.
Can computers help humans summarize text?
Today, computers are not good at summarizing text, and they probably won’t be for a long time. Nevertheless, we may be able to make significant progress if we had a huge database of summaries. That’s what we’re working on, so the future will tell!
Note: some examples were taken from Stanford’s Natural Language Processing online class. If you couldn’t take it this quarter, look for it: it is a terrific place to learn about this topic.