Tell me something I don’t know: Detecting novelty and redundancy with natural language processing

Introducing Pythia, a new Lab41 challenge

Patrick Callier
Published in Gab41
5 min read · Aug 30, 2016


Large collections of documents can be hard for humans to sift through on their own. High-quality search can help find what you want, and if you have the resources to annotate documents with tags or taxonomic categories, you can bring some order to an unwieldy corpus. But when the number of potentially relevant documents is high and your time to individually examine them is low, what really matters is finding documents that tell you what you don’t already know.

An audience with the Pythia

Pythia is Lab41’s exploration of solutions to this problem: can we flag novel and redundant documents on a given topic so that precious human and machine resources can be directed where they are needed?

Imagine you have a stream of documents on a particular topic coming into your possession one at a time. For each one that arrives, Pythia is trying to answer the question: does this document tell me something that isn’t in the documents I’ve already seen (novel)? Or does it basically repeat information that has already arrived in the corpus (redundant)?
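To make the framing concrete, here is a minimal sketch of the loop Pythia sits inside. `looks_novel` is a hypothetical scoring function standing in for whatever model we end up training; the point is only that each document is judged against everything that arrived before it.

```python
def label_stream(documents, looks_novel):
    """Label each arriving document relative to the documents seen so far."""
    seen = []
    labels = []
    for doc in documents:
        labels.append("novel" if looks_novel(doc, seen) else "redundant")
        seen.append(doc)
    return labels
```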

Data

NIST sponsored a competition on novelty detection in the early 2000s: TREC's "novelty track" had performers attempting to identify which sentences in a document contained new and relevant information on a given topic, judged against the assessments of human annotators on the same material. After a couple of hiccups, including annotation snafus that rendered the entire 2002 dataset release almost unusable, the performers in the final 2004 challenge came back with their entries. In the tasks focused on retrieving novel sentences, few entries achieved above-baseline performance, and none by much.

Performance on a novelty detection task in 2004. The eleven round dots on the left represent algorithms with slightly above baseline performance (from http://trec.nist.gov/pubs/trec13/papers/NOVELTY.OVERVIEW.pdf).

Part of the problem is probably in the data. What does it mean to be novel? Any two texts of reasonable length are almost certain to differ at some semantic level, and even within a single document, a sentence is unlikely to convey only information that is already given elsewhere.

We decided that novelty has to have some sort of practical meaning related to what people are actually doing with the documents. The Stack Exchange Data Dump was an obvious point of departure. When two questions are getting at exactly the same problem, community members in the Stack Exchange family of sites often flag one as a duplicate of the other, to help route traffic to high-quality questions and the correct answers. Stack Exchange users also often link to related questions that are nevertheless not exactly the same.

The distinction between related and duplicate questions is a great analogue for novelty vs. redundancy, and we are lucky that the Stack Exchange community has offered up its entire archives under Creative Commons. With some preprocessing and clever querying, we have been able to come up with thousands of query documents and the reference documents they do or do not duplicate.
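As a hedged sketch of the kind of querying involved: the public Stack Exchange data dump ships a PostLinks file in which, per the published schema, a LinkTypeId of 1 marks a "related" link and 3 marks a "duplicate" link. The file and attribute names below follow that dump format; the exact preprocessing we use is more involved than this.

```python
import xml.etree.ElementTree as ET

def collect_pairs(postlinks_path):
    """Pull (question, reference) pairs from a Stack Exchange PostLinks.xml dump."""
    duplicates, related = [], []
    for _, row in ET.iterparse(postlinks_path):
        if row.tag != "row":
            continue
        pair = (row.get("PostId"), row.get("RelatedPostId"))
        if row.get("LinkTypeId") == "3":
            duplicates.append(pair)   # candidate "redundant" examples
        elif row.get("LinkTypeId") == "1":
            related.append(pair)      # candidate "novel but on-topic" examples
        row.clear()
    return duplicates, related
```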

This allows us to get to the good part: finding a machine learning algorithm to help us tell the new apart from the old!

What we’re trying

Baseline techniques

To our minds, the simplest way to compare a document to one or more other documents is to represent the incoming document as a bag-of-words vector and its predecessors as a single bag-of-words vector: the sum of the bags-of-words for the individual predecessor documents. Such a simple featurization scheme leaves a lot of signal on the table, so you may not be surprised that, by itself, bag-of-words has performed poorly in our experiments so far.
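A minimal sketch of that baseline, assuming we score redundancy as the cosine similarity between the incoming document's bag-of-words and the summed bags-of-words of its predecessors (the threshold for calling something redundant would still have to be tuned):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def bow_redundancy(incoming, predecessors):
    """Cosine similarity between the new document and the summed predecessor bags-of-words."""
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(predecessors + [incoming])
    background = np.asarray(counts[:-1].sum(axis=0))  # sum of predecessor vectors
    new_doc = counts[-1].toarray()
    return cosine_similarity(background, new_doc)[0, 0]

seen = ["the quick brown fox jumps over the lazy dog",
        "a lazy dog sleeps in the sun"]
print(bow_redundancy("the quick brown fox jumps", seen))  # high score -> likely redundant
```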

We were also intrigued by the notion of "temporal inverse document frequency," used for novelty detection in this arXiv paper. The technique produces a document-level version of tf-idf in which a document's contribution is discounted according to its age. Unfortunately, it yields only a single scalar value for each document, and it hasn't worked very well on the Stack Exchange corpus. That is not entirely surprising, though, since relative novelty in the Stack Exchange data is not really a function of document age.
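To illustrate the general idea only (this is our own simplified reading, not the formula from the paper): each prior document's contribution to document frequency can be discounted by its age, so that terms last seen long ago still look relatively novel, and the incoming document gets a single scalar score.

```python
import math
from collections import Counter

def temporal_novelty(incoming_tokens, prior_docs, decay=0.1):
    """prior_docs: list of (tokens, age_in_days). Returns one scalar novelty score."""
    df = Counter()
    for tokens, age in prior_docs:
        weight = math.exp(-decay * age)          # older documents count less (illustrative decay)
        for term in set(tokens):
            df[term] += weight
    n = sum(math.exp(-decay * age) for _, age in prior_docs) or 1.0
    tf = Counter(incoming_tokens)
    # average age-discounted tf-idf over the incoming document's terms
    scores = [tf[t] * math.log((1 + n) / (1 + df[t])) for t in tf]
    return sum(scores) / max(len(scores), 1)
```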

Skip-thought vectors

Between sparse, high-dimensional bag-of-words vectors and one-dimensional scalars, perhaps there is a way to strike a bargain on both density and dimensionality? Enter "skip-thought" vectors, a sentence encoding technique that tries to represent a sentence in a way that makes the sentences before and after it easy to predict.

Skip-thought vectors give you a fixed-size representation for each sentence in a document; using those to generate a fixed-size representation of a whole document is a separate proposition. We plan to try the following (sketched after the list):

  • Just using the titles of Stack Exchange posts, which tend to be the question itself
  • Averaging the vectors for all the sentences in a post
  • Concatenating the vectors for the title, first sentence, and last sentence
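Here is what those three strategies might look like in code. `encode` is a placeholder for a sentence encoder that maps a list of sentences to a (num_sentences, dim) array of skip-thought vectors; the next section sketches one way to get such an encoder.

```python
import numpy as np

def title_only(title, encode):
    return encode([title])[0]

def average_sentences(sentences, encode):
    return encode(sentences).mean(axis=0)

def concat_title_first_last(title, sentences, encode):
    vecs = encode([title, sentences[0], sentences[-1]])
    return np.concatenate(vecs)   # stack the three vectors into one long vector
```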

We are playing with the code from the paper and have successfully used pre-trained encoders to get vectors for our data. We’re excited about the next step!
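Roughly how we have been getting sentence vectors from the pre-trained encoders released with the paper; the calls below follow the usage shown in the authors' repository README, and the exact interface may differ by version, so treat this as an approximation rather than a spec.

```python
import skipthoughts

model = skipthoughts.load_model()                 # load the pre-trained encoder
sentences = ["How do I merge two dictionaries?",
             "What is the fastest way to combine dicts?"]
vectors = skipthoughts.encode(model, sentences)   # one fixed-size vector per sentence
```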

Dynamic memory networks

Memory networks have been getting a lot of attention in the question answering (QA) literature. The memory part is the new bit — you can expose a network to input, and it learns functions to encode it efficiently and retrieve it when necessary.

In particular, the dynamic memory networks approach has recently begun to promote itself as a first step towards an NLP Swiss Army knife. Basically, since almost any NLP task can be posed as a natural-language question about some text, you can show a memory network the text, ask it that question, and train it to give the right answer.
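To make that concrete, here is one way the novelty task could be posed in a QA format: the previously seen documents become the supporting facts, the incoming document plus a fixed natural-language question become the query, and the supervised answer is simply "novel" or "redundant." The field names are our own illustration, not any particular DMN implementation's API.

```python
example = {
    "facts": [
        "How do I undo the most recent local commits in Git?",
        "How do I force git pull to overwrite local files?",
    ],
    "question": "Does the following document contain new information? "
                "How can I revert my last commit in Git?",
    "answer": "redundant",
}
```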

Our hope is that memory networks, given the correct encoding and retrieval functions, can learn to look at streams of documents and tell us whether arriving documents have fresh information or are just rehashed and warmed-over versions of what we’ve already seen.

We are actively developing toward a platform to test out these solutions. Our awesome intern Audrey has already published her post on our pipeline for preprocessing, training, and evaluation. And we’re excited by the rapid pace of research in NLP these days — if you know of anything you think we should look at, we welcome your thoughts!

Lab41 is a Silicon Valley challenge lab where experts from the U.S. Intelligence Community (IC), academia, industry, and In-Q-Tel come together to gain a better understanding of how to work with — and ultimately use — big data.

Learn more at lab41.org and follow us on Twitter: @_lab41
