Ad-hoc IR

Where IR systems might look for the “needle in the haystack”, topic models will tell you about the overall proportion of hay and needles, and perhaps inform you about the mice that you did not know were there.

Topic models are helpful when we have a specific information need but no idea how to search for it.

Traditionally: retrieve and rank documents by measuring the word overlap between queries and documents. Limited! Words with similar meanings or different surface forms should also count as matching keywords.

Language modeling: allows us to capture semantic relationships.

Query expansion: use background knowledge to interpret and understand queries and add missing words.

Document Language Modeling

A statistical language model estimates the probability of word sequences:

$$ p(w_1,w_2,\dots,w_n) $$

Then estimate the probability of a document generating a given query (maximum likelihood):

$$ p(q\mid d)=\prod_{w\in q}p(w\mid d)=\prod_{w\in q}\frac{n_{d,w}}{n_{d,.}} $$

For IR, rank documents by p(q|d). But maximum likelihood gives zero probability to unseen words, so a single missing query word can throw out an otherwise good match. Solve by smoothing: allocate non-zero probability to missing terms.
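A minimal sketch of the maximum-likelihood query likelihood above (function name and inputs are illustrative, not from the notes); note how one unseen query word zeroes out the whole score:

```python
from collections import Counter

def ml_query_likelihood(query_words, doc_words):
    """Maximum-likelihood p(q|d): product of per-word relative frequencies.

    Each factor is n_{d,w} / n_{d,.}; any query word absent from the
    document drives the whole product to zero.
    """
    counts = Counter(doc_words)
    total = len(doc_words)
    score = 1.0
    for w in query_words:
        score *= counts[w] / total  # n_{d,w} / n_{d,.}
    return score

doc = "the cat sat on the mat".split()
ml_query_likelihood(["cat"], doc)  # 1/6
ml_query_likelihood(["dog"], doc)  # 0.0 — good match thrown out
```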

Smoothing directions
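One common direction is linear interpolation with a collection-wide model (Jelinek-Mercer smoothing). A hedged sketch, assuming a single interpolation weight `lam` (a hypothetical setting, normally tuned on held-out data):

```python
from collections import Counter

def jm_query_likelihood(query_words, doc_words, collection_words, lam=0.5):
    """Jelinek-Mercer smoothing:

        p(w|d) = (1 - lam) * n_{d,w} / n_{d,.} + lam * p(w|C)

    Query words unseen in the document now receive collection-level
    mass instead of zeroing out the score.
    """
    d_counts = Counter(doc_words)
    c_counts = Counter(collection_words)
    n_d, n_c = len(doc_words), len(collection_words)
    score = 1.0
    for w in query_words:
        score *= (1 - lam) * d_counts[w] / n_d + lam * c_counts[w] / n_c
    return score

doc = "the cat sat".split()
collection = "the cat sat the dog ran".split()
jm_query_likelihood(["dog"], doc, collection)  # nonzero despite "dog" missing from doc
```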

Applying Topic Models to Document Language Models

Model the relationship between query words and documents by marginalizing over all topics:

$$ p(w\mid d) = \sum_k p(w\mid k)p(k\mid d) $$
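A minimal sketch of this marginalization, assuming the topic distributions are given as plain dicts (a hypothetical data layout; in practice `p_w_given_k` and `p_k_given_d` come from a fitted topic model such as LDA):

```python
def topic_model_prob(w, d, p_w_given_k, p_k_given_d):
    """p(w|d) = sum_k p(w|k) * p(k|d).

    p_w_given_k: topic -> {word: prob}
    p_k_given_d: doc id -> {topic: prob}
    """
    return sum(p_w_given_k[k].get(w, 0.0) * p_k
               for k, p_k in p_k_given_d[d].items())

# Toy model: topic 0 is about animals, topic 1 about hay.
p_w_given_k = {0: {"cat": 0.5, "dog": 0.5}, 1: {"hay": 1.0}}
p_k_given_d = {"d1": {0: 0.2, 1: 0.8}}
topic_model_prob("cat", "d1", p_w_given_k, p_k_given_d)  # 0.5 * 0.2 = 0.1
```

Even if "cat" never occurs in d1, it gets probability mass through the topics d1 is about, which is exactly the semantic matching the language-model approach was missing.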