Where IR systems might look for the “needle in the haystack”, topic models will tell you about the overall proportion of hay and needles, and perhaps inform you about the mice that you did not know were there.
Topic models are helpful when we have a specific information need but no idea how to search for it.
Traditionally: retrieve and rank documents by measuring the word overlap between queries and documents. This is limited: words with similar meanings or different surface forms should also count as matching keywords.
Language modeling: allows us to capture semantic relationships.
Query expansion: use background knowledge to interpret and understand queries and add missing words.
A statistical language model estimates the probability of word sequences:
$$ p(w_1,w_2,\dots,w_n) $$
Each document defines such a model; from it we compute the probability of generating a given query (using maximum-likelihood estimates):
$$ p(q\mid d)=\prod_{w\in q}p(w\mid d)=\prod_{w\in q}\frac{n_{d,w}}{n_{d,\cdot}} $$
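A minimal sketch of this unsmoothed estimate (the function name and toy tokenized inputs are illustrative, not from the notes):

```python
from collections import Counter

def query_likelihood_mle(query_terms, doc_terms):
    """Unsmoothed query likelihood: p(q|d) = prod over w in q of n_{d,w} / n_{d,.}."""
    counts = Counter(doc_terms)        # n_{d,w}: term frequencies in the document
    doc_len = sum(counts.values())     # n_{d,.}: document length
    score = 1.0
    for w in query_terms:
        score *= counts[w] / doc_len   # zero whenever w never occurs in d
    return score

doc = "topic models describe documents as mixtures of topics".split()
print(query_likelihood_mle(["topic", "models"], doc))  # small but positive
print(query_likelihood_mle(["topic", "mouse"], doc))   # 0.0: one unseen term kills the match
```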
For IR, rank documents by p(q|d). But maximum likelihood assigns zero probability to any query word unseen in the document, so a single missing term can throw out an otherwise good match. The remedy is smoothing: allocate non-zero probability to missing terms.
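One standard remedy (not named in the notes) is Jelinek-Mercer smoothing, which interpolates the document model with a background collection model; a sketch, with the mixing weight `lam` chosen purely for illustration:

```python
from collections import Counter

def query_likelihood_jm(query_terms, doc_terms, corpus_terms, lam=0.5):
    """Jelinek-Mercer smoothing:
    p(w|d) = lam * p_mle(w|d) + (1 - lam) * p(w|corpus),
    so query terms missing from the document keep non-zero mass."""
    d, c = Counter(doc_terms), Counter(corpus_terms)
    n_d, n_c = sum(d.values()), sum(c.values())
    score = 1.0
    for w in query_terms:
        p_doc = d[w] / n_d              # document model: zero for unseen terms
        p_bg = c[w] / n_c               # collection model supplies background mass
        score *= lam * p_doc + (1 - lam) * p_bg
    return score
```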
Smoothing directions
Link query words and documents by marginalizing over all topics:
$$ p(w\mid d) = \sum_k p(w\mid k)p(k\mid d) $$
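A sketch of this marginalization with toy, randomly drawn parameters; in a real system `phi` and `theta` would come from a fitted topic model such as LDA:

```python
import numpy as np

K, V, D = 3, 5, 2                          # topics, vocabulary size, documents
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(V), size=K)    # rows: p(w | k) for each topic k
theta = rng.dirichlet(np.ones(K), size=D)  # rows: p(k | d) for each document d

# p(w | d) = sum_k p(w | k) p(k | d), computed for all (d, w) pairs at once:
p_w_given_d = theta @ phi                  # shape (D, V)
print(p_w_given_d.sum(axis=1))             # each row sums to 1, as a distribution should
```

Because every topic assigns some probability to every word, this smoothed p(w|d) is non-zero even for query terms the document never contains.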