Lucene’s docFreq Got You Down? Replace It With a Custom Collector
I came across a weird thing in Lucene while using its document frequency API:
int docFreq(Term term) – Expert: Returns the number of documents containing
term. Called by search code to compute term weights.
You can use this call to quickly find the number of documents in your index that match a given term. The problem I ran into is that deleted documents still show up in the count returned by docFreq(). Worse yet, document frequencies keep including deleted documents until the index is optimized. Yikes! Index optimization is very slow and expensive, so I really do not want to optimize just because we deleted a few documents. The real answer is not to use docFreq() at all. We can use a custom Collector instead to get the desired effect.
I have an administrative view that looks at the search index and shows you the count of each “kind” of document present in the index.
An inaccurate count of the index contents would confuse administrators, and it could easily happen. During a re-index I delete all of a type’s documents and start re-adding them, and in some cases the reported total would double. Why? Because docFreq() was counting the deleted documents. Let’s fix that.
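Before fixing it, the doubling can be reproduced in a small toy model. The sketch below is not Lucene code: ToySegment and its methods are hypothetical names, and the only assumption it encodes is the behaviour described above, namely that a delete merely marks a document while the stored term statistic keeps counting it.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model only: ToySegment is hypothetical and is not part of Lucene.
final class ToySegment {
    private final List<String> kinds = new ArrayList<>(); // "kind" term of each document
    private final BitSet deleted = new BitSet();          // deletes are only marked here

    // Adds a document and returns its id.
    int add(String kind) {
        kinds.add(kind);
        return kinds.size() - 1;
    }

    // Marks a document as deleted without touching the stored postings.
    void delete(int docId) {
        deleted.set(docId);
    }

    // Mimics a stored term statistic: every posting counts, deleted or not.
    int docFreq(String kind) {
        int n = 0;
        for (String k : kinds) {
            if (k.equals(kind)) n++;
        }
        return n;
    }

    // Mimics what a search visits: deleted documents are skipped.
    int liveCount(String kind) {
        int n = 0;
        for (int id = 0; id < kinds.size(); id++) {
            if (!deleted.get(id) && kinds.get(id).equals(kind)) n++;
        }
        return n;
    }
}

final class ReindexDemo {
    public static void main(String[] args) {
        ToySegment index = new ToySegment();
        for (int i = 0; i < 3; i++) index.add("page");      // initial indexing
        for (int id = 0; id < 3; id++) index.delete(id);    // re-index: delete all...
        for (int i = 0; i < 3; i++) index.add("page");      // ...then re-add
        System.out.println("docFreq   = " + index.docFreq("page"));   // 6
        System.out.println("liveCount = " + index.liveCount("page")); // 3
    }
}
```

Three live documents, yet the stored statistic reports six, which is the kind of number an administrator would see mid re-index.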
A Counting Collector
When you search in Lucene you can give the searcher a Collector. It feels a bit like the visitor pattern: the search calls your collector once for each document that matches your query.
The collector is very simple: it just tracks the number of times its Collect method is called. Better yet, it does not count deleted documents, because the search never hands them to a collector. With that in place, I updated my search code.
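As a rough illustration of the idea, here is a minimal, self-contained sketch in Java. It deliberately avoids the real Lucene types: HitCollector and ToySearcher are hypothetical stand-ins, and Lucene's actual Collector class has additional callbacks that differ between versions. The point is only the shape of the technique: the search skips deleted documents and calls collect() once per remaining match, and the collector increments a counter.

```java
import java.util.BitSet;

// Hypothetical stand-in for a Lucene-style collector callback; not the real API.
interface HitCollector {
    void collect(int docId);
}

// Counts how many times the search calls collect().
final class CountingCollector implements HitCollector {
    private int count;

    @Override
    public void collect(int docId) {
        count++;
    }

    int getCount() {
        return count;
    }
}

// Toy searcher (assumption): skips deleted documents before calling the collector.
final class ToySearcher {
    private final String[] kinds;
    private final BitSet deleted;

    ToySearcher(String[] kinds, BitSet deleted) {
        this.kinds = kinds;
        this.deleted = deleted;
    }

    void search(String kind, HitCollector collector) {
        for (int docId = 0; docId < kinds.length; docId++) {
            if (!deleted.get(docId) && kinds[docId].equals(kind)) {
                collector.collect(docId);
            }
        }
    }
}

final class CountingDemo {
    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(2); // document 2 was deleted
        ToySearcher searcher =
                new ToySearcher(new String[] {"page", "post", "page", "page"}, deleted);

        CountingCollector collector = new CountingCollector();
        searcher.search("page", collector);         // hand the collector to the search
        System.out.println(collector.getCount());   // 2: the deleted "page" is not counted
    }
}
```

Using a fresh collector per query keeps the counts separate, which suits a view that shows one number per kind of document.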
How Do I Use One Of Those?
Doing searches with custom collectors is quite easy. You just give an instance of one to the search method and interrogate it afterwards for the information you require.
You might notice some overly fancy code in there, but hopefully you get the idea. You may find the Func-y way of calling the searcher weird, but it’s a handy way to have all your search code share the same error handling.
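For readers who have not seen that pattern, the following is one possible sketch of it in Java, not the code from this project. SearchRunner, run, and the fallback parameter are all hypothetical names: the caller passes a function that uses the searcher, and the wrapper is the single place where failures are caught and reported.

```java
import java.util.function.Function;

// Hypothetical helper: every search goes through run(), so error handling lives in one place.
final class SearchRunner<S> {
    private final S searcher;

    SearchRunner(S searcher) {
        this.searcher = searcher;
    }

    // Applies the query function to the searcher; on failure, logs and returns the fallback.
    <T> T run(Function<S, T> query, T fallback) {
        try {
            return query.apply(searcher);
        } catch (RuntimeException e) {
            System.err.println("search failed: " + e.getMessage());
            return fallback;
        }
    }
}

final class SearchRunnerDemo {
    public static void main(String[] args) {
        // A String stands in for a real searcher here, purely for illustration.
        SearchRunner<String> runner = new SearchRunner<>("idx");

        Integer ok = runner.run(s -> s.length(), -1);
        System.out.println(ok); // 3

        Integer failed = runner.run(s -> {
            throw new IllegalStateException("index unavailable");
        }, -1);
        System.out.println(failed); // -1, after the error is logged
    }
}
```

A counting search would fit the same shape: the function would create a collector, run the query with it, and return its count.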