Lucene’s docFreq Got You Down? Replace It With a Custom Collector
I ran into some surprising behavior in Lucene’s document frequency API:
int docFreq(Term term) – Expert: Returns the number of documents containing term. Called by search code to compute term weights.
You can use this call to quickly find the number of documents in your index that contain a given term. The problem I ran into is that deleted documents still show up in the count returned by docFreq(). Worse yet, document frequencies keep including deleted documents until those documents are merged out of the index, which an index optimization forces. Yikes! Index optimization is very slow and expensive, so I really do not want to optimize just because we deleted a few documents. The real answer is not to use docFreq for this at all. We can instead use a custom Collector to get an accurate count.
My Problem
I have an administrative view that looks at the search index and shows you the count of each “kind” of document present in the index.
An inaccurate count of index contents would confuse administrators, and it could easily happen. During a re-index I delete all of a type’s documents and start re-adding them, and in some cases the reported total would double. Why? Because docFreq() was counting the deleted documents along with the new ones. Let’s fix that.
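To make the doubling concrete, here is a toy model of the re-index scenario. It is only a sketch and not Lucene's real data structures: ToySegment and all of its members are hypothetical names invented for illustration. It assumes the behavior described above, namely that a delete only marks a document, that the per-term count ignores those marks, and that a merge (which is what an optimize forces) rewrites the postings without the deleted documents.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model only: hypothetical types, not Lucene.NET classes.
public class ToySegment
{
    // Postings: term -> ids of every document written with that term.
    private readonly Dictionary<string, List<int>> _postings =
        new Dictionary<string, List<int>>();

    // Deleting only marks an id; the postings are left untouched.
    private readonly HashSet<int> _deleted = new HashSet<int>();

    private int _nextDocId;

    public void Add(string term)
    {
        List<int> docs;
        if (!_postings.TryGetValue(term, out docs))
        {
            docs = new List<int>();
            _postings[term] = docs;
        }
        docs.Add(_nextDocId++);
    }

    public void DeleteAll(string term)
    {
        List<int> docs;
        if (_postings.TryGetValue(term, out docs))
        {
            foreach (var id in docs) _deleted.Add(id);
        }
    }

    // Plays the role of docFreq: reads the postings size, ignores deletions.
    public int DocFreq(string term)
    {
        List<int> docs;
        return _postings.TryGetValue(term, out docs) ? docs.Count : 0;
    }

    // Plays the role of a search with a counting collector: deleted ids are skipped.
    public int CountLive(string term)
    {
        List<int> docs;
        return _postings.TryGetValue(term, out docs)
            ? docs.Count(id => !_deleted.Contains(id))
            : 0;
    }

    // Plays the role of a merge/optimize: rewrites postings without deleted ids.
    public void Merge()
    {
        foreach (var term in _postings.Keys.ToList())
        {
            _postings[term] = _postings[term].Where(id => !_deleted.Contains(id)).ToList();
        }
        _deleted.Clear();
    }
}

public static class ToyDemo
{
    public static void Run()
    {
        var segment = new ToySegment();
        segment.Add("kind:page");
        segment.Add("kind:page");        // initial index: two pages
        segment.DeleteAll("kind:page");  // re-index, step 1: delete the type
        segment.Add("kind:page");
        segment.Add("kind:page");        // re-index, step 2: re-add the documents

        Console.WriteLine(segment.DocFreq("kind:page"));   // prints 4 (doubled)
        Console.WriteLine(segment.CountLive("kind:page")); // prints 2
        segment.Merge();
        Console.WriteLine(segment.DocFreq("kind:page"));   // prints 2
    }
}
```

In this model the doubled count only goes away after Merge(), which is the expensive step the rest of this post avoids by counting live matches instead.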
A Counting Collector
When you search in Lucene you can give the searcher a Collector. It feels a bit like the visitor pattern: the search calls your collector once for each document that matches your query.
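public class CounterCollector : Collector
{
    // Number of matching documents seen since the last Reset().
    public int Count { get; private set; }

    public void Reset()
    {
        Count = 0;
    }

    // Called once per matching document.
    public override void Collect(int docID)
    {
        Count = Count + 1;
    }

    // Scores and per-reader doc bases are not needed just to count.
    public override void SetScorer(Scorer scorer) { }

    public override void SetNextReader(IndexReader reader, int docBase) { }

    // Order does not matter when counting.
    public override bool AcceptsDocsOutOfOrder()
    {
        return true;
    }
}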
This very simple collector just tracks the number of times the Collect method is called. Better yet, it does not count deleted documents, because the search never hands them to the collector. So finally I updated my search code.
How Do I Use One Of Those?
public int GetNumberOfDocumentsForTerm(Term term)
{
    return searchIndex(searcher =>
    {
        // replacing this
        // return searcher.DocFreq(term);
        // with this
        var counterCollector = new CounterCollector();
        searcher.Search(new TermQuery(term), counterCollector);
        return counterCollector.Count;
    });
}

private T searchIndex<T>(Func<IndexSearcher, T> searchAction)
{
    // Open a read-only searcher over the index directory.
    var indexSearcher = new IndexSearcher(_directory, true);
    try
    {
        return searchAction(indexSearcher);
    }
    catch (BooleanQuery.TooManyClauses tooManyClausesException)
    {
        throw new SearchException("Your wildcard query was too broad please narrow your search. Example - change a* to apple*", tooManyClausesException);
    }
    finally
    {
        // Close the searcher even when the search throws.
        indexSearcher.Close();
    }
}
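For the administrative view, one collector instance can be reused for every kind by calling Reset() between searches. The following is a minimal sketch rather than the code from my project: it assumes the same Collector API used above, and the field name "kind", the KindCounts class and the CountByKind method are hypothetical names picked for illustration.

```csharp
using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Search;

// Same collector as above, repeated so the sketch compiles on its own.
public class CounterCollector : Collector
{
    public int Count { get; private set; }
    public void Reset() { Count = 0; }
    public override void Collect(int docID) { Count++; }
    public override void SetScorer(Scorer scorer) { }
    public override void SetNextReader(IndexReader reader, int docBase) { }
    public override bool AcceptsDocsOutOfOrder() { return true; }
}

public static class KindCounts
{
    // Assumption: each document stores its type in an untokenized field
    // called "kind". Substitute whatever field your index actually uses.
    public static Dictionary<string, int> CountByKind(
        IndexSearcher searcher, IEnumerable<string> kinds)
    {
        var counts = new Dictionary<string, int>();
        var collector = new CounterCollector();

        foreach (var kind in kinds)
        {
            collector.Reset(); // start from zero for each kind
            searcher.Search(new TermQuery(new Term("kind", kind)), collector);
            counts[kind] = collector.Count;
        }

        return counts;
    }
}
```

Reusing the collector is only a small convenience. Creating a fresh CounterCollector per search, as GetNumberOfDocumentsForTerm does, gives the same counts.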
Take Away
Doing searches with custom collectors is quite easy. You just give an instance of one to the search method and interrogate it afterwards for the information you require.
You might notice some overly fancy code in there, but hopefully you get the idea. You may find the Func-y way of calling the searcher weird, but it’s a handy way to have all your search code share the same error handling.