Seeker Indexer Architecture Changes

Great changes are happening with our Dovetail Seeker search product, and I want to talk about a core one: by moving our application architecture to messaging, we are seeing increased flexibility, testability, and performance.

Why Messaging?

Based on our experience building Dovetail Carrier (pdf), our messaging-oriented enterprise integration solution, we decided messaging was a great fit for Seeker’s search indexing Windows service.

We decomposed the indexer’s behaviors into work (message) producers and consumers. For example, to keep the search index in sync with your database, we have a service that watches the database for changes to the objects Seeker is indexing. When a database item is added or updated, a message is created and put onto a work queue.

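Here is a rough sketch of the producer side of that idea. It is illustrative Java rather than our actual code, and an in-process BlockingQueue stands in for the MSMQ work queue; the ChangeWatcherSketch and IndexMessage names are made up for the example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative message describing one piece of indexing work.
record IndexMessage(String objectType, String objectId) {}

// Watches the database for changes and turns each one into a queued message.
// In the real indexer the queue is MSMQ; a BlockingQueue stands in for it here.
class ChangeWatcherSketch {
    private final BlockingQueue<IndexMessage> workQueue = new LinkedBlockingQueue<>();

    // Called whenever a watched object is added or updated in the database.
    void onDatabaseChange(String objectType, String objectId) throws InterruptedException {
        workQueue.put(new IndexMessage(objectType, objectId)); // producer side: enqueue the work
    }

    BlockingQueue<IndexMessage> queue() {
        return workQueue;
    }
}
```
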
We have many message consumer threads standing by waiting for work to munch on. Our message consumers read like our feature set:

  • UpdateClarifyDocument – indexes your Clarify objects.
  • UpdateClarifyAttachment – indexes the file attachments on your Clarify objects.
  • UpdateFileDocument – indexes files.

Each message is basically a feature. You may notice we’ve added indexing of files and attachments for Clarify objects, but that’s another blog post. This new architecture makes it easier for us to plug in and try out new functionality.

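To make “each message is basically a feature” concrete, here is a sketch of what a consumer can look like. The MessageConsumer interface and UpdateClarifyDocumentConsumer class are illustrative names (reusing the IndexMessage type from the sketch above), not our actual types; the point is that adding a feature amounts to adding another handler.

```java
// One consumer per message type, so a new feature is just a new handler.
interface MessageConsumer<T> {
    void consume(T message);
}

// Illustrative consumer mirroring the UpdateClarifyDocument feature.
class UpdateClarifyDocumentConsumer implements MessageConsumer<IndexMessage> {
    @Override
    public void consume(IndexMessage message) {
        // A real consumer would: load the Clarify object for message.objectId(),
        // build a search document from it, and hand that document to the index writer.
        // The body is left as a placeholder here.
    }
}
```
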
What is the downside?

We are taking a dependency on Microsoft Message Queue (MSMQ), which means there are more moving parts to keep track of.

A message-oriented system is a bit tougher to debug, as you have to get used to the parallel nature of message processing. More than one message can be consumed at a time, which means we have to be careful to avoid introducing side effects into our code.

There is added complexity in managing the Lucene index writer. Only one writer is allowed to modify the index at a time, so we have to be careful when multiple messages are being processed simultaneously.

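One workable pattern here, sketched with the Java Lucene API purely as an illustration (Seeker itself is a .NET application), is to open a single IndexWriter and share it across all of the consumer threads: Lucene allows only one writer per index, but that one writer is safe to call from multiple threads.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

// One IndexWriter per index: Lucene takes a write lock, so a second writer on the
// same directory would fail. The writer itself is thread safe, so every consumer
// thread can share this single instance.
class SharedIndexWriter {
    private final IndexWriter writer;

    SharedIndexWriter(String indexPath) throws Exception {
        FSDirectory directory = FSDirectory.open(Paths.get(indexPath));
        writer = new IndexWriter(directory, new IndexWriterConfig(new StandardAnalyzer()));
    }

    // Safe to call from many consumer threads at once.
    void upsert(String id, String body) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        doc.add(new TextField("body", body, Field.Store.NO));
        writer.updateDocument(new Term("id", id), doc); // replaces any existing doc with this id
    }

    void close() throws Exception {
        writer.commit();
        writer.close();
    }
}
```
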
Stopping the indexer Windows service under heavy load can take a while, because all of the consumers have to finish their in-flight work before the service can exit.

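To picture why a graceful stop can be slow, here is a minimal sketch of the shutdown path, again using the illustrative types from the sketches above rather than our service code: the consumers stop picking up new messages, and the service waits for each thread to finish the message it is currently working on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Minimal sketch of a graceful stop: no new messages are taken, but every consumer
// thread gets to finish its current message, which is the wait you see under load.
class ConsumerPool {
    private final List<Thread> threads = new ArrayList<>();
    private volatile boolean running = true;

    void start(int consumerCount, BlockingQueue<IndexMessage> workQueue,
               MessageConsumer<IndexMessage> consumer) {
        for (int i = 0; i < consumerCount; i++) {
            Thread thread = new Thread(() -> {
                while (running) {
                    try {
                        // Poll with a timeout so the loop notices a stop request.
                        IndexMessage message = workQueue.poll(1, TimeUnit.SECONDS);
                        if (message != null) {
                            consumer.consume(message); // may take a while for big objects
                        }
                    } catch (InterruptedException stopRequested) {
                        return;
                    }
                }
            });
            thread.start();
            threads.add(thread);
        }
    }

    void stop() throws InterruptedException {
        running = false;              // stop taking new messages off the queue
        for (Thread thread : threads) {
            thread.join();            // wait for in-flight messages to finish
        }
    }
}
```
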
Wow, that is a lot of downside. Why is this worth it?

But there is a sunny upside.

I already mentioned that it is easier to plug in new functionality. We learned on Dovetail Carrier how to make it super easy to extend a message-based system, and we leveraged that knowledge in Seeker.

Performance. We are now getting parallelism in our message processing. Before, our search indexer would chug through one Clarify object at a time: get it from the database, update the index, repeat. Now we have multiple consumers all working together on discrete pieces of the indexing puzzle, although there is a bit of overhead due to all the message passing going on.

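Wiring the sketches above together shows where that parallelism comes from: one watcher producing messages, many consumer threads pulling them off the queue, and a single shared writer taking the updates. The names and numbers here are illustrative only, not Seeker’s actual startup code.

```java
// Illustrative wiring of the earlier sketches; not Seeker's actual startup code.
class IndexerWiringSketch {
    public static void main(String[] args) throws Exception {
        ChangeWatcherSketch watcher = new ChangeWatcherSketch();
        SharedIndexWriter index = new SharedIndexWriter("build/sample-index");

        ConsumerPool pool = new ConsumerPool();
        pool.start(16, watcher.queue(), message -> {
            try {
                // A real consumer would load the changed object from the database first.
                index.upsert(message.objectId(), "fetched document text");
            } catch (Exception e) {
                e.printStackTrace(); // a real consumer would retry or dead-letter the message
            }
        });

        // Simulate a couple of database changes arriving.
        watcher.onDatabaseChange("case", "case-1");
        watcher.onDatabaseChange("solution", "solution-9");

        Thread.sleep(1000); // give the consumers a moment to work
        pool.stop();
        index.close();
    }
}
```
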
Let’s talk more about performance.

Unscientific Back of the Napkin Benchmark Using Paper and Pencil

This is not a very accurate or scientific benchmark. Your mileage may vary depending on your setup and such; these numbers come from my development bench, with the indexer running in the background while I failed to get other work done. That said, I like what I see.

The Indexer – Where Seeker gets installed.

My development machine is not stellar, but it is pretty sweet: 4 cores (8 logical cores with hyper-threading) and 8 GB of memory.

System Details

I installed Dovetail Seeker on my dev machine and pointed the database settings at a sample database where we have a decent-sized set of data.

The Mark – The database under test.

The sample database is a real-world SQL Server database with over one million cases and a decent number of contacts and solutions. The machine is a low-end (4-year-old) server. The connection is a 100 Mbit network link with 2 hops.

I have attachment and file indexing turned off to make the comparison against the previous version of Seeker more apples to apples. Here is a snapshot of the number of objects we are dealing with.

Number of documents after indexing.

That is 1.06 million database objects to index.

Indexing Run Results

I did three runs: the first with Seeker 1.5.1 out of the box, and the last two with Seeker 2.0 using different numbers of consumer threads.

Version   Consumers   Minutes to index   Objects/second
1.5.1     n/a         105.25             169
2.0       4           46.5               382
2.0       16          36.75              484

Carry the one. Dot the i… The new Seeker is indexing roughly 2.3 to 2.9 times as fast as 1.5.1. I am pretty happy with ~1.7 million database objects per hour.

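(For the curious, the arithmetic behind those figures: the throughput ratios are 382 ÷ 169 ≈ 2.26 and 484 ÷ 169 ≈ 2.86, and 484 objects/second × 3,600 seconds ≈ 1.74 million objects per hour.)
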
Consumers

An interesting detail is the significant boost in performance from adding consumers to the run. Our Message Bus recommends 2 threads per processor, and clearly that is good advice.

The problem is that, under load, when you tell the Windows service to shut down it takes a long time for all the consumer threads to finish up. Thankfully this is an edge case, and we are looking at ways to improve this part of the user experience. For now, if you need to control the number of consumers to improve the timeliness of shutdown, there is a configuration setting for that.

Conclusion

We are really excited to get these improvements into the hands of our customers. Some of our customers have a very large number of objects being indexed and can experience multi-day indexing times. Seeker 2.0 is no silver bullet for making your indexing runs faster, but every bit really helps.