How much disk space do I need for Seeker indexes?
As we continue to roll out new instances of the Dovetail Seeker search engine to our customers, one question that frequently arises is: How much disk space do I need for the Dovetail Seeker indexes?
A little background
Dovetail Seeker contains two major components: indexing and searching. Before you can search for your data, you need to index it.
An index is a collection of searchable data organized into documents, each having many fields of information. Every document in the index is a potential search result, and each of a document’s fields can contain one or more searchable terms.
For example, you will likely wish to search for cases. For each case in the system, the indexing application will add a document to the index containing details about that case. The document will contain at least an id, title, and case summary. Once a document for that case is present in the index, it can be searched: the case id, title, and summary are available as search results.
Files can also be indexed. When the indexer encounters a CRM attachment or is told to index a directory, it adds a file document to the index, using the text extracted from the file as the searchable summary contents.
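To make the document-and-field idea concrete, here is a toy sketch in Python. It is not how Seeker or Lucene.Net store anything on disk; the `ToyIndex` class, its methods, and the sample cases are all hypothetical and exist only to illustrate documents, fields, and searchable terms.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical in-memory index: documents made of fields, searchable by term."""

    def __init__(self):
        self.documents = {}               # doc key -> dict of fields
        self.postings = defaultdict(set)  # term -> set of doc keys containing it

    def add_document(self, doc_key, fields):
        """Store the document and register every word of every field as a term."""
        self.documents[doc_key] = fields
        for value in fields.values():
            for term in str(value).lower().split():
                self.postings[term].add(doc_key)

    def search(self, term):
        """Return the documents whose fields contain the term, ordered by key."""
        matches = self.postings.get(term.lower(), ())
        return [self.documents[key] for key in sorted(matches)]


# Made-up sample cases, each with the id, title, and summary fields described above.
index = ToyIndex()
index.add_document("case-1", {"id": "1", "title": "Printer offline",
                              "summary": "Printer stops responding after update"})
index.add_document("case-2", {"id": "2", "title": "Login failure",
                              "summary": "User cannot log in after password reset"})

for hit in index.search("printer"):
    print(hit["id"], hit["title"])
```

A real search library does far more (analysis, scoring, compressed on-disk structures), which is part of why index size is hard to predict from document counts alone.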
These indexes used by Seeker are stored on disk, not in the Clarify/Dovetail database.
So how much space do I need?
My typical consultant answer: it depends.
It depends on the size of the data that’s being indexed. For example, let’s say we’re indexing a case, which includes the case history. Is the case history small, such as just a few notes? Or is it large, with tons of notes, phone logs, inbound and outbound email logs, etc.? One case might be only a few kilobytes in size, while others might be 100 KB or more.
The larger the amount of data that needs to be indexed, the larger the index.
How about some guidelines?
I’ve asked some of our customers who use Seeker to share their specific details. I’ll post specific information below.
Averaging out customer data, a good general guideline seems to be about 2 KB/document, where a document is a case, contact, subcase, solution, etc.
Based on that estimate:
- if you index 100,000 documents, then the space required would be 195 MB
- if you index 1 million documents, then the space required would be 1.9 GB
- if you index 10 million documents, then the space required would be 19 GB
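If you want to plug in your own document counts, the arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope helper, not part of Seeker; the function names are made up for illustration, and it assumes binary units (1 MB = 1024 KB), which is what the figures above appear to use.

```python
KB_PER_DOCUMENT = 2  # general guideline quoted above; adjust for your own data


def estimate_index_size_kb(document_count, kb_per_document=KB_PER_DOCUMENT):
    """Estimated index size in KB for a given number of documents."""
    return document_count * kb_per_document


def format_size(size_kb):
    """Render a size in KB using binary units (1 MB = 1024 KB, 1 GB = 1024 MB)."""
    if size_kb >= 1024 * 1024:
        return f"{size_kb / (1024 * 1024):.1f} GB"
    if size_kb >= 1024:
        return f"{size_kb / 1024:.1f} MB"
    return f"{size_kb:.1f} KB"


for count in (100_000, 1_000_000, 10_000_000):
    size = format_size(estimate_index_size_kb(count))
    print(f"{count:>10,} documents: {size}")
```

Passing a different `kb_per_document` (say, 3 for history-heavy cases) gives a quick feel for how sensitive the estimate is to the size of your data.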
Overall, the storage required is relatively small. Since Seeker uses the excellent Lucene.Net search library, we really owe much of this efficiency to that library.
How about some real-world specific data?
The following collection of data is specific, real-world, customer data:
|Total # of documents||Document breakdown||Total Disk Space||Disk Space / Document|
| || ||7.66 GB||1.6 KB/document|
| || ||1.35 GB||0.81 KB/document|
|8,667||8,667 Cases||25.2 MB||3 KB/document|
| ||896 File Attachments||76.3 MB||2.23 KB/document|
As you can see, there’s some fluctuation compared to the general guideline, so my non-committal response of “it depends” still stands. Regardless, we’re in the right order of magnitude.
How about external files?
In addition to objects from the database, Seeker can also index external files. These could be attachments (such as case or subcase attachments), or a collection of files, such as product documentation, whitepapers, etc.
The following is specific data from indexing of files:
|Fileset||Total file size||Total Disk Space for index of these files||Percentage of index size to file size|
|56 PDF files||90.5 MB||3.75 MB||4.1 %|
|56 MS Word DOC files||32.3 MB||196 KB||0.6 %|
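The percentages in the table are simply index size divided by file size. As a rough sketch (the helper names are hypothetical, and two filesets are far too few to treat these ratios as anything more than a ballpark), the same calculation in Python looks like this:

```python
def index_ratio_percent(index_mb, files_mb):
    """Index size as a percentage of the size of the files that were indexed."""
    return index_mb / files_mb * 100


def estimate_file_index_mb(files_mb, ratio_percent):
    """Project an index size from a total file size and an observed ratio."""
    return files_mb * ratio_percent / 100


# Measurements from the table above (the DOC index is 196 KB, converted to MB).
pdf_ratio = index_ratio_percent(3.75, 90.5)
doc_ratio = index_ratio_percent(196 / 1024, 32.3)

print(f"PDF: {pdf_ratio:.1f} %   DOC: {doc_ratio:.1f} %")

# Illustrative projection only: 1 GB of PDFs at the observed PDF ratio.
print(f"{estimate_file_index_mb(1024, pdf_ratio):.1f} MB")
```

The ratio depends heavily on how much extractable text a file contains relative to its size, so expect it to vary between filesets.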
Have data to share?
If you have Seeker implemented in your environment and you haven’t previously shared your Seeker sizing stats with us, please do so. We’d love to hear your specific numbers. Email me at email@example.com
Hopefully this information is useful when planning out your Dovetail Seeker implementation.