
How much disk space do I need for Seeker indexes?

 

As we continue to roll out new instances of the Dovetail Seeker search engine to our customers, one question that frequently arises is: How much disk space do I need for the Dovetail Seeker indexes?

A little background

 

Dovetail Seeker contains two major components: indexing and searching. Before you can search for your data, you need to index it.

 

An index is a collection of searchable data organized into documents, each having many fields of information. Every document in the index is a potential search result, and each of its fields can contain one or more searchable terms.

 

For example, you will likely wish to search for cases. For each case in the system, the indexing application will add a document to the index containing details about that case. The document will contain at least an id, title, and case summary. Once a document for that case is present in the index, it can be searched, and the case id, title, and summary are available as search results.

 

Files can also be indexed. When the indexer encounters a CRM attachment, or is told to index a directory, it adds a file document to the index, with the text extracted from the file used as its searchable summary.

 

The indexes used by Seeker are stored on disk, not in the Clarify/Dovetail database.
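
Because the indexes are just files on disk, you can measure an existing one by adding up the files in its directory. Here's a minimal Python sketch of that idea. The function names are my own for illustration, and nothing here is part of Seeker itself.

```python
import os


def directory_size_bytes(path):
    """Return the total size, in bytes, of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symbolic links so a linked file is not counted twice.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def bytes_to_mb(size_bytes):
    """Convert bytes to megabytes, using 1 MB = 1024 * 1024 bytes."""
    return size_bytes / (1024 * 1024)
```

Calling bytes_to_mb(directory_size_bytes(index_dir)), where index_dir is whatever directory holds your index (a placeholder here, not a Seeker default), gives the index size in MB.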

So how much space do I need?

 

My typical consultant answer: it depends.

 

It depends on the size of the data that’s being indexed. For example, let’s say we’re indexing a case, which includes the case history. Is the case history small, with just a few notes? Or is it large, with tons of notes, phone logs, inbound and outbound email logs, etc.? One case might be only a few kilobytes in size, while others might be 100 KB or more.

 

The larger the amount of data that needs to be indexed, the larger the index.

How about some guidelines?

 

I’ve asked some of our customers who use Seeker to share their specific details. I’ll post specific information below.

 

Averaging out customer data, a good general guideline seems to be about 2 KB/document, where a document is a case, contact, subcase, solution, etc.

 

Based on that estimate:

  • if you index 100,000 documents, then the space required would be about 195 MB
  • if you index 1 million documents, then the space required would be about 1.9 GB
  • if you index 10 million documents, then the space required would be about 19 GB
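
If you want to plug in your own document counts, the arithmetic behind those numbers is easy to script. This is a minimal sketch, assuming the 2 KB/document guideline and binary units (1 MB = 1024 KB, 1 GB = 1024 MB); the function names are my own.

```python
KB_PER_DOCUMENT = 2  # general guideline; your own data may differ


def estimate_index_size_kb(document_count, kb_per_document=KB_PER_DOCUMENT):
    """Estimate the index size in KB for a given number of documents."""
    return document_count * kb_per_document


def format_size(size_kb):
    """Format a size given in KB using the largest sensible binary unit."""
    units = ["KB", "MB", "GB", "TB"]
    size = float(size_kb)
    index = 0
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    return f"{size:.1f} {units[index]}"


for count in (100_000, 1_000_000, 10_000_000):
    print(f"{count:>10,} documents: {format_size(estimate_index_size_kb(count))}")
```

To model a heavier or lighter data set, pass a different kb_per_document value.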

 

Overall, that is a relatively small amount of storage. Seeker uses the excellent Lucene.Net search library, so we really owe much of this efficiency to it.

How about some real-world specific data?

 

The following collection of data is specific, real-world, customer data:

| Total # of documents | Document breakdown | Total disk space | Disk space / document |
|---|---|---|---|
| 4,794,208 | 4.6M cases; 75,000 solutions; 1500 contacts; 75,000 subcases | 7.66 GB | 1.6 KB/document |
| 1,729,402 | 1.3M cases; 39,000 CRs; 20,000 Problems; 4000 subcases; 319K Logistics | 1.35 GB | 0.81 KB/document |
| 141,356 | 72,000 cases; 4000 solutions; 64,000 CRs; 1200 subcases | 264 MB | 1.9 KB/document |
| 8,667 | 8667 Cases | 25.2 MB | 3 KB/document |
| 35,085 | 30,237 Cases; 4,848 Contacts; 896 File Attachments | 76.3 MB | 2.23 KB/document |
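
If you'd like to run the same per-document math on your own numbers, here's a small sketch that recomputes the disk space per document from the totals above. It assumes binary units (1 MB = 1024 KB, 1 GB = 1024 MB), and the variable and function names are my own.

```python
KB_PER_MB = 1024
KB_PER_GB = 1024 * 1024

# (total documents, total index size in KB), copied from the customer data above
CUSTOMER_INDEXES = [
    (4_794_208, 7.66 * KB_PER_GB),
    (1_729_402, 1.35 * KB_PER_GB),
    (141_356, 264 * KB_PER_MB),
    (8_667, 25.2 * KB_PER_MB),
    (35_085, 76.3 * KB_PER_MB),
]


def kb_per_document(document_count, index_size_kb):
    """Return the average index space used per document, in KB."""
    return index_size_kb / document_count


per_customer = [kb_per_document(docs, size) for docs, size in CUSTOMER_INDEXES]

# Simple average: every customer counts equally.
simple_average = sum(per_customer) / len(per_customer)

# Weighted average: every document counts equally.
weighted_average = (
    sum(size for _, size in CUSTOMER_INDEXES)
    / sum(docs for docs, _ in CUSTOMER_INDEXES)
)

for (docs, _), value in zip(CUSTOMER_INDEXES, per_customer):
    print(f"{docs:>10,} documents: {value:.2f} KB/document")
print(f"simple average:   {simple_average:.2f} KB/document")
print(f"weighted average: {weighted_average:.2f} KB/document")
```

On these five rows, the simple average lands a little over 1.9 KB/document, while weighting by document count gives roughly 1.5 KB/document, because the two largest indexes both come in under the 2 KB mark.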

 

As you can see, there’s some fluctuation compared to the general guideline, so my non-committal response of “it depends” still stands. Regardless, we’re in the right order of magnitude.

How about external files?

 

In addition to objects from the database, Seeker can also index external files. These could be attachments (such as case or subcase attachments), or a collection of files, such as product documentation, whitepapers, etc.

 

The following is specific data from indexing of files:

| Fileset | Total file size | Total disk space for index of these files | Index size as a percentage of file size |
|---|---|---|---|
| 56 PDF files | 90.5 MB | 3.75 MB | 4.1% |
| 56 MS Word DOC files | 32.3 MB | 196 KB | 0.6% |
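
The percentage column is just index size divided by file size. Here's a minimal sketch of that calculation, plus a helper for applying a percentage to a fileset of your own. It assumes 1 MB = 1024 KB, the function names are my own, and two sample filesets are a thin basis for extrapolation, so treat any estimate as a rough starting point.

```python
KB_PER_MB = 1024


def index_to_file_percent(index_size_kb, file_size_kb):
    """Return the index size as a percentage of the original file size."""
    return 100.0 * index_size_kb / file_size_kb


def estimate_file_index_kb(file_size_kb, percent):
    """Estimate the index size in KB for a fileset, given an assumed percentage."""
    return file_size_kb * percent / 100.0


# Figures copied from the file indexing data above.
pdf_percent = index_to_file_percent(3.75 * KB_PER_MB, 90.5 * KB_PER_MB)
doc_percent = index_to_file_percent(196, 32.3 * KB_PER_MB)

print(f"PDF files:      {pdf_percent:.1f} %")
print(f"Word DOC files: {doc_percent:.1f} %")
```

The two filesets differ by a wide margin, which is one more reason the answer remains "it depends".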

Have data to share?

 

If you have Seeker implemented in your environment and haven’t previously shared your Seeker sizing stats with us, please do so. We’d love to hear your specific numbers. Email me at gary@dovetailsoftware.com.

Postlude

 

Hopefully this information is useful when planning out your Dovetail Seeker implementation.

 

Rock on.