Lucene - Core
LUCENE-3435

Create a Size Estimator model for Lucene and Solr

    Details

    • Type: Task
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/other
    • Labels:
      None

      Description

      It is often handy to be able to estimate the amount of memory and disk space that both Lucene and Solr use, given certain assumptions. I intend to check in an Excel spreadsheet that allows people to estimate memory and disk usage for trunk. I propose to put it under dev-tools, as I don't think it should be official documentation just yet and, like the IDE stuff, we'll see how well it gets maintained.

        Activity

        Otis Gospodnetic added a comment -

        Grant - what is your experience with this estimator (the one you just committed)? That is, how often is it right or close (how close?) to what you see in reality, assuming you give it correct input?

        Grant Ingersoll added a comment -

        A good deal of it Mike and I worked out yesterday on IRC (well, mostly Mike explained and I took copious notes). The disk storage stuff is based on LIA2. It is a theoretical model rather than an empirical one, except that the bytes/term calculation was based on indexing Wikipedia.

        I would deem it a gross approximation of the state of trunk at this point in time. My gut says the Lucene estimation is a little low, while Solr is fairly close (since I suspect Solr's memory usage is dominated by caching). I imagine there are things still unaccounted for. For instance, I haven't reverse engineered the fieldValueCache memSize() method yet and I don't have a good sense of how much memory would be consumed in a highly concurrent system by the sheer number of Query objects instantiated or when one has really large Queries (say 5K terms). It also is not meant to be one size fits all. Lucene/Solr have a ton of tuning options that could change things significantly.

        I did a few sanity checks against things I've seen in the past, and thought it was reasonable. There is, of course, no substitute for good testing. In other words, caveat emptor.
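
        The committed spreadsheet is not reproduced in this issue, but as a rough illustration of the kind of back-of-the-envelope arithmetic such a model encodes, here is a minimal Java sketch. Every constant in it is an assumed input chosen for illustration, not a value taken from the actual model.

        // Illustrative only: all constants below are assumed inputs, not values
        // from the committed spreadsheet.
        public class SizeEstimatorSketch {
            public static void main(String[] args) {
                long numDocs = 10000000L;          // assumed corpus size
                long uniqueTerms = 20000000L;      // assumed unique terms across indexed fields
                double bytesPerTerm = 8.0;         // assumed avg. bytes per term in the term dictionary
                long storedBytesPerDoc = 2048L;    // assumed avg. stored-field bytes per document
                int sortFields = 1;                // assumed numeric sort fields held in the field cache
                long ramBufferBytes = 32L * 1024 * 1024; // assumed ramBufferSizeMB of 32

                // Very rough disk estimate: term dictionary plus stored fields
                // (postings, norms, deletions, etc. are deliberately omitted).
                double diskBytes = uniqueTerms * bytesPerTerm + (double) numDocs * storedBytesPerDoc;

                // Very rough RAM estimate: 4 bytes/doc per numeric sort field in the
                // field cache, plus the indexing buffer.
                double ramBytes = (double) numDocs * 4 * sortFields + ramBufferBytes;

                System.out.printf("Estimated disk: %.1f GB%n", diskBytes / (1024.0 * 1024 * 1024));
                System.out.printf("Estimated RAM:  %.1f MB%n", ramBytes / (1024.0 * 1024));
            }
        }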

        Christopher Ball added a comment - edited

        Grant - Great start =)

        Below is some initial feedback (happy to help further if you want to chat in real-time)

        Quickly Grokking - To make it easier to quickly comprehend, the cells that are to be updated in the spreadsheet should be color coded (as opposed to those that are calculated).

        Bytes or Entries - You list Max Size for filterCache, queryResultCache, and documentCache as 512, which subtly implies the size is in bytes when the units of these caches are actually the number of entries. I would clarify the unit of measure (I've seen numerous blogs and emails confuse this).

        Approach to Cache Sizing - Given memory requirements are heavily contingent on caching, I would suggest including at least one approach for how to determine cache size (see the sketch after this list):

        • Query Result Cache
          • Estimation: should be greater than 'number of commonly reoccurring unique queries' x 'number of sort parameters' x 'number of possible sort orders'
        • Document Cache
          • Estimation: should be greater than 'maximum number of documents per query' x 'maximum number of concurrent queries'
        • Filter Cache
          • Estimation: should be number of unique filter queries (should clarify what constitutes 'unique')
        • Field Value Cache
          • Estimation: should be ?
        • Custom Caches
          • Estimation: should be ? - A common use case?
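
        A minimal sketch of the first three rules of thumb above, using made-up workload numbers purely for illustration (all cache sizes are in entries, not bytes):

        // Illustrative only: every workload number is an assumption.
        public class CacheSizingSketch {
            public static void main(String[] args) {
                long commonUniqueQueries = 5000;  // assumed commonly reoccurring unique queries
                int sortParameters = 3;           // assumed sort parameters in use
                int sortOrders = 2;               // ascending and descending
                long maxDocsPerQuery = 50;        // assumed rows requested per query
                long maxConcurrentQueries = 200;  // assumed peak concurrent queries
                long uniqueFilterQueries = 300;   // assumed distinct filter queries

                long queryResultCacheEntries = commonUniqueQueries * sortParameters * sortOrders;
                long documentCacheEntries = maxDocsPerQuery * maxConcurrentQueries;
                long filterCacheEntries = uniqueFilterQueries;

                System.out.println("queryResultCache size (entries) >= " + queryResultCacheEntries);
                System.out.println("documentCache size (entries)    >= " + documentCacheEntries);
                System.out.println("filterCache size (entries)      ~= " + filterCacheEntries);
            }
        }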

        Faceting - Surprised there is no reference to the use of faceting, which is increasingly common default query functionality and further increases memory requirements for effective use.

        Obscure Metrics - To really give this spreadsheet some teeth, there really should be pointers to at least one approach for estimating each input metric (could be on another tab); see the sketch after this list for one way to read a few of the easier ones off an existing index.

        • Some are fairly easy:
          • Number of Unique Terms / field
          • Number of documents
          • Number of indexed fields (no norms)
          • Number of fields w/ norms
          • Number of non-String Sort Fields other than score
          • Number of String Sort Fields
          • Number of deleted docs on avg
          • Avg. number of terms per query
        • Some are quite obscure (and guidance on how to estimate is essential):
          • Number of RAM-based Column Stride Fields (DocValues)
          • ramBufferSizeMB
          • Transient Factor (MB)
          • fieldValueCache Max Size
          • Custom Cache Size (MB)
          • Avg. Number of Bytes per Term
          • Bytes/Term
          • Field Cache bits/term
          • Cache Key Avg. Size (Bytes)
          • Avg QueryResultKey size (in bytes)
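
        For some of the "fairly easy" inputs, one option is to read them straight off an existing index. A minimal sketch, assuming the Lucene 4.x reader APIs, a filesystem index, and a field named "body" (the path and field name are placeholders):

        import java.io.File;

        import org.apache.lucene.index.DirectoryReader;
        import org.apache.lucene.index.MultiFields;
        import org.apache.lucene.index.Terms;
        import org.apache.lucene.store.FSDirectory;

        // Prints a few of the "fairly easy" spreadsheet inputs for an existing index.
        public class IndexMetrics {
            public static void main(String[] args) throws Exception {
                DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));
                try {
                    System.out.println("Number of documents: " + reader.numDocs());
                    System.out.println("Deleted documents:   " + reader.numDeletedDocs());

                    Terms terms = MultiFields.getTerms(reader, "body"); // placeholder field name
                    // Terms.size() may return -1 if the codec cannot report the count cheaply.
                    System.out.println("Unique terms in 'body': " + (terms == null ? 0 : terms.size()));
                } finally {
                    reader.close();
                }
            }
        }
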
        Eric Pugh added a comment -

        A pretty small change, but shouldn't the fieldValueCache Max Size be 10000? If you don't specify a size for fieldValueCache, then Solr on startup generates one with a max size of 10000.

        Grant Ingersoll added a comment -

        A patch would be great for all of these things. Thanks!

        Grant Ingersoll added a comment -

        Still needs some work, but the bones are there and new issues can be opened as necessary.


          People

          • Assignee:
            Grant Ingersoll
          • Reporter:
            Grant Ingersoll
          • Votes:
            0
          • Watchers:
            3
