Lucene - Core / LUCENE-4602

Use DocValues to store per-doc facet ord

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2, 5.0
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Spinoff from LUCENE-4600

      DocValues can be used to hold the byte[] encoding all facet ords for
      the document, instead of payloads. I made a hacked-up approximation
      of in-RAM DV (see CachedCountingFacetsCollector in the patch) and the
      gains were surprisingly large:

                          Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                      HighTerm        0.53      (0.9%)        1.00      (2.5%)   87.3% (  83% -   91%)
                       LowTerm        7.59      (0.6%)       26.75     (12.9%)  252.6% ( 237% -  267%)
                       MedTerm        3.35      (0.7%)       12.71      (9.0%)  279.8% ( 268% -  291%)
      

      I didn't think payloads were THAT slow; I think it must be the advance
      implementation?

      We need to separately test on-disk DV to make sure it's at least
      on-par with payloads (but hopefully faster) and if so ... we should
      cutover facets to using DV.
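
      As a purely illustrative sketch (not the attached patch), the read side of
      that idea could look like the following: pull one document's byte[] from a
      DocValues field and decode it into ordinals, bumping a count per ordinal.
      The dgap + VInt-style encoding, class name and method name here are
      assumptions made for illustration only.

          import org.apache.lucene.util.BytesRef;

          class OrdinalDecodingSketch {
            /** Decodes delta-gap + VInt-encoded ordinals from one document's bytes and counts them. */
            static void countDoc(BytesRef buf, int[] counts) {
              int ord = 0;
              int pos = buf.offset;
              final int end = buf.offset + buf.length;
              while (pos < end) {
                int delta = 0;
                int shift = 0;
                byte b;
                do {                          // 7 data bits per byte, high bit means "more bytes follow"
                  b = buf.bytes[pos++];
                  delta |= (b & 0x7F) << shift;
                  shift += 7;
                } while ((b & 0x80) != 0);
                ord += delta;                 // gaps are deltas between the sorted ordinals
                counts[ord]++;
              }
            }
          }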

      Attachments

      1. TestFacetsPayloadMigrationReader.java
        17 kB
        Shai Erera
      2. LUCENE-4602.patch
        24 kB
        Michael McCandless
      3. LUCENE-4602.patch
        20 kB
        Michael McCandless
      4. LUCENE-4602.patch
        99 kB
        Shai Erera
      5. LUCENE-4602.patch
        102 kB
        Shai Erera
      6. LUCENE-4602.patch
        159 kB
        Shai Erera
      7. FacetsPayloadMigrationReader.java
        7 kB
        Shai Erera


          Activity

          Michael McCandless added a comment -

          Patch with another prototype DV-backed collector. This one only works
          for single-valued fields (in "dimension" terminology, I think: the
          document can have multiple dimensions as long as each dimension has
          only one category path under it).

          It stores only a single ord per doc X
          field into a PackedLongDocValuesField. The collector then aggregates
          per-segment, during collection, but only the leaf ords, and then at
          the end, it walks up the hierarchy, summing counts to the parent ords.
          It's the fastest impl so far:

                              Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                          HighTerm        0.71      (0.5%)        1.60      (2.3%)  124.1% ( 120% -  127%)
                           MedTerm        4.43      (0.5%)       31.39      (6.6%)  609.1% ( 599% -  618%)
                           LowTerm       10.28      (0.4%)       75.37      (3.8%)  633.0% ( 626% -  639%)
          

          But it's somewhat hacked up (storing DV field directly myself)... Shai
          explained that it's possible now to have facets store only the leaf
          ord, so once we get DV cleanly integrated we should try that and
          re-test.
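
          For illustration, a hedged sketch of the indexing side being described:
          one numeric doc value per dimension, holding only the leaf ordinal.
          PackedLongDocValuesField is the field type named above; the "facet_"
          field-naming convention, the CategoryPath usage and the helper class
          itself are assumptions made for this sketch.

              import java.io.IOException;

              import org.apache.lucene.document.Document;
              import org.apache.lucene.document.PackedLongDocValuesField;
              import org.apache.lucene.facet.taxonomy.CategoryPath;
              import org.apache.lucene.facet.taxonomy.TaxonomyWriter;

              class SingleOrdIndexingSketch {
                /** Adds one numeric doc value per dimension, holding only the leaf ordinal. */
                static void addFacet(TaxonomyWriter taxoWriter, Document doc,
                                     String dimension, String value) throws IOException {
                  // adding the category also registers all of its parents in the taxonomy
                  int leafOrd = taxoWriter.addCategory(new CategoryPath(dimension, value));
                  doc.add(new PackedLongDocValuesField("facet_" + dimension, leafOrd)); // hypothetical field name
                }
              }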

          Shai Erera added a comment -

          Shai explained that it's possible now to have facets store only the leaf ord

          This is a long shot (I haven't tried it yet), but I think that if you implement an OrdinalPolicy which always returns false, then only the leaf node will be written. I.e., looking at the CategoryParentsStream.incrementToken() code, which CategoryDocumentBuilder uses to encode all the parents:

                int ordinal = this.ordinalProperty.getOrdinal();
                if (ordinal != -1) {
                  ordinal = this.taxonomyWriter.getParent(ordinal);
                  if (this.ordinalPolicy.shouldAdd(ordinal)) {
                    this.ordinalProperty.setOrdinal(ordinal);
                    try {
                      this.categoryAttribute.addProperty(ordinalProperty);
                    } catch (UnsupportedOperationException e) {
                      throw new IOException(e.getLocalizedMessage());
                    }
                    added = true;
                  } else {
                    this.ordinalProperty.setOrdinal(-1);
                  }
                }
          

          It looks like the leaf ordinal is added anyway; I still haven't tracked down where ... if you don't get to try it today, I'll verify tomorrow that an OrdinalPolicy that returns false indeed ends up with just the leaf nodes written.

          We should compare the current code + leaves only to the DV code + leaves only, to get a better estimate of the gains we should expect when moving to DV.

          Shai Erera added a comment -

          I tried it by writing such OrdinalPolicy and indeed only the leaf nodes were written. Here's the code:

          DefaultFacetIndexingParams indexingParams = new DefaultFacetIndexingParams() {
          
            @Override
            protected OrdinalPolicy fixedOrdinalPolicy() {
              return new OrdinalPolicy() {
                public void init(TaxonomyWriter taxonomyWriter) {}
                public boolean shouldAdd(int ordinal) { return false; }
              };
            }
          };
          

          Can you try current+leaves using this code?

          Shai Erera added a comment -

          I think that we should make this a concrete impl in the code, as a singleton. DefaultOrdPolicy can be a singleton too. Opened LUCENE-4604 to track that.

          Shai Erera added a comment -

          I reviewed DocValuesCountingFacetsCollector, nice work!

          See my last comment on LUCENE-4565 about taxoReader.getParent, vs. using the parents[] directly. Specifically, I wonder if we'll see any gain if we move to use the parents[] array directly, instead of getParent (in getFacetResults):

          +      if (count != 0) {
          +        int ordUp = taxoReader.getParent(ord); // HERE
          +        while(ordUp != 0) {
          +          //System.out.println("    parent=" + ordUp + " cp=" + taxoReader.getPath(ordUp));
          +          counts[ordUp] += count;
          +          ordUp = taxoReader.getParent(ordUp); // AND HERE
          +        }
          +      }
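
          To make that concrete, here is a hedged sketch of the same rollup using the
          parents array directly; the array is assumed to come from the taxonomy reader
          (via its parent-array accessor), and counts[] is assumed to be indexed by ordinal.

              // Sketch only: roll each collected count up to all of its ancestors using the
              // int[] parents array, instead of calling taxoReader.getParent() per step.
              // Ordinal 0 is the taxonomy root, so the walk stops there.
              static void rollup(int[] parents, int[] counts) {
                for (int ord = 0; ord < counts.length; ord++) {
                  final int count = counts[ord];
                  if (count != 0) {
                    for (int p = parents[ord]; p > 0; p = parents[p]) {
                      counts[p] += count;
                    }
                  }
                }
              }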
          
          Michael McCandless added a comment -

          Ooh, you're right: I should just pull the int[] parent array and access it directly.

          I don't think it'll help in my test (very few non-leaf ordinals in my hierarchy) but let's make sure we do that under LUCENE-4610.

          Michael McCandless added a comment -

          OK good news! I hacked up a way to index the byte[] into DocValues
          field instead of payloads, and modified the previous
          CachingFacetsCollector to use DocValues instead of its own hacked
          cache (renamed it to DocValuesFacetsCollector):

                              Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                          HighTerm        1.27      (2.9%)        2.29      (2.7%)   80.2% (  72% -   88%)
                           MedTerm        4.79      (1.3%)       14.83      (4.0%)  209.5% ( 201% -  217%)
                           LowTerm       10.50      (0.8%)       33.84      (1.9%)  222.3% ( 217% -  226%)
          

          This is only a bit slower than my original hacked up
          CachingFacetsCollector results, so net/net DocValues looks to be just
          as good.

          That was for in-RAM DocValues. Then I tested with DirectSource
          (leaves DocValues on disk, but the file is hot (in OS's IO cache) in
          this test):

                              Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                          HighTerm        1.26      (1.1%)        1.43      (1.0%)   13.8% (  11% -   16%)
                           MedTerm        4.78      (0.5%)       10.22      (1.7%)  113.9% ( 111% -  116%)
                           LowTerm       10.49      (0.4%)       27.95      (1.4%)  166.6% ( 164% -  168%)
          

          Not bad! Only a bit slower than in RAM ... so net/net I think we
          should cutover facets to DVs?
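
          For readers not familiar with the two read paths being compared, here is a
          minimal sketch under the assumption of the 4.x DocValues API of that time
          (getSource() caches the values in RAM, getDirectSource() leaves them on disk);
          the "$facets" field name is a placeholder, not necessarily what the patch uses.

              import java.io.IOException;

              import org.apache.lucene.index.AtomicReader;
              import org.apache.lucene.index.DocValues;
              import org.apache.lucene.util.BytesRef;

              class DocValuesReadPathSketch {
                /** Reads one document's encoded facet ordinals via the in-RAM or on-disk source. */
                static BytesRef readOrds(AtomicReader reader, int docID, boolean direct) throws IOException {
                  DocValues dv = reader.docValues("$facets");           // hypothetical facet field
                  DocValues.Source src = direct ? dv.getDirectSource()  // values stay on disk
                                                : dv.getSource();       // values loaded into RAM
                  return src.getBytes(docID, new BytesRef());
                }
              }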

          Shai Erera added a comment -

          Awesome!

          I think we should then focus on making the cutover to DV!

          First though, we should have a migration plan. I would prefer that we didn't force all our existing customers to do a one-time index upgrade; I'm not at all sure people will be thrilled with the idea. Just to clarify, our (= my) customers are not the ones that hold the indexes; they are the ones that develop the products and will have to face the real end customer and tell him that he needs to upgrade his potentially uber-large index ...

          So what I was thinking is whether the CLI (CategoryListIterator) abstraction layer could be used to fetch the ordinals from either payloads or DV, based on the segment's version? And also, could we migrate segments gradually, e.g. as they are merged? Would we need a FacetsAtomicReader for that, which initializes the CLI per segment version?

          Is it even possible to move payload data to DV during segment merging? I.e., even in the one-time upgrade...

          Please don't mention re-indexing.

          Shai Erera added a comment -

          On the patch side, I don't have any comments except: awesome!

          Robert Muir added a comment -

          Nice improvement!

          Shai: I don't think we should block progress in open source over your specific customers' back-compat needs that go above and beyond Lucene's back-compat promise.

          Ultimately, that's something only you care about, and something your customers pay you to solve.

          Shai Erera added a comment -

          Who said anything about blocking progress? All I'm saying is that before we release these improvements, we need to have a migration plan. These are not just my customers. I think that there are other people that use this module, at least from questions that pop up here and there on the list.

          Sure, we can tell everyone to re-index. But that's not how I prefer to work. I don't think that cutting over to DV is the only migration we should talk about. E.g. LUCENE-4623 would also require migration and any change in the future to how we decide to store/encode facets would require migration.

          It would be good if we can think about a layer that will provide that migration. Today we have Codecs and Lucene guarantees that old segments will be read w/ old Codecs versions (per our back-compat policy). What I would like to develop is something similar, which can read facets from old segments in the old way, and ultimately when segments are merged, migrate data to the new format. Then we can tell customers that if they didn't migrate their indexes when Lucene 6.0 is released, they have to addIndexes or forceMerge or something.

          I know that this module has the @lucene.experimental tag on it all over the place, but I don't treat it as experimental at all. I would prefer that you help me develop this migration layer, even if just by contributing ideas, rather than tell me that it's my problem and that I get paid to solve it.

          I don't think that we should release code that breaks all apps out there and forces them to reindex. Unless the changes are really non-migratable. But in this case, I think it should be easy? If you want to chime in, I'll open a separate issue to discuss this.

          Shai Erera added a comment -

          I created LUCENE-4647 to simplify CatDocBuilder so that we can make this cutover easily (while still supporting all existing features, e.g. associations).

          Shai Erera added a comment -

          Patch cuts FacetFields over to DocValues. Changes included:

          • FacetFields adds a StraightBytesDocValuesField instead of a payload (see the indexing sketch after this comment).
          • Added DocValuesCategoryListIterator which by default pulls an in-RAM DocValues Source (i.e. not a DirectSource) but can be configured otherwise. Perhaps CategoryListParams.createCategoryListIterator should take a boolean to pass down to DVCLI ...
          • Replaced CategoryListParams.Term with a String field. So now the CLP only defines the field it should go under (the term was a mistake made a long time ago, to allow control of the term under which the ordinals payload is written).
          • All tests pass. I still haven't added a CHANGES entry; will do so later.

          I must say that with all the recent refactorings done to the facets package, the DV cutover took me literally 5 minutes!

          Mike, I think this is ready for luceneutil.
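
          To make the cutover concrete, here is a hedged indexing sketch of what the
          first bullet describes; the addFields signature, the example category and the
          surrounding boilerplate are assumptions, not code copied from the patch.

              import java.util.Collections;

              import org.apache.lucene.document.Document;
              import org.apache.lucene.facet.index.FacetFields;
              import org.apache.lucene.facet.taxonomy.CategoryPath;
              import org.apache.lucene.facet.taxonomy.TaxonomyWriter;
              import org.apache.lucene.index.IndexWriter;

              class FacetFieldsSketch {
                /** Indexes one document with a single category; FacetFields is expected to add
                 *  the encoded ordinals as a binary doc values field rather than a payload. */
                static void index(IndexWriter writer, TaxonomyWriter taxoWriter) throws Exception {
                  FacetFields facetFields = new FacetFields(taxoWriter);
                  Document doc = new Document();
                  facetFields.addFields(doc, Collections.singletonList(new CategoryPath("Author", "Shai Erera")));
                  writer.addDocument(doc);
                }
              }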

          Shai Erera added a comment -

          I also wrote a one-time upgrade utility, FacetsPayloadMigrationReader. It's very similar in concept to OrdinalMappingAtomicReader in that it wraps another AtomicReader and exposes the facets payload as a DocValues.Source. Its constructor takes a mapping from field to Term, which denotes which term's payload the ordinals should be read from and which DocValues field they should be put under. It also supports partitions (that was the harder part).

          I would appreciate it if someone reviewed it. I wrote a matching test which is very extensive, I think, and covers multiple CLPs, partitions and deleted documents. I will attach it shortly.

          This Reader and Test should go only under the 4x branch, as deprecated.

          There is one nocommit in the patch regarding CategoryListCache. I think that now that we're on DocValues, it's not needed anymore (and also Mike compared DV to it before and the results weren't compelling). So I think we should remove it, but haven't yet done so.
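
          A hedged usage sketch of the migration path just described: wrap each old
          segment and rewrite it into a new index via addIndexes. The constructor
          signature of FacetsPayloadMigrationReader, the "$facets" field and its
          "$fulltree$" payload term are assumptions for illustration, not taken from
          the attached code.

              import java.util.HashMap;
              import java.util.Map;

              import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
              import org.apache.lucene.index.AtomicReader;
              import org.apache.lucene.index.AtomicReaderContext;
              import org.apache.lucene.index.DirectoryReader;
              import org.apache.lucene.index.IndexWriter;
              import org.apache.lucene.index.IndexWriterConfig;
              import org.apache.lucene.index.Term;
              import org.apache.lucene.store.Directory;
              import org.apache.lucene.util.Version;
              // FacetsPayloadMigrationReader itself is the attached class; import it from
              // wherever the patch places it.

              class PayloadToDocValuesMigrationSketch {
                /** Rewrites a payload-based facet index into newDir with DocValues-based facets. */
                static void migrate(Directory oldDir, Directory newDir) throws Exception {
                  Map<String, Term> fieldToPayloadTerm = new HashMap<String, Term>();
                  // placeholder mapping: which term's payload holds the ordinals for each DV field
                  fieldToPayloadTerm.put("$facets", new Term("$facets", "$fulltree$"));

                  DirectoryReader oldReader = DirectoryReader.open(oldDir);
                  IndexWriter writer = new IndexWriter(newDir,
                      new IndexWriterConfig(Version.LUCENE_42, new WhitespaceAnalyzer(Version.LUCENE_42)));
                  try {
                    for (AtomicReaderContext ctx : oldReader.leaves()) {
                      AtomicReader migrated = new FacetsPayloadMigrationReader(ctx.reader(), fieldToPayloadTerm);
                      writer.addIndexes(migrated); // merges the wrapped (migrated) segment into the new index
                    }
                  } finally {
                    writer.close();
                    oldReader.close();
                  }
                }
              }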

          Shai Erera added a comment -

          Test for migration

          Shai Erera added a comment -

          Forgot that I renamed some classes (removing "Payload"), so this time the patch is generated with --show-copies-as-adds.

          Michael McCandless added a comment -

          I tested this with Wikipedia (avg ~25 ords per doc across 9 dimensions; 2.5M unique ords):

                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                PKLookup      188.75      (5.3%)      195.05      (2.1%)    3.3% (  -3% -   11%)
                HighTerm        3.26      (0.8%)        3.80      (3.4%)   16.3% (  12% -   20%)
                 MedTerm        6.85      (0.8%)        9.14      (3.1%)   33.4% (  29% -   37%)
                 LowTerm       14.45      (1.5%)       24.41      (2.1%)   68.9% (  64% -   73%)
          

          Nice!

          It's odd how much the pctg gains differ across High/Med/LowTerm ... not sure why.

          Shai Erera added a comment -

          Indeed nice results, though they are still far from the uber-specialized Collectors you wrote.

          I think that we should see some more gains after we fix LUCENE-4620 (I'm going to do that tomorrow!).

          Also, I want to write a special DGapVIntEncoder/Decoder - that should hopefully somewhat improve decoding time too. I opened LUCENE-4686.

          And I'm not sure if you used PackedInts to encode/decode the ordinals in your previous patches. If you were, then maybe LUCENE-4609 will bring some more improvements.

          Overall, I think this is good progress. With this patch, facets are now on DocValues, supporting all features. There are more optimizations / specializations to do - we should do them separately.

          Mike, I remember in one of our chats we discussed the effectiveness of CategoryListCache. I seem to remember you said it had high RAM consumption, and also it didn't perform well. Do you perhaps have these results? I wonder if you can compare this patch (with in-mem DV) to a CategoryListCache CLI - if the results are not good, we should just nuke it.

          Shai Erera added a comment -

          Found your comments about CategoryListCache

          I had previously tested the existing int[][][] cache (CategoryListCache) separately, but it had smaller gains than this (73% for MedTerm), and it required more RAM (1.9 GB vs 377 MB for this patch).

          That was on 3 ordinals per document, and already consuming a very large piece of RAM. Also, the gains are not considerable vs the DocValues and PackedBytes versions that you had. And I assume that testing it with 25 ordinals per document is going to be even more costly.
          In addition, CategoryListCache is not per-segment, so if we want to keep it, we'd need to do some major rewriting there. I suggest that we just nuke it for now and come back to it later. The problem with keeping it is that we need to maintain it, and if it's not that much better, I prefer to nuke it.

          Michael McCandless added a comment -

          +1 to nuke CategoryListCache.

          Michael McCandless added a comment -

          Patch looks good, at least the parts that I understand: +1

          Nice that drill-down fields are now indexed as DOCS_ONLY!!

          Shai Erera added a comment -

          Patch nukes CategoryListCache. Apparently TotalFacetCounts's API had it all over the place, but all their tests passed null! The cache was tested separately though ...

          Anyway, nuked all the relevant stuff, as well as FacetRequest.createCLI (this was there just in case someone set a CLCache on FacetSearchParams); creating the CLI is now controlled by CLParams. I think that the cache should be controlled by that too ... but if we ever add such a cache back, we'll have the opportunity to put it there.

          I think this is ready. I'll give it a couple more test runs and then commit.

          Shai Erera added a comment -

          I've decided to add the migration code to trunk as well, because 5.0 is supposed to handle 4x indexes too. Anyway, it doesn't hurt that it's there. I improved the testing more, and added a static utility method which will make using it (doing the migration) easier.

          I beasted some, 'precommit' is happy, so will commit it shortly.

          Commit Tag Bot added a comment -

          [trunk commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1433869

          LUCENE-4602: migrate facets to DocValues

          Shai Erera added a comment -

          Committed to trunk and 4x. For 4x I had to add @SuppressCodecs("Lucene3x").

          Shai Erera added a comment -

          Thanks Mike. This is one great and important milestone for facets!

          Commit Tag Bot added a comment -

          [branch_4x commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1433878

          LUCENE-4602: migrate facets to DocValues

          Uwe Schindler added a comment -

          Closed after release.


            People

            • Assignee: Shai Erera
            • Reporter: Michael McCandless
            • Votes: 0
            • Watchers: 3
