Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1, 6.0
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      CategoryDocumentBuilder is used to add facet fields to a document. Today the usage is not so straightforward, and I'd like to simplify it. First, to improve usage but also to make cutover to DocValues easier.

      This clsas does two operations: (1) adds drill-down terms and (2) creates the fulltree payload. Today, since it does it all on terms, there's a hairy TokenStream which does both these operations in one go. For simplicity, I'd like to break this into two steps:

      1. Write a TokenStream which adds the drill-down terms
        • For no associations, terms should be indexed w/ DOCS_ONLY (see LUCENE-4623).
        • With associations, drill-down terms have payload too.
        • So EnhancementsDocumentBuilder should be able to extend this stream.
      2. Write some API which can be used to build the fulltree payload (i.e. populate a byte[]). Currently that byte[] will be written to a payload and later to DV.

      Hopefully, I'd like to have FacetsDocumentBuilder (maybe just FacetsBuilder?) which only handles things with no associations, and EnhancementsDocBuilder which extends whatever is needed to add associations.

        Issue Links

          Activity

          Hide
          Shai Erera added a comment - - edited

          Attached patch kinda overhauls how facets are indexed. The concept stays the same, but I sort of rewrote it all. Here's how the code in the patch works:

          • FacetFields (previously CategoryDocumentBuilder) takes an Iterable<CategoryPath> and indexes them in two steps:
            1. DrillDownStream indexes the drill-down tokens, e.g. $facets:Author, $facets:Author/Bob
            2. CategoryListBuilder encodes the category ordinals into a Map<String,BytesRef> (more on this later), which is later set as the payload of $facets:$fulltree$
            • Both these steps work per CategoryListParams (in case the application wishes to index groups of facets in different category lists)
          • AssociationsFacetFields (previously EnhancementsDocumentBuilder) extends FacetFields and takes a CategoryAssociationsContainer (which implements Iterable<CategoryPath>) and holds a mapping from a CategoryPath to CategoryAssociation
            • AssociationsDrillDownStream extends DrillDownStream and adds association values to the drill-down tokens' payloads
            • AssociationsCategoryListBuilder extends CategoryListBuilder and encodes <category,association-value> pairs into their own BytesRef
          • CategoryAssociation replaces CategoryEnhancement and simplifies the usage (end-user wise) a lot !
            • Two implementations CategoryIntAssociation and CategoryFloatAssociation support the previously AssociationEnhancement + AssociationInt/FloatProperty and allow associating an int/float value with a category
            • Every CategoryAssociation impl is responsible for serialization of its value to a ByteArrayDataOutput (and de-serialize from ByteArrayDataInput)
            • Every implementation needs to specify its categoryListID, since it determines the term payload under which the association values are encoded
            • The two FacetRequests, AssociationIntSumFacetRequest and AssociationFloatSumFacetRequest, work with CategoryAssociation rather than the enhancement
            • EnhancementsIndexingParams were removed, and togeher with them all the 'attributes', 'enhancements' and 'streaming' packages
          • The Map<String,BytesRef> easily supports partitions and associations:
            • When simple categories are indexed, no partitions, a single entry exists in the map
            • When simple categories are indexed with partitions, an entry per partition exists in the map, e.g. $facets:$fulltree$1, $facets:$fulltree$2 etc.
            • When associations are indexed, the map contains the ordinals list (as described above) and the association values per CategoryAssociation.getCategoryListID(), so e.g. an int association is encoded into a different BytesRef than a float one

          I chose to implement it all from scratch because the code was very intertwined and complicated, much because of a very complicated desing for enhancements. At least to me, the code is now much simpler. Migrating facets from this code to DocValues should be an easy task - all that needs to be done is to write the output of CategoryListBuilder to a DocValues field, rather than a TokenStream payload.

          The patch is huge, but mostly because of all the code that's been removed. When you review it, focus on the classes mentioned above.

          NOTE: the new associations code breaks backwards compatibility with old indexes:

          1. Previously both the int and float associations were written to the same associations list as integers, and the float one used Float.intBitsToFloat and vice versa. Now they are written to two separate lists
          2. Previously the code supported multiple enhancements which affected how they were encoded (e.g. #ENHANCEMENTS, #ENH_LENGTHS, #ENH_BYTES). But we always had only one enhancement (AssociationEnhancement), so that encoding was really redundant.
          3. In order to support multiple CategoryAssociations per CategoryPath, one can easily write a CompoundAssociation and take care of its serialization.

          Since enhancements/associations are quite an advanced feature, I think this break makes sense. We can always add a backwards layer later if someone complains (and cannot reindex). Keeping the previous code, which was prepared to handle multiple CategoryEnhancement types, while only one enhancement was available to our users did not make sense to me.

          Show
          Shai Erera added a comment - - edited Attached patch kinda overhauls how facets are indexed. The concept stays the same, but I sort of rewrote it all. Here's how the code in the patch works: FacetFields (previously CategoryDocumentBuilder ) takes an Iterable<CategoryPath> and indexes them in two steps: DrillDownStream indexes the drill-down tokens, e.g. $facets:Author , $facets:Author/Bob CategoryListBuilder encodes the category ordinals into a Map<String,BytesRef> (more on this later), which is later set as the payload of $facets:$fulltree$ Both these steps work per CategoryListParams (in case the application wishes to index groups of facets in different category lists) AssociationsFacetFields (previously EnhancementsDocumentBuilder ) extends FacetFields and takes a CategoryAssociationsContainer (which implements Iterable<CategoryPath> ) and holds a mapping from a CategoryPath to CategoryAssociation AssociationsDrillDownStream extends DrillDownStream and adds association values to the drill-down tokens' payloads AssociationsCategoryListBuilder extends CategoryListBuilder and encodes <category,association-value> pairs into their own BytesRef CategoryAssociation replaces CategoryEnhancement and simplifies the usage (end-user wise) a lot ! Two implementations CategoryIntAssociation and CategoryFloatAssociation support the previously AssociationEnhancement + AssociationInt/FloatProperty and allow associating an int/float value with a category Every CategoryAssociation impl is responsible for serialization of its value to a ByteArrayDataOutput (and de-serialize from ByteArrayDataInput ) Every implementation needs to specify its categoryListID , since it determines the term payload under which the association values are encoded The two FacetRequests , AssociationIntSumFacetRequest and AssociationFloatSumFacetRequest , work with CategoryAssociation rather than the enhancement EnhancementsIndexingParams were removed, and togeher with them all the 'attributes', 'enhancements' and 'streaming' packages The Map<String,BytesRef> easily supports partitions and associations: When simple categories are indexed, no partitions, a single entry exists in the map When simple categories are indexed with partitions, an entry per partition exists in the map, e.g. $facets:$fulltree$1 , $facets:$fulltree$2 etc. When associations are indexed, the map contains the ordinals list (as described above) and the association values per CategoryAssociation.getCategoryListID() , so e.g. an int association is encoded into a different BytesRef than a float one I chose to implement it all from scratch because the code was very intertwined and complicated, much because of a very complicated desing for enhancements. At least to me, the code is now much simpler. Migrating facets from this code to DocValues should be an easy task - all that needs to be done is to write the output of CategoryListBuilder to a DocValues field, rather than a TokenStream payload. The patch is huge, but mostly because of all the code that's been removed. When you review it, focus on the classes mentioned above. NOTE: the new associations code breaks backwards compatibility with old indexes: Previously both the int and float associations were written to the same associations list as integers, and the float one used Float.intBitsToFloat and vice versa. Now they are written to two separate lists Previously the code supported multiple enhancements which affected how they were encoded (e.g. #ENHANCEMENTS, #ENH_LENGTHS, #ENH_BYTES ). But we always had only one enhancement ( AssociationEnhancement ), so that encoding was really redundant. In order to support multiple CategoryAssociations per CategoryPath , one can easily write a CompoundAssociation and take care of its serialization. Since enhancements/associations are quite an advanced feature, I think this break makes sense. We can always add a backwards layer later if someone complains (and cannot reindex). Keeping the previous code, which was prepared to handle multiple CategoryEnhancement types, while only one enhancement was available to our users did not make sense to me.
          Hide
          Sivan Yogev added a comment -

          As the author of most of the code being removed by this patch I think the new API is much better than the previous

          Show
          Sivan Yogev added a comment - As the author of most of the code being removed by this patch I think the new API is much better than the previous
          Hide
          Michael McCandless added a comment -

          This looks awesome Shai!

          I like the new name (FacetFields.addFields(..)), and I like how
          FacetFields is now stateless (ie you can use a single instance across
          threads adding documents), and this is a great first step to make the
          DocValues cutover easier.

          Show
          Michael McCandless added a comment - This looks awesome Shai! I like the new name (FacetFields.addFields(..)), and I like how FacetFields is now stateless (ie you can use a single instance across threads adding documents), and this is a great first step to make the DocValues cutover easier.
          Hide
          Shai Erera added a comment -

          I like how FacetFields is now stateless (ie you can use a single instance across threads adding documents)

          Notice that I put a TODO to make this class reusable. I.e. currently it just allocates a new TokenStream and Field instances on every addFields (like CategoryDocumentBuilder did before).
          If we'll do that, it will have state, but will be reusable per-thread.

          With those warming comments, I will commit these changes .

          Show
          Shai Erera added a comment - I like how FacetFields is now stateless (ie you can use a single instance across threads adding documents) Notice that I put a TODO to make this class reusable. I.e. currently it just allocates a new TokenStream and Field instances on every addFields (like CategoryDocumentBuilder did before). If we'll do that, it will have state, but will be reusable per-thread. With those warming comments, I will commit these changes .
          Hide
          Shai Erera added a comment -

          Committed changes to trunk and 4x. I'm sure there will be more improvements / simplifications to that API, we can do them incrementally.

          Show
          Shai Erera added a comment - Committed changes to trunk and 4x. I'm sure there will be more improvements / simplifications to that API, we can do them incrementally.
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1428324

          LUCENE-4647: Simplify CategoryDocumentBuilder

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1428324 LUCENE-4647 : Simplify CategoryDocumentBuilder
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1428320

          LUCENE-4647: Simplify CategoryDocumentBuilder

          Show
          Commit Tag Bot added a comment - [trunk commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1428320 LUCENE-4647 : Simplify CategoryDocumentBuilder

            People

            • Assignee:
              Shai Erera
              Reporter:
              Shai Erera
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development