SOLR-10292

Add cartesian Streaming Expression to build cartesian products from multi-value fields and text fields

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.6
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      Currently all the Streaming Expressions, such as rollups, intersections, fetch etc., work on single-valued fields. The cartesian expression would create a stream of tuples from a single tuple with a multi-valued field. This would allow multi-valued fields to be operated on by the wider library of Streaming Expressions.

      For example a single tuple with a multi-valued field:

      id: 1
      author: [Jim, Jack, Steve]

      would be transformed into the following three tuples:

      id:1
      author:Jim

      id:1
      author:Jack

      id:1
      author:Steve
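
The transformation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Solr implementation; the function name `explode` and the use of plain dicts for tuples are assumptions made for the example.

```python
from typing import Any, Dict, Iterator


def explode(tuple_: Dict[str, Any], field: str) -> Iterator[Dict[str, Any]]:
    """Yield one copy of the tuple per value of a multi-valued field."""
    values = tuple_.get(field)
    if not isinstance(values, list):
        # Single-valued (or missing) field: pass the tuple through unchanged.
        yield dict(tuple_)
        return
    for value in values:
        out = dict(tuple_)   # copy every other field as-is
        out[field] = value   # replace the list with one of its values
        yield out


doc = {"id": 1, "author": ["Jim", "Jack", "Steve"]}
result = list(explode(doc, "author"))
```

Here `result` holds three dicts, one per author, each still carrying `id: 1`.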

      1. SOLR-10292.patch
        31 kB
        Dennis Gove
      2. SOLR-10292.patch
        29 kB
        Dennis Gove
      3. SOLR-10292.patch
        22 kB
        Dennis Gove
      4. SOLR-10292.patch
        20 kB
        Dennis Gove
      5. SOLR-10292.patch
        12 kB
        Dennis Gove

        Activity

        dpgove Dennis Gove added a comment - - edited

        I'm not a huge fan of the function name, but feature-wise I think this would be good.

        cartesian(
          <stream>,
          by="field1,field2",
          sort="field1 ASC"
        )
        

        1. Supports any incoming stream
        2. Allows you to do the product over multiple fields
        3. Allows you to indicate a sort order for the new tuples; if not provided, it defaults to the incoming order of values in the by fields
        4. If a non-array value exists in the by fields then this will just return that single tuple, with no need to error out. This allows for fields where a mixture of multi-valued and single-valued data exists.

        Anything I'm missing?

        joel.bernstein Joel Bernstein added a comment -

        I was basing the name off of this:
        http://docs.aws.amazon.com/machine-learning/latest/dg/data-transformations-reference.html#cartesian-product-transformation

        One of the things I was considering is making the cartesian function be the opposite of reduce. In this scenario we allow for cartesian operations that would take in a single tuple and return an array of tuples. The tuples would then be streamed from the cartesian function rather than returned as an array.

        This would allow us to add cartesian operations on text fields, which would use tokenizers to emit sentences, shingles, key phrases etc.
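
As a rough illustration of the "opposite of reduce" idea, here is a Python sketch in which an operation turns one field value into a list and the surrounding function streams one tuple per element instead of returning the list. The names `cartesian` and `shingles`, and the whitespace tokenization, are assumptions for this example rather than anything taken from the patch.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def shingles(text: str, size: int = 2) -> List[str]:
    """Return word n-grams of the given size (whitespace tokenization)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def cartesian(stream: Iterable[Record], field: str,
              op: Callable[[str], List[str]]) -> Iterator[Record]:
    """Stream one output tuple per value that op produces for each input tuple."""
    for record in stream:
        for value in op(record[field]):
            yield {**record, field: value}  # emitted lazily, never as one array


docs = [{"id": 1, "body": "solr streams tuples lazily"}]
emitted = [r["body"] for r in cartesian(docs, "body", shingles)]
```

Because `cartesian` is a generator, downstream consumers receive tuples one at a time, which mirrors the streaming behaviour described in the comment.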

        joel.bernstein Joel Bernstein added a comment - - edited

        Here is a potential pipeline:

        cartesian(cartesian(expr, 
                            sentences(field)), 
                  keyPhrases(field))
        

        The inner cartesian emits tuples with sentences and the outer cartesian emits tuples with keyPhrases.
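
The nesting can be mimicked in Python by feeding one generator into another. In this sketch `sentences` is a simple regex splitter and `long_words` is a deliberately crude stand-in for key-phrase extraction (it is not a real key-phrase algorithm); both names, and the `target` parameter, are hypothetical.

```python
import re
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def sentences(text: str) -> List[str]:
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def long_words(text: str, min_len: int = 6) -> List[str]:
    """Stand-in for keyPhrases: keep alphabetic words of at least min_len letters."""
    return [w for w in re.findall(r"[A-Za-z]+", text) if len(w) >= min_len]


def cartesian(stream: Iterable[Record], source: str,
              op: Callable[[str], List[str]], target: str) -> Iterator[Record]:
    """Emit one tuple per value of op(record[source]), stored under target."""
    for record in stream:
        for value in op(record[source]):
            yield {**record, target: value}


docs = [{"id": 1, "body": "Solr streams tuples. Evaluators emit arrays."}]
inner = cartesian(docs, "body", sentences, "sentence")        # one tuple per sentence
outer = cartesian(inner, "sentence", long_words, "phrase")    # one tuple per phrase
pairs = [(r["sentence"], r["phrase"]) for r in outer]
```

Each sentence tuple from the inner stage fans out again in the outer stage, so `pairs` lists every phrase next to the sentence it came from.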

        dpgove Dennis Gove added a comment -

        I like this. Also, a new set of evaluator types which return arrays/lists.

        joel.bernstein Joel Bernstein added a comment -

        Here are some text cartesian operations that might be interesting to support:

        1) Regex extraction
        2) Sentence
        3) Key phrases
        4) Shingles

        dpgove Dennis Gove added a comment -

        Implements everything except the read() function.

        Expression is

        cartesian(
          <stream>,
          <fieldName | evaluator> [as newName],
          <fieldName | evaluator> [as newName],
          [productSort="how to order new tuples"]
        )
        

        1. Will create a tuple for each value in the field, and return in the order the values appear in the field

        cartesian(
          <stream>,
          multivaluedField
        )
        

        2. Will create a tuple for each value in the field, and return them in ascending order of the values in the field

        cartesian(
          <stream>,
          multivaluedField,
          productSort="multivaluedField ASC"
        )
        

        3. Will create a tuple for each value in the evaluated expression, putting the value in the same fieldName, and return new tuples in ascending order of the evaluated values

        cartesian(
          <stream>,
          sentence(fieldA) as fieldA,
          productSort="fieldA ASC"
        )
        

        4. Will create a tuple for each value in the evaluated regex and sentence

        cartesian(
          <stream>,
          sentence(fieldA) as newField,
          regexGroups(fieldB, "some regex expression generating groups") as fieldB,
          productSort="fieldB ASC, newField DESC"
        )
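
A multi-key sort such as `productSort="fieldB ASC, newField DESC"` can be sketched in Python as follows. This assumes the evaluators have already produced their value lists; the function name `product_with_sort` and the `(field, descending)` pair format are hypothetical choices for the example.

```python
from itertools import product
from typing import Any, Dict, List, Tuple


def product_with_sort(record: Dict[str, Any],
                      columns: Dict[str, List[Any]],
                      sort_spec: List[Tuple[str, bool]]) -> List[Dict[str, Any]]:
    """Build the product of the column values, then order the new tuples.

    columns maps an output field name to its evaluated values;
    sort_spec lists (field, descending) pairs, most significant key first.
    """
    names = list(columns)
    rows = [{**record, **dict(zip(names, combo))}
            for combo in product(*columns.values())]
    # Python's sort is stable, so apply keys from least to most significant.
    for field, descending in reversed(sort_spec):
        rows.sort(key=lambda r, f=field: r[f], reverse=descending)
    return rows


doc = {"id": 1}
rows = product_with_sort(
    doc,
    {"newField": ["s1", "s2"], "fieldB": ["g2", "g1"]},
    [("fieldB", False), ("newField", True)],  # fieldB ASC, newField DESC
)
order = [(r["fieldB"], r["newField"]) for r in rows]
```

Only the tuples generated from this one source tuple are sorted, so the cost is bounded by the size of the product rather than the size of the stream.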
        
        dpgove Dennis Gove added a comment -

        Also, I'm in search of a better parameter name than "productSort". I want it to imply that this only sorts the new tuples and does not re-sort the entire stream.
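
The distinction between ordering the new tuples and re-sorting the whole stream can be shown with a small Python sketch. The function name `cartesian_sorted` is hypothetical, and the example handles a single field only.

```python
from typing import Any, Dict, Iterable, Iterator


def cartesian_sorted(stream: Iterable[Dict[str, Any]], field: str,
                     descending: bool = False) -> Iterator[Dict[str, Any]]:
    """Expand each tuple on field, ordering only that tuple's own values."""
    for record in stream:
        raw = record[field]
        values = raw if isinstance(raw, list) else [raw]
        # The sort touches one tuple's values; the stream is never buffered.
        for value in sorted(values, reverse=descending):
            yield {**record, field: value}


docs = [{"id": 2, "tags": ["b", "a"]}, {"id": 1, "tags": ["d", "c"]}]
out = [(r["id"], r["tags"]) for r in cartesian_sorted(docs, "tags")]
```

The source tuples keep their incoming order (id 2 before id 1), while the values within each source tuple come out ascending.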

        joel.bernstein Joel Bernstein added a comment - - edited

        We could skip the productSort altogether and rely on the sort() expression.

        Patch looks great. Looks like we just need to implement a few cartesian evaluators. Regex might be the easiest place to start as it doesn't rely on an analyzer.

        joel.bernstein Joel Bernstein added a comment -

        For the read() implementation I think each cartesian evaluator's product should be based on the original tuple, not on the tuples created by the preceding cartesian evaluators. Otherwise the final product will be hard to understand.

        dpgove Dennis Gove added a comment -

        My concern with relying on sort() is that it requires reading all tuples before doing the sort. If we're just providing a way to order the generated tuples I think using a sort() stream would be too costly.

        I agree that the result of each evaluator will be based on the original tuple. But if multiple evaluators, e1 and e2, are used then the resulting tuples will look like

        {
          fieldA : e1[0],
          fieldB : e2[0],
          <other fields> 
        },
        {
          fieldA : e1[0],
          fieldB : e2[1],
          <other fields> 
        },
        {
          fieldA : e1[0],
          fieldB : e2[2],
          <other fields> 
        },
        {
          fieldA : e1[1],
          fieldB : e2[0],
          <other fields> 
        },
        {
          fieldA : e1[1],
          fieldB : e2[1],
          <other fields> 
        },
        {
          fieldA : e1[1],
          fieldB : e2[2],
          <other fields> 
        },
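
The ordering shown above, with every evaluator applied to the original tuple and the last evaluator varying fastest, matches what `itertools.product` yields in Python. The sketch below is an illustration only; `cartesian_product` and the lambda evaluators are assumptions for the example, not the code from the patch.

```python
from itertools import product
from typing import Any, Callable, Dict, Iterator, List

Record = Dict[str, Any]
Evaluator = Callable[[Record], List[Any]]


def cartesian_product(record: Record,
                      evaluators: Dict[str, Evaluator]) -> Iterator[Record]:
    """Yield one tuple per combination of evaluator results."""
    names = list(evaluators)
    # Every evaluator sees the ORIGINAL tuple, never another evaluator's output.
    value_lists = [evaluators[name](record) for name in names]
    # product() advances the rightmost list fastest: e1[0] pairs with all of e2 first.
    for combo in product(*value_lists):
        yield {**record, **dict(zip(names, combo))}


doc = {"id": 7, "a": "x y", "b": "1 2 3"}
evaluators = {
    "fieldA": lambda r: r["a"].split(),  # plays the role of e1
    "fieldB": lambda r: r["b"].split(),  # plays the role of e2
}
combos = [(r["fieldA"], r["fieldB"]) for r in cartesian_product(doc, evaluators)]
```

The other fields of the source tuple (`id`, `a`, `b` here) are carried into every generated tuple unchanged.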
        
        dpgove Dennis Gove added a comment -

        Includes an implementation of read().

        Also, updates FieldEvaluator to support getting multi-valued fields from the tuple. For simplicity in all evaluators, if FieldEvaluator finds an Object[] (object array) it will convert that into an ArrayList (preserving value order). This allows us to only have to check for Collection in evaluators or streams and not have to worry about object arrays.

        I don't have any tests yet so I'm crossing my fingers the logic is playing out as I expect it to.

        dpgove Dennis Gove added a comment -

        Fixes a missed case in FieldEvaluator.

        Now this turns both object arrays and Iterables which are not lists into ArrayLists.
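
A Python analogue of this normalization step might look like the sketch below. It is not the Java FieldEvaluator code; the function name `normalize_field_value` is hypothetical, and the special handling of strings is a Python-specific detail (strings are iterable in Python but count as single values here).

```python
from collections.abc import Iterable
from typing import Any


def normalize_field_value(value: Any) -> Any:
    """Return lists unchanged, copy other iterables into a list, leave scalars alone."""
    if isinstance(value, list):
        return value
    if isinstance(value, (str, bytes)):
        # Text is a single value, not a collection of characters.
        return value
    if isinstance(value, Iterable):
        return list(value)  # copies in iteration order
    return value


from_tuple = normalize_field_value(("a", "b"))
from_generator = normalize_field_value(x * 2 for x in [1, 2])
scalar = normalize_field_value(5)
```

With this in place, callers only need a single `isinstance(value, list)` check to decide whether a field is multi-valued.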

        dpgove Dennis Gove added a comment - - edited

        Tests added and passing.

        This does not add any additional evaluators. I think those can be added in other tickets. All evaluators are supported by this stream, so anything you think to add (regex matching, sentence creation, etc.) will work. The stream works with both multi-valued and single-valued fields, in that it treats a single-valued field as a collection with a single item.

        dpgove Dennis Gove added a comment -

        I think this is ready to go. I've decided to be explicit and register it under the function name 'cartesianProduct'.

        Full suite of tests and precommit pass.

        jira-bot ASF subversion and git services added a comment -

        Commit 9738d34fb130924d144c489212c3cc8b915a11d0 in lucene-solr's branch refs/heads/branch_6x from Dennis Gove
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9738d34 ]

        SOLR-10292: Adds CartesianProductStream to turn multivalued fields into multiple tuples

        jira-bot ASF subversion and git services added a comment -

        Commit 92297b58605104106b5b31d3dae5c2daed1886ba in lucene-solr's branch refs/heads/master from Dennis Gove
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=92297b5 ]

        SOLR-10292: Adds CartesianProductStream to turn multivalued fields into multiple tuples

        varunthacker Varun Thacker added a comment -

        Hi Dennis,

        Can we close out this issue with "Fix Version" as 6.6?
        Secondly, on master shouldn't this entry be under "6.6" instead of "7.0.0"? It's fine on branch_6x.

        jira-bot ASF subversion and git services added a comment -

        Commit a622568979ed0b84fe40174fe8b219599c15b72c in lucene-solr's branch refs/heads/master from Dennis Gove
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a622568 ]

        SOLR-10292: Moves new feature description in solr/CHANGES.txt to the correct version

        dpgove Dennis Gove added a comment -

        Varun Thacker, thank you. I've corrected this.


          People

          • Assignee:
            Unassigned
            Reporter:
            joel.bernstein Joel Bernstein