  1. Solr
  2. SOLR-7584

Add Joins to the Streaming API and Streaming Expressions

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: SolrJ
    • Labels:

      Description

      Add InnerJoinStream, LeftOuterJoinStream, and supporting classes to the Streaming API to allow for joining between sub-streams.

      At its most basic, it would look something like this:

      innerJoin(
        search(collection1, q=*:*, fl="fieldA, fieldB, fieldC", ...),
        search(collection2, q=*:*, fl="fieldA, fieldD, fieldE", ...),
        on="fieldA=fieldA"
      )
      

      or with multi-field on clauses

      innerJoin(
        search(collection1, q=*:*, fl="fieldA, fieldB, fieldC", ...),
        search(collection2, q=*:*, fl="fieldA, fieldD, fieldE", ...),
        on="fieldA=fieldA, fieldB=fieldD"
      )
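
      As a rough illustration of how a multi-field on clause could be broken into left/right field pairs, here is a minimal Java sketch. It is not the parser from the patch; the class name OnClauseSketch, the FieldPair type, and the strict "left=right" validation are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

class OnClauseSketch {
    /** One left=right field pairing taken from an on clause (hypothetical type). */
    static final class FieldPair {
        final String left;
        final String right;

        FieldPair(String left, String right) {
            this.left = left;
            this.right = right;
        }

        @Override
        public String toString() {
            return left + "=" + right;
        }
    }

    /** Parses a clause such as "fieldA=fieldA, fieldB=fieldD" into ordered pairs. */
    static List<FieldPair> parse(String on) {
        List<FieldPair> pairs = new ArrayList<>();
        for (String part : on.split(",")) {
            String[] sides = part.split("=");
            // Each comma-separated part must be exactly "left=right" with both names present.
            if (sides.length != 2 || sides[0].trim().isEmpty() || sides[1].trim().isEmpty()) {
                throw new IllegalArgumentException("Invalid on clause part: '" + part.trim() + "'");
            }
            pairs.add(new FieldPair(sides[0].trim(), sides[1].trim()));
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(parse("fieldA=fieldA, fieldB=fieldD"));
    }
}
```

      Keeping the pairs in clause order matters for a merge join, since the sub-streams would need to be sorted on those fields in the same order.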
      

      I'd also like to support the option of doing a hash join instead of the default merge join, but I haven't yet figured out the best way to express that. I'd like to let the user tell us which sub-stream should be hashed (the least-cost one).
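
      To make the hash-join idea concrete, here is a minimal Java sketch in which the caller chooses which side is hashed. It is only an illustration under stated assumptions: tuples are modeled as plain Maps, the class and method names are hypothetical, and nothing here reflects how the Streaming API expresses or implements it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HashJoinSketch {
    /** Builds a tuple from alternating field names and values (helper for this sketch). */
    static Map<String, Object> tuple(Object... keyValues) {
        Map<String, Object> t = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            t.put((String) keyValues[i], keyValues[i + 1]);
        }
        return t;
    }

    /**
     * Inner join: the hashed side is read fully into memory and indexed by key,
     * then the other side is streamed past it. Neither input has to be sorted.
     */
    static List<Map<String, Object>> innerHashJoin(
            Iterable<Map<String, Object>> streamed, String streamedField,
            Iterable<Map<String, Object>> hashed, String hashedField) {
        Map<Object, List<Map<String, Object>>> index = new HashMap<>();
        for (Map<String, Object> t : hashed) {
            Object key = t.get(hashedField);
            if (key != null) {
                index.computeIfAbsent(key, k -> new ArrayList<>()).add(t);
            }
        }
        List<Map<String, Object>> joined = new ArrayList<>();
        for (Map<String, Object> t : streamed) {
            List<Map<String, Object>> matches = index.get(t.get(streamedField));
            if (matches == null) {
                continue; // no hashed tuple shares this key
            }
            for (Map<String, Object> match : matches) {
                Map<String, Object> out = new LinkedHashMap<>(match);
                out.putAll(t); // assumption: streamed-side values win on field-name clashes
                joined.add(out);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> big = Arrays.asList(
                tuple("fieldA", 2, "fieldB", "x"), tuple("fieldA", 1, "fieldB", "y"));
        List<Map<String, Object>> small = Arrays.asList(tuple("fieldA", 1, "fieldD", "p"));
        System.out.println(innerHashJoin(big, "fieldA", small, "fieldA"));
    }
}
```

      Hashing the smaller (least-cost) side keeps the in-memory index small, which is why letting the user pick that side is attractive.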

      Also, I've been thinking about field aliasing and might want to add a SelectStream that lets us limit the fields coming out and rename them.
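
      The effect such a select step would have on a single tuple can be sketched as follows. This is a hypothetical illustration only: tuples are plain Maps, and the class name and the "source name to output name" alias map are assumptions, not the eventual SelectStream API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SelectSketch {
    /**
     * Keeps only the fields listed as keys in aliases and emits each one
     * under its mapped output name. Fields missing from the tuple are skipped.
     */
    static Map<String, Object> select(Map<String, Object> tuple, Map<String, String> aliases) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> alias : aliases.entrySet()) {
            if (tuple.containsKey(alias.getKey())) {
                out.put(alias.getValue(), tuple.get(alias.getKey()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> tuple = new LinkedHashMap<>();
        tuple.put("fieldA", 1);
        tuple.put("fieldB", "x");
        tuple.put("fieldC", "dropped");

        Map<String, String> aliases = new LinkedHashMap<>();
        aliases.put("fieldA", "id");     // rename fieldA to id
        aliases.put("fieldB", "fieldB"); // keep fieldB under its own name

        System.out.println(select(tuple, aliases));
    }
}
```

      Renaming before a join is one way to avoid field-name clashes between the two sides.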

      Depends on SOLR-7554

      1. SOLR-7584.patch
        55 kB
        Dennis Gove
      2. SOLR-7584.patch
        54 kB
        Dennis Gove
      3. SOLR-7584.patch
        50 kB
        Dennis Gove
      4. SOLR-7584.patch
        51 kB
        Dennis Gove
      5. SOLR-7584.patch
        51 kB
        Dennis Gove
      6. SOLR-7584.patch
        41 kB
        Dennis Gove

          Activity

          dpgove Dennis Gove added a comment - edited

          Adds abstract JoinStream to support joins of N sub-streams.
          Adds abstract BiJoinStream to limit JoinStream to 2 sub-streams, left and right.
          Adds concrete InnerJoinStream with support for merge join.

          Does not handle hash joins.
          Uses aliasing concept already available in CloudSolrStream.

          Still work to be done.

          joel.bernstein Joel Bernstein added a comment - edited

          Syntax looks really good.

          Had a brief look at the implementation to check out the outer join syntax. Looks like you may have left the LeftOuterJoinStream out of the patch?

          dpgove Dennis Gove added a comment -

          That's right. LeftOuterJoin wasn't included in the first version of the patch. At the moment the patch includes changes to a set of supporting classes and adds inner join. Left outer join isn't ready yet. I expect the expression syntax to be the same (two streams with an on clause) and the implementation to be fairly similar to inner join, except that a left-side record is returned even when no right-side record matches it.
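
          A minimal sketch of that difference, not code from the patch: a merge join over two lists already sorted ascending on their join fields, where a leftOuter flag decides whether an unmatched left tuple is still emitted. The class name, the Map-based tuples, the non-null Comparable keys, and the choice to let right-side values overwrite on field-name clashes are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MergeJoinSketch {
    /** Builds a tuple from alternating field names and values (helper for this sketch). */
    static Map<String, Object> tuple(Object... keyValues) {
        Map<String, Object> t = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            t.put((String) keyValues[i], keyValues[i + 1]);
        }
        return t;
    }

    @SuppressWarnings("unchecked")
    private static int compare(Object a, Object b) {
        return ((Comparable<Object>) a).compareTo(b);
    }

    /** Joins two inputs that are both sorted ascending on their join field. */
    static List<Map<String, Object>> join(
            List<Map<String, Object>> left, String leftField,
            List<Map<String, Object>> right, String rightField,
            boolean leftOuter) {
        List<Map<String, Object>> out = new ArrayList<>();
        int r = 0;
        for (Map<String, Object> l : left) {
            Object key = l.get(leftField);
            // Skip right tuples whose key sorts before the current left key.
            while (r < right.size() && compare(right.get(r).get(rightField), key) < 0) {
                r++;
            }
            // Emit one joined tuple per right tuple with an equal key. r itself is
            // not advanced, so a repeated left key sees the same right group again.
            boolean matched = false;
            for (int m = r; m < right.size() && compare(right.get(m).get(rightField), key) == 0; m++) {
                Map<String, Object> joined = new LinkedHashMap<>(l);
                joined.putAll(right.get(m));
                out.add(joined);
                matched = true;
            }
            // The only left-outer-specific step: keep the left tuple on its own.
            if (!matched && leftOuter) {
                out.add(new LinkedHashMap<>(l));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> left = Arrays.asList(tuple("fieldA", 1), tuple("fieldA", 2));
        List<Map<String, Object>> right = Arrays.asList(tuple("fieldA", 1, "fieldD", "p"));
        System.out.println(join(left, "fieldA", right, "fieldA", true));
    }
}
```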

          dpgove Dennis Gove added a comment -

          Adds LeftOuterJoinStream to support left outer joins, with tests (work done by Corey Wu).
          Moves some functions from InnerJoinStream up to parent classes, as they are shared with LeftOuterJoinStream.

          dpgove Dennis Gove added a comment -

          Missed a single line in my diff that corrected a throw statement. Sorry for the double upload.

          dpgove Dennis Gove added a comment -

          Recreated patch off current trunk. Previous patch was a little outdated.

          sharathrayapati Nagasharath added a comment -

          Does this support join on faceting as well?

          Can we apply functions like sum and avg on the joined data?

          dpgove Dennis Gove added a comment - edited

          This supports joining any incoming set of streams. If you have a FacetStream instance (SOLR-7903) then you could absolutely join it with some other stream instance.

          Because the join currently uses a merge-join style, the incoming streams must be sorted in a similar order. That said, a hash-join style can be added relatively easily, in which case the ordering requirement goes away. I think a hash-join would make a lot of sense for a FacetStream (or really any kind of aggregation stream).
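
          Since a merge join silently misses matches when an input is out of order, a precondition check can be useful while experimenting. The following is a hypothetical helper, not part of the patch; it assumes Map-based tuples with non-null Comparable keys and ascending order on a single field.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SortOrderCheckSketch {
    /** Builds a tuple from alternating field names and values (helper for this sketch). */
    static Map<String, Object> tuple(Object... keyValues) {
        Map<String, Object> t = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            t.put((String) keyValues[i], keyValues[i + 1]);
        }
        return t;
    }

    /**
     * Returns the index of the first tuple whose key is smaller than its
     * predecessor's key, or -1 if the list is sorted ascending on the field.
     */
    @SuppressWarnings("unchecked")
    static int firstOutOfOrder(List<Map<String, Object>> tuples, String field) {
        for (int i = 1; i < tuples.size(); i++) {
            Comparable<Object> previous = (Comparable<Object>) tuples.get(i - 1).get(field);
            if (previous.compareTo(tuples.get(i).get(field)) > 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> tuples = Arrays.asList(
                tuple("fieldA", 1), tuple("fieldA", 3), tuple("fieldA", 2));
        System.out.println(firstOutOfOrder(tuples, "fieldA"));
    }
}
```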

          The result of the join is just another stream so you can then feed that into any other stream for further processing (including aggregation for functions like sum and avg).

          dpgove Dennis Gove added a comment -

          Part of this ticket is a change in comparators and equalitors to support differing field names on either side of the comparison (i.e., fieldA = fieldB). Because of changes that have come into trunk since this patch was created, I had to propagate those changes to a couple of other files.

          Note, I originally included this change in SOLR-7669 but realized today that it's actually necessary in this patch. Here's me regretting the decision to not create a separate ticket for the equalitor/comparator changes but this patch does also add support for distributed joins so there's that. Either way, description of change is below.

          This required a couple of changes in the SQL and FacetStream areas related to FieldComparator. FieldComparator has been changed to support different field names on the left and right sides. The SQL and FacetStream areas use FieldComparator for sorting (a totally valid use case) but expect the left and right field names to be equal. The changes I made validate that assumption.
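
          The shape of that change can be sketched roughly as below. This is not the actual FieldComparator from the patch; the class name, the Map-based tuples, and the sort helper that rejects differing field names are assumptions used to illustrate the idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TwoFieldComparatorSketch implements Comparator<Map<String, Object>> {
    final String leftField;
    final String rightField;

    TwoFieldComparatorSketch(String leftField, String rightField) {
        this.leftField = leftField;
        this.rightField = rightField;
    }

    /** Builds a tuple from alternating field names and values (helper for this sketch). */
    static Map<String, Object> tuple(Object... keyValues) {
        Map<String, Object> t = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            t.put((String) keyValues[i], keyValues[i + 1]);
        }
        return t;
    }

    /** True when both sides read the same field, as sorting a single stream requires. */
    boolean hasSingleFieldName() {
        return leftField.equals(rightField);
    }

    /** Compares leftField of the left tuple with rightField of the right tuple. */
    @Override
    @SuppressWarnings("unchecked")
    public int compare(Map<String, Object> left, Map<String, Object> right) {
        Comparable<Object> leftValue = (Comparable<Object>) left.get(leftField);
        return leftValue.compareTo(right.get(rightField));
    }

    /** Sorts one stream's tuples, validating that the comparator uses a single field name. */
    static void sort(List<Map<String, Object>> tuples, TwoFieldComparatorSketch comparator) {
        if (!comparator.hasSingleFieldName()) {
            throw new IllegalArgumentException(
                    "Sorting needs equal field names, got " + comparator.leftField
                            + " and " + comparator.rightField);
        }
        tuples.sort(comparator);
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows =
                new ArrayList<>(Arrays.asList(tuple("fieldA", 3), tuple("fieldA", 1)));
        sort(rows, new TwoFieldComparatorSketch("fieldA", "fieldA"));
        System.out.println(rows);
    }
}
```

          A dedicated single-field comparator would move this check from call time to construction time.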

          In the future I think I may circle back around and create a new FieldComparator with a single field name so that on construction that assumption can be enforced.

          All tests pass.

          dpgove Dennis Gove added a comment -

          Could you describe your use-case for joining on facets? I can imagine that a HashJoin (SOLR-8188) would be good for something like that because it removes the sort requirement.

          Yes, you can apply functions like sum and average on the joined data by wrapping the resulting joined stream in a RollupStream and using metrics.
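
          To illustrate what a rollup over joined tuples computes, here is a minimal sketch that groups by one field and reports a sum and an average. It is not RollupStream itself; the class name, the Map-based tuples, the in-memory grouping, and the output field names are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RollupSketch {
    /** Builds a tuple from alternating field names and values (helper for this sketch). */
    static Map<String, Object> tuple(Object... keyValues) {
        Map<String, Object> t = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            t.put((String) keyValues[i], keyValues[i + 1]);
        }
        return t;
    }

    /** Groups tuples by the over field and computes sum and avg of a numeric field. */
    static List<Map<String, Object>> rollup(
            List<Map<String, Object>> tuples, String over, String metricField) {
        // Per group: element 0 holds the running sum, element 1 the tuple count.
        Map<Object, double[]> groups = new LinkedHashMap<>();
        for (Map<String, Object> t : tuples) {
            double[] acc = groups.computeIfAbsent(t.get(over), k -> new double[2]);
            acc[0] += ((Number) t.get(metricField)).doubleValue();
            acc[1] += 1;
        }
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map.Entry<Object, double[]> group : groups.entrySet()) {
            Map<String, Object> row = new LinkedHashMap<>();
            row.put(over, group.getKey());
            row.put("sum(" + metricField + ")", group.getValue()[0]);
            row.put("avg(" + metricField + ")", group.getValue()[0] / group.getValue()[1]);
            out.add(row);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> joined = Arrays.asList(
                tuple("fieldA", 1, "fieldD", 2),
                tuple("fieldA", 1, "fieldD", 4),
                tuple("fieldA", 2, "fieldD", 10));
        System.out.println(rollup(joined, "fieldA", "fieldD"));
    }
}
```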

          dpgove Dennis Gove added a comment -

          Rebased against current trunk. A couple of comment changes. All tests pass.

          jira-bot ASF subversion and git services added a comment -

          Commit 1713753 from dpgove@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1713753 ]

          SOLR-7584: Adds Inner and LeftOuter Joins to the Streaming API and Streaming Expressions (Dennis Gove, Corey Wu)


            People

            • Assignee:
              dpgove Dennis Gove
              Reporter:
              dpgove Dennis Gove
            • Votes:
              4
              Watchers:
              13
