Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-8925

Add gatherNodes Streaming Expression to support breadth first traversals

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 6.1
    • Component/s: None
    • Labels:
      None

      Description

      The gatherNodes Streaming Expression is a flexible general purpose breadth first graph traversal. It uses the same parallel join under the covers as (SOLR-8888) but is much more generalized and can be used for a wide range of use cases.

      Sample syntax:

      
       gatherNodes(friends,
                   gatherNodes(friends,
                               search(articles, q=“body:(queryA)”, fl=“author”),
                               walk ="author->user”,
                               gather="friend"),
                   walk=“friend->user”,
                   gather="friend",
                   scatter=“branches, leaves”)
      

      The expression above is evaluated as follows:

      1) The inner search() expression is evaluated on the articles collection, emitting a Stream of Tuples with the author field populated.
      2) The inner gatherNodes() expression reads the Tuples form the search() stream and traverses to the friends collection by performing a distributed join between articles.author and friends.user field. It gathers the value from the friend field during the join.
      3) The inner gatherNodes() expression then emits the friend Tuples. By default the gatherNodes function emits only the leaves which in this case are the friend tuples.
      4) The outer gatherNodes() expression reads the friend Tuples and Traverses again in the "friends" collection, this time performing the join between friend Tuples emitted in step 3. This collects the friend of friends.
      5) The outer gatherNodes() expression emits the entire graph that was collected. This is controlled by the "scatter" parameter. In the example the root nodes are the authors, the branches are the author's friends and the leaves are the friend of friends.

      This traversal is fully distributed and cross collection.

      Aggregations are also supported during the traversal. This can be useful for making recommendations based on co-occurance counts: Sample syntax:

      top(
            gatherNodes(baskets,
                        search(baskets, q=“prodid:X”, fl=“basketid”, rows=“500”, sort=“random_7897987 asc”),
                        walk =“basketid->basketid”,
                        gather=“prodid”,
                        fl=“prodid, price”,
                        count(*),
                        avg(price)),
            n=4,
            sort=“count(*) desc, avg(price) asc”)
      

      In the expression above, the inner search() function searches the basket collection for 500 random basketId's that have the prodid X.

      gatherNodes then traverses the basket collection and gathers all the prodid's for the selected basketIds.
      It also aggregates the counts and average price for each productid collected. The count reflects the co-occurance count for each prodid gathered and prodid X. The outer top expression selects the top 4 prodid's emitted from gatherNodes, based the co-occurance count and avg price.

      Like all streaming expressions the gatherNodes expression can be combined with other streaming expressions. For example the following expression uses a hashJoin to intersect the network of friends rooted to authors found with different queries:

      hashInnerJoin(
                            gatherNodes(friends,
                                        gatherNodes(friends,
                                                    search(articles, q=“body:(queryA)”, fl=“author”),
                                                    walk ="author->user”,
                                                    gather="friend"),
                                        walk=“friend->user”,
                                        gather="friend",
                                        scatter=“branches, leaves”),
                             gatherNodes(friends,
                                        gatherNodes(friends,
                                                    search(articles, q=“body:(queryB)”, fl=“author”),
                                                    walk ="author->user”,
                                                    gather="friend"),
                                        walk=“friend->user”,
                                        gather="friend",
                                        scatter=“branches, leaves”),
                            on=“friend”
               )
      
      1. SOLR-8925.patch
        51 kB
        Joel Bernstein
      2. SOLR-8925.patch
        49 kB
        Joel Bernstein
      3. SOLR-8925.patch
        46 kB
        Joel Bernstein
      4. SOLR-8925.patch
        45 kB
        Joel Bernstein
      5. SOLR-8925.patch
        38 kB
        Joel Bernstein
      6. SOLR-8925.patch
        30 kB
        Joel Bernstein
      7. SOLR-8925.patch
        29 kB
        Joel Bernstein
      8. SOLR-8925.patch
        24 kB
        Joel Bernstein

        Issue Links

          Activity

          Hide
          dpgove Dennis Gove added a comment - - edited

          I like this. Just a couple of questions.

          1. What does this do with duplicate nodes? ie, overlapping friend networks. Will it prune those out, show the node twice, mark a node has having multiple sources?

          2. When using the scatter parameter will the nodes be marked as which group they fall into? What if a node falls into multiple groups (kinda related to #1 above)?

          3. Will a node include information about its source, ie - why it's included in a graph?

          4. If gatherNodes is doing a 'join' between friends and articles I'd expect the tuple to be a join of the tuple found in articles and the tuple found in friends. But if "The inner gatherNodes() expression then emits the friend Tuples" I believe this is more of an intersect. Ie, give me tuples in friends which also appear in articles, using the author->user equalitor. Though I guess it would be returning tuples from both the left and right streams whereas a standard intersect only returns tuples from the left stream. That said, it's not joining those tuples together.

          5. What could one do if they wished to build a graph using a subset of data in friends collection? Can they apply a filter on friends as part of the gatherNodes function? Perhaps they could be allowed to add fq filters.

          Show
          dpgove Dennis Gove added a comment - - edited I like this. Just a couple of questions. 1. What does this do with duplicate nodes? ie, overlapping friend networks. Will it prune those out, show the node twice, mark a node has having multiple sources? 2. When using the scatter parameter will the nodes be marked as which group they fall into? What if a node falls into multiple groups (kinda related to #1 above)? 3. Will a node include information about its source, ie - why it's included in a graph? 4. If gatherNodes is doing a 'join' between friends and articles I'd expect the tuple to be a join of the tuple found in articles and the tuple found in friends. But if "The inner gatherNodes() expression then emits the friend Tuples" I believe this is more of an intersect. Ie, give me tuples in friends which also appear in articles, using the author->user equalitor. Though I guess it would be returning tuples from both the left and right streams whereas a standard intersect only returns tuples from the left stream. That said, it's not joining those tuples together. 5. What could one do if they wished to build a graph using a subset of data in friends collection? Can they apply a filter on friends as part of the gatherNodes function? Perhaps they could be allowed to add fq filters.
          Hide
          dpgove Dennis Gove added a comment -

          The order in the walk parameter might be confusing.

          walk ="author->user”,
          

          In other expressions where we're checking equality between two streams we use a standard of firstStreamField = secondStreamField. In gatherNodes, the field on the right appears to go with the first stream while the field on the left goes with the second stream. I'm not suggesting I don't like the author->user structure, because I do, but perhaps that the use of collection as the first param might lead to confusion.

          Show
          dpgove Dennis Gove added a comment - The order in the walk parameter might be confusing. walk ="author->user”, In other expressions where we're checking equality between two streams we use a standard of firstStreamField = secondStreamField. In gatherNodes, the field on the right appears to go with the first stream while the field on the left goes with the second stream. I'm not suggesting I don't like the author->user structure, because I do, but perhaps that the use of collection as the first param might lead to confusion.
          Hide
          joel.bernstein Joel Bernstein added a comment - - edited

          What does this do with duplicate nodes? ie, overlapping friend networks. Will it prune those out, show the node twice, mark a node has having multiple sources?

          Duplicate nodes are removed by cycle detection. All the ancestors are tracked and are returned with the Tuple in the ancestors field.

          When using the scatter parameter will the nodes be marked as which group they fall into? What if a node falls into multiple groups (kinda related to #1 above)?

          nodes will be marked with the level of the traversal and the collection they came from.

          If gatherNodes is doing a 'join' between friends and articles I'd expect the tuple to be a join of the tuple found in articles and the tuple found in friends. But if "The inner gatherNodes() expression then emits the friend Tuples" I believe this is more of an intersect. Ie, give me tuples in friends which also appear in articles, using the author->user equalitor. Though I guess it would be returning tuples from both the left and right streams whereas a standard intersect only returns tuples from the left stream. That said, it's not joining those tuples together.

          It's a join but not similar to the other joins expressions which are done with a single search for the left and right streams. This a parallel batched nested loop join. So I'm not sure it expresses quite like the other joins. You can see the implementation in the ShortestPathStream. Looking at the implementation might spark some ideas of how to express it. I'm open to ideas.

          What could one do if they wished to build a graph using a subset of data in friends collection? Can they apply a filter on friends as part of the gatherNodes function? Perhaps they could be allowed to add fq filters.

          The fq,and fl params will be supported. This will support filtering and listing/aggregating edge properties.

          Show
          joel.bernstein Joel Bernstein added a comment - - edited What does this do with duplicate nodes? ie, overlapping friend networks. Will it prune those out, show the node twice, mark a node has having multiple sources? Duplicate nodes are removed by cycle detection. All the ancestors are tracked and are returned with the Tuple in the ancestors field. When using the scatter parameter will the nodes be marked as which group they fall into? What if a node falls into multiple groups (kinda related to #1 above)? nodes will be marked with the level of the traversal and the collection they came from. If gatherNodes is doing a 'join' between friends and articles I'd expect the tuple to be a join of the tuple found in articles and the tuple found in friends. But if "The inner gatherNodes() expression then emits the friend Tuples" I believe this is more of an intersect. Ie, give me tuples in friends which also appear in articles, using the author->user equalitor. Though I guess it would be returning tuples from both the left and right streams whereas a standard intersect only returns tuples from the left stream. That said, it's not joining those tuples together. It's a join but not similar to the other joins expressions which are done with a single search for the left and right streams. This a parallel batched nested loop join. So I'm not sure it expresses quite like the other joins. You can see the implementation in the ShortestPathStream. Looking at the implementation might spark some ideas of how to express it. I'm open to ideas. What could one do if they wished to build a graph using a subset of data in friends collection? Can they apply a filter on friends as part of the gatherNodes function? Perhaps they could be allowed to add fq filters. The fq,and fl params will be supported. This will support filtering and listing/aggregating edge properties.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          I think the direction of the traversal is interesting. The direction of the traversal is to walk or traverse from author to user.

          It might make sense to not think of a traversal as a join, but as it's own operation.

          Show
          joel.bernstein Joel Bernstein added a comment - I think the direction of the traversal is interesting. The direction of the traversal is to walk or traverse from author to user. It might make sense to not think of a traversal as a join, but as it's own operation.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          First patch also contains support for aggregations on the nodes during the traversal. Needs test cases.

          Show
          joel.bernstein Joel Bernstein added a comment - First patch also contains support for aggregations on the nodes during the traversal. Needs test cases.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Patch with first very simple first test case. Shows the basic machinery working.

          Show
          joel.bernstein Joel Bernstein added a comment - Patch with first very simple first test case. Shows the basic machinery working.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Patch with count metric working.

          Show
          joel.bernstein Joel Bernstein added a comment - Patch with count metric working.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          New patch with some nice syntax improvements and more tests.

          Show
          joel.bernstein Joel Bernstein added a comment - New patch with some nice syntax improvements and more tests.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          New patch adds more test cases.

          Show
          joel.bernstein Joel Bernstein added a comment - New patch adds more test cases.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          More tests and removed debugging

          Show
          joel.bernstein Joel Bernstein added a comment - More tests and removed debugging
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Added a test with cycle detection

          Show
          joel.bernstein Joel Bernstein added a comment - Added a test with cycle detection
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Initial unit tests are coming along well. I plan to move on to manual testing with the Enron email dataset and if that looks good I think this is pretty close to being committed to trunk.

          Show
          joel.bernstein Joel Bernstein added a comment - Initial unit tests are coming along well. I plan to move on to manual testing with the Enron email dataset and if that looks good I think this is pretty close to being committed to trunk.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          New patch with an efficient TaversalIterator implementation.

          Show
          joel.bernstein Joel Bernstein added a comment - New patch with an efficient TaversalIterator implementation.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Manual testing is looking pretty good on the Enron email dataset. Traversals are reasonably fast, for example a friend of friends of friends traversal returning 6500 leaf nodes responds in 270 millisseconds. This not a huge dataset and things will slow down on large highly interconnected graphs, but that's to be expected. Also there are some big performance and scalability improvements that can be made in later releases. But I think this is getting pretty close to committable.

          Show
          joel.bernstein Joel Bernstein added a comment - Manual testing is looking pretty good on the Enron email dataset. Traversals are reasonably fast, for example a friend of friends of friends traversal returning 6500 leaf nodes responds in 270 millisseconds. This not a huge dataset and things will slow down on large highly interconnected graphs, but that's to be expected. Also there are some big performance and scalability improvements that can be made in later releases. But I think this is getting pretty close to committable.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 8659ea33d909ca76c793a778c694feea0c74af3b in lucene-solr's branch refs/heads/master from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8659ea3 ]

          SOLR-8925: Add gatherNodes Streaming Expression to support breadth first traversals

          Show
          jira-bot ASF subversion and git services added a comment - Commit 8659ea33d909ca76c793a778c694feea0c74af3b in lucene-solr's branch refs/heads/master from jbernste [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8659ea3 ] SOLR-8925 : Add gatherNodes Streaming Expression to support breadth first traversals
          Hide
          dpgove Dennis Gove added a comment -

          Joel Bernstein, do you intend to apply this to branch_6x? I'd like to apply SOLR-9009 but it includes changes in GatherNodesStream.java. I can of course pull my changes to that file out before applying but if you do intend to put this on branch_6x then I'll just wait.

          Show
          dpgove Dennis Gove added a comment - Joel Bernstein , do you intend to apply this to branch_6x? I'd like to apply SOLR-9009 but it includes changes in GatherNodesStream.java. I can of course pull my changes to that file out before applying but if you do intend to put this on branch_6x then I'll just wait.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Just replied on the RandomStream ticket, we track the discussion there.

          Show
          joel.bernstein Joel Bernstein added a comment - Just replied on the RandomStream ticket, we track the discussion there.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 9ce830d8f2a547b763999ea3790bab6a4d8727a3 in lucene-solr's branch refs/heads/master from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9ce830d ]

          SOLR-8925: Integrate the GraphTermsQuery into the gatherNodes expression

          Show
          jira-bot ASF subversion and git services added a comment - Commit 9ce830d8f2a547b763999ea3790bab6a4d8727a3 in lucene-solr's branch refs/heads/master from jbernste [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9ce830d ] SOLR-8925 : Integrate the GraphTermsQuery into the gatherNodes expression
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit d77574abbad62cdf80a8f8978ec439f8a7e6da72 in lucene-solr's branch refs/heads/branch_6x from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d77574a ]

          SOLR-8925: Add gatherNodes Streaming Expression to support breadth first traversals

          Conflicts:
          solr/core/src/java/org/apache/solr/handler/StreamHandler.java

          Show
          jira-bot ASF subversion and git services added a comment - Commit d77574abbad62cdf80a8f8978ec439f8a7e6da72 in lucene-solr's branch refs/heads/branch_6x from jbernste [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d77574a ] SOLR-8925 : Add gatherNodes Streaming Expression to support breadth first traversals Conflicts: solr/core/src/java/org/apache/solr/handler/StreamHandler.java
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit d1f32c04325f97d7fae287de154e628a81c7c98e in lucene-solr's branch refs/heads/branch_6x from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d1f32c0 ]

          SOLR-8925: Integrate the GraphTermsQuery into the gatherNodes expression

          Show
          jira-bot ASF subversion and git services added a comment - Commit d1f32c04325f97d7fae287de154e628a81c7c98e in lucene-solr's branch refs/heads/branch_6x from jbernste [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d1f32c0 ] SOLR-8925 : Integrate the GraphTermsQuery into the gatherNodes expression
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 62a28dd0c7dc8f41e43d5c37e28c968556b8e9d2 in lucene-solr's branch refs/heads/master from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=62a28dd ]

          SOLR-8986, SOLR-8925, SOLR-9027: Update CHANGES.txt

          Show
          jira-bot ASF subversion and git services added a comment - Commit 62a28dd0c7dc8f41e43d5c37e28c968556b8e9d2 in lucene-solr's branch refs/heads/master from jbernste [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=62a28dd ] SOLR-8986 , SOLR-8925 , SOLR-9027 : Update CHANGES.txt
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit df72df1c58a5884774d003eec2f71c27a4737896 in lucene-solr's branch refs/heads/branch_6x from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=df72df1 ]

          SOLR-8986, SOLR-8925, SOLR-9027: Update CHANGES.txt

          Conflicts:
          solr/CHANGES.txt

          Show
          jira-bot ASF subversion and git services added a comment - Commit df72df1c58a5884774d003eec2f71c27a4737896 in lucene-solr's branch refs/heads/branch_6x from jbernste [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=df72df1 ] SOLR-8986 , SOLR-8925 , SOLR-9027 : Update CHANGES.txt Conflicts: solr/CHANGES.txt

            People

            • Assignee:
              joel.bernstein Joel Bernstein
              Reporter:
              joel.bernstein Joel Bernstein
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development