Solr / SOLR-7543

Create GraphQuery that allows graph traversal as a query operator.

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: search
    • Labels:
      None

      Description

      I have a GraphQuery that I implemented a long time back that allows a user to specify a "startQuery" to identify which documents to start graph traversal from. It then gathers up the edge ids for those documents and optionally applies an additional filter. The query is then re-executed continually until no new edge ids are identified. I am currently hosting this code at https://github.com/kwatters/solrgraph and I would like to work with the community to get some feedback and ultimately get it committed back in as a Lucene query.

      Here's a bit more of a description of the parameters for the query / graph traversal:

      q - the initial start query that identifies the universe of documents to start traversal from.
      fromField - the field name that contains the node id
      toField - the name of the field that contains the edge id(s).
      traversalFilter - this is an additional query that can be supplied to limit the scope of graph traversal to just the edges that satisfy the traversalFilter query.
      maxDepth - integer specifying how deep the breadth first search should go.
      returnStartNodes - boolean to determine if the documents that matched the original "q" should be returned as part of the graph.
      onlyLeafNodes - boolean that filters the graph query to only return documents/nodes that have no edges.

      We identify a set of documents with "q" as any arbitrary Lucene query. It collects the values in the fromField, creates an OR query with those values, optionally applies an additional constraint from the "traversalFilter", and walks the result set until no new edges are detected. Traversal can also be stopped at N hops away, as defined with maxDepth. This is a BFS (breadth-first search) algorithm. Cycle detection is done by not revisiting the same document for edge extraction.

      This query operator does not keep track of how you arrived at the document, but only that the traversal did arrive at the document.
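For illustration, the traversal loop described above can be sketched in plain Java, with an in-memory adjacency map standing in for the fromField/toField index lookups (the class and method names here are invented, and the returnStartNodes/onlyLeafNodes options are omitted for brevity):

```java
import java.util.*;

// Simplified sketch of the BFS described above: start from a seed set, collect
// edge values, expand the frontier, and stop when no new ids appear (or when
// maxDepth hops have been taken; maxDepth < 0 means unlimited).
public class GraphTraversalSketch {

    // edges: node id -> edge ids reachable from that node (the "to" values)
    static Set<String> traverse(Map<String, List<String>> edges,
                                Set<String> startNodes, int maxDepth) {
        Set<String> visited = new HashSet<>(startNodes); // cycle detection: never revisit
        Set<String> frontier = new HashSet<>(startNodes);
        int depth = 0;
        while (!frontier.isEmpty() && (maxDepth < 0 || depth < maxDepth)) {
            Set<String> next = new HashSet<>();
            for (String node : frontier)
                for (String edge : edges.getOrDefault(node, List.of()))
                    if (visited.add(edge))   // only newly seen ids enter the frontier
                        next.add(edge);
            frontier = next;                 // "re-execute" until no new edges
            depth++;
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> edges = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("d"),
            "d", List.of("a"));              // cycle back to the start node
        // visits each of a, b, c, d exactly once despite the cycle
        System.out.println(traverse(edges, Set.of("a"), -1));
    }
}
```

The real query resolves each frontier by re-running a Lucene query that matches the collected "to" values against the "from" field, so each iteration is a search rather than a map lookup.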

        Attachments

      1. SOLR-7543.patch
        44 kB
        Yonik Seeley
      2. SOLR-7543.patch
        45 kB
        Kevin Watters

        Activity

        yseeley@gmail.com Yonik Seeley added a comment -

        Cool stuff!
        Let's just focus on the interface/semantics of the qparser - the rest is implementation detail and can be further optimized later.

        Since this sort of looks like a multi-step join, perhaps use the same convention of using the primary value (or v) as the start query?

        {!graph fromField=f1 toField=f2 ...}

        my_start_query
        Which is equivalent to

        {!graph fromField=f1 toField=f2 ... v=my_start_query}

        Actually, looking at join again, it uses "from" and "to" - should we use those or stick with "fromField" and "toField"?

        kwatters Kevin Watters added a comment -

        Hi Yonik, thanks for chiming in! Yup, you can think of this as a multi-step join. In fact, I use the graph operator with a maxDepth of 1 to implement an inner join.
        I like things to be consistent (it's easier for others to grok that way), so we can rename the fromField and the toField to be "from" and "to". When it comes to the GraphQueryParser(Plugin), I'm open to whatever the community likes and whatever is consistent with the other parsers out there.
        (I've always been a bit thrown by the !parser_name syntax, which is why I also have a client-side object model: I programmatically build up an expression, serialize it over to my custom parser, and that parser deserializes it and converts it into the appropriate Lucene query objects.) I suppose I just want to make sure that "v=my_start_query" can be any arbitrary Lucene query.
        I also still need to work up some richer examples and test cases as part of this ticket.

        yseeley@gmail.com Yonik Seeley added a comment -

        I've always been a bit thrown by the !parser_name syntax

        Yeah, but I think doing it like all the other qparsers is best for now... and then we can add support for JSON along with the rest of them later (there's a ticket open... just haven't gotten around to it yet).

        I suppose I just want to make sure that the "v=my_start_query" can be any arbitrary lucene query.

        Yep, just use the qparser framework and then the query (or queries) can be of any type parseable by another qparser (including another graph qparser for instance).

                baseParser = subQuery(localParams.get(QueryParsing.V), null);
                Query q = baseParser.getQuery();
        
        dpgove Dennis Gove added a comment - - edited

        For interface/semantics, I think this might be able to benefit from the Expression stuff recently added for streams (SOLR-7377). With that, you could do something like

        graph(root=search(collection1, q="<some query>", fl="<used fields>"), traverse=search(collection1, q="<some dynamic query>",fl="<used fields>"), on="parent.field=child.field", maxDepth=5, returnRoot=true, returnOnlyLeaf=false)
        

        This would also allow you to do other things like make use of stream merging, uniquing, etc....

        Would even allow for tree traversal across multiple collections.

        kwatters Kevin Watters added a comment -

        Interesting, Dennis, I wasn't aware of SOLR-7377; I'll have to take a bit more time to understand what that means in the context of the graph query. I'm not sure how cross-collection graph traversal will play with my implementation. The issue is, my Lucene graph query is currently local to a single shard/core. I have been chatting with Joel Bernstein about the distributed graph traversal use case, and I think there is a play for streaming aggregation there. There is one line that needs to be coordinated/synchronized across the cluster to do the distributed graph traversal; I think that's where the streaming stuff comes in.
        I like the idea of renaming "returnStartNodes" to "returnRoot" ... fewer words and hopefully more descriptive of what is happening (same for returnOnlyLeaf). Maybe the word "nodes" is redundant, and it obscures that it's really just a "document" at the end of the day.

        dpgove Dennis Gove added a comment -

        This might be crazy, but would allow a little more flexibility on what to return

        return="Root | Leaf" => would return documents that are either in the root, or a leaf.
        return="Root & Leaf" => would return documents that are in root and are leafs themselves (no children)
        return="Leaf | Children>4" => would return documents that are leaf or have more than 4 children.

        steff1193 Per Steffensen added a comment - - edited

        Sounds interesting. I can't help thinking that this will help users do only one particular graph'ish search, but there are millions of other graph'ish searches one might want to do. The solution here might be too specific. A while back I wrote a Solr indexing-backend for the Titan graph database. We can do a storage-backend as well. Putting a full-blown graph database on top of Solr (by supporting indexing- and potentially storage-backends for e.g. Titan) might be the way to go instead, so that we will not end up with lots and lots of very specific graph-search query-parsers/resolvers. And this way you will get all the other cool stuff from a full-blown graph database - e.g. I liked playing with Titan's REPL. Just a thought.

        kwatters Kevin Watters added a comment -

        Per Steffensen My mantra here relates to something I heard once: "A graph is a filter on top of your data." So, I'm offering this implementation up to solve that use case. Analytics on top of that graph would be achieved via faceting or streaming aggregation. Maybe there's something that Titan could leverage from this implementation? There are some starting plans on doing a distributed version of this query operator.

        Dennis Gove Interesting syntax. The use case of children > 4 isn't currently supported in my impl. My impl doesn't keep any history of the paths through the graph; it only has the bitset that represents the matched documents. I wanted to keep it as lean as possible. We could start keeping around additional data structures during the traversal to count, but that can get very expensive very quickly. My goal/desire here is to keep the memory usage to one bitset.

        dpgove Dennis Gove added a comment - - edited

        I'm with you on wanting to keep the memory usage as low as possible - I thought maybe you had that info hanging around already. In either case, I think this syntax might lower the bar to entry for usage, especially if people are already using streaming aggregation for other things.

        steff1193 Per Steffensen added a comment -

        "A graph is a filter on top of your data."

        Yes I completely agree!

        It is just that, as far as I understand the discussion here, it is not just the very simple basics for doing graphs on top of Solr. It seems like you design for a specific "query" that you can do with a graph database: start at some docs --> recursively follow links/edges (up to) maxDepth times --> then return all nodes/vertices you came across (or just the leaves). This is one particular graph traversal you can do, but if you just implemented the graph basics as a Titan search-backend you could perform all kinds of graph "queries" with the Titan search-tool Gremlin (https://github.com/tinkerpop/gremlin/wiki) - that is more like a full graph solution on top of Solr.

        I just realized from the Titan main page (http://thinkaurelius.github.io/titan/) that there is now some kind of integration with Solr. There wasn't when I did my implementation, but never mind, I mainly did it to understand how Titan works.

        It is not that I do not like this issue SOLR-7543. I definitely would love to see graphs on top of Solr. I am just trying to take a step back and look at a more generic approach to all this graph stuff.

        kwatters Kevin Watters added a comment -

        Hi Per Steffensen I agree we want to do all sorts of great types of graph queries. Problem is, as soon as you take the step to maintain metadata about the graph traversal, the memory requirements for such an operation can be huge.

        The way I see it is there are likely 3 things to do to close the gap:

        • Make the traversalFilter a more complex data structure (like an array), to allow different filters at different graph traversal levels.
        • Accumulate a "weight" field on the traversed edges as part of the relevancy score (currently no ranking is done).
        • Maintain the history of edges that traverse into a node.

        All of these could be considered for future functionality, but it would really take some re-thinking of how it all works. For now, having the functionality to apply the graph as a filter to the result set is the goal.

        In many cases, if you nest these graph queries, and the documents are structured properly, you should still be able to achieve the result that you desire, but we'd have to take that on a case by case basis.
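The first idea above - per-level traversal filters - could be sketched as follows in stdlib-only Java (all names here are invented; the real version would apply Lucene queries per hop, not predicates):

```java
import java.util.*;
import java.util.function.Predicate;

// Hypothetical per-level traversal filters: filterPerLevel.get(depth) gates the
// edges expanded at that depth; levels past the end of the list accept everything.
public class LevelFilterSketch {

    static Set<String> traverse(Map<String, List<String>> edges, Set<String> start,
                                List<Predicate<String>> filterPerLevel) {
        Set<String> visited = new HashSet<>(start);
        Set<String> frontier = new HashSet<>(start);
        for (int depth = 0; !frontier.isEmpty(); depth++) {
            Predicate<String> filter = depth < filterPerLevel.size()
                    ? filterPerLevel.get(depth)
                    : n -> true;             // no filter configured for this level
            Set<String> next = new HashSet<>();
            for (String node : frontier)
                for (String edge : edges.getOrDefault(node, List.of()))
                    if (filter.test(edge) && visited.add(edge))
                        next.add(edge);
            frontier = next;
        }
        return visited;
    }
}
```

The memory profile stays the same as a single traversalFilter, since only the visited set and current frontier are kept; the cost is one extra filter check per candidate edge.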

        kwatters Kevin Watters added a comment -

        OK, my initial graph query parser handles the following syntax:

        {!graph from="node_field" to="edge_field" returnRoot="true" returnOnlyLeaf="false" maxDepth=-1 traversalFilter=foo:bar}

        id:doc_8

        The above would start traversal at doc_8 and only walk nodes that have a field foo containing the value bar. This seems to be (more) consistent with the rest of the query parsers.
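Since the start query goes through the qparser framework, it can itself be any other parseable query, including another graph query. A hypothetical nested example using standard local-param dereferencing ($qq is an invented parameter name; field names reused from above):

        q={!graph from="node_field" to="edge_field" maxDepth=1 v=$qq}&qq={!graph from="node_field" to="edge_field" maxDepth=1}id:doc_8

Here the inner query would find everything one hop from doc_8, and the outer query would walk one more hop from those results.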

        joel.bernstein Joel Bernstein added a comment - - edited

        Thanks for the nice contribution Kevin Watters!

        A couple of thoughts on the discussion, particularly on the general vs. specific use case. I think this ticket covers a useful specific use case, particularly for access control. And I suspect people will find other interesting uses for this type of graph query. It works for non-distributed graph traversals, which is where we can do all kinds of low-level things to improve performance.

        But it's also an opportunity to open the discussion about the generic use case, which will be distributed graph queries. Per Steffensen mentions a Titan integration, which would be very useful. But also, as Dennis Gove mentions, Streaming Expressions provide us with an elegant framework for all kinds of parallel computing tasks. I think it's worth exploring how to model parallel graph joins using the Streaming Expression language and the Streaming API.

        yseeley@gmail.com Yonik Seeley added a comment -

        Syntax looks good Kevin, having an initial simple syntax certainly doesn't preclude adding other syntaxes (or integrations) later. Can you put up a patch?

        kwatters Kevin Watters added a comment -

        Yonik Seeley, right now it builds against 4.x. If I submit a patch, should it be done for trunk, or is a 4.x branch OK? I'm just finishing up the unit tests; either way, I hope to have a patch submitted by the end of the week.

        yseeley@gmail.com Yonik Seeley added a comment -

        You can submit a patch against 4x (or whatever you want). Before it's committed, it will need to be against trunk though.
        That doesn't necessarily mean that you have to do the work to get it on trunk... if the port is too hard or you don't have time, you can put up what you have and others may be able to help.

        kwatters Kevin Watters added a comment -

        Patch with GraphQuery / parsers / unit tests.

        yseeley@gmail.com Yonik Seeley added a comment -

        Thanks Kevin,
        I just uploaded a patch with some minor improvements, including:

        Slightly simplified some of the parameter parsing... example:
        From:
        boolean onlyLeafNodes = Boolean.valueOf(localParams.get("returnOnlyLeaf", "false"));
        To:
        boolean onlyLeafNodes = localParams.getBool("returnOnlyLeaf", false);

        Simplified some of the query handling, for instance:

            SolrParams params = getParams();
            SolrParams solrParams = SolrParams.wrapDefaults(localParams, params);
            QParser baseParser = subQuery(solrParams.get(QueryParsing.V), null);
            // grab the graph query options / defaults
            Query rootNodeQuery = baseParser.getQuery();  
        

        Was replaced with

            Query rootNodeQuery = subQuery(localParams.get(QueryParsing.V), null).getQuery();
        

        I rewrote buildFrontierQuery to use TermsQuery instead of BooleanQuery (more efficient for this use case, and no 1024 limit).

        I also marked FrontierQuery as internal and made it package protected... it's an implementation detail and it feels like
        we could get rid of it in the future.

        Unless anyone has ideas of how to improve the current interface, I think this is ready to commit! (at least to trunk) We can continue to make more optimizations to the implementation at any point.

        kwatters Kevin Watters added a comment -

        Nice improvements! The new TermsQuery is definitely a nice fit for this type of query (though that code path is only active if useAutn=false, so it doesn't do the automaton compilation).
        Looks good to me, let's roll with it!

        jira-bot ASF subversion and git services added a comment -

        Commit 1707818 from Yonik Seeley in branch 'dev/trunk'
        [ https://svn.apache.org/r1707818 ]

        SOLR-7543: basic graph traversal query

        yseeley@gmail.com Yonik Seeley added a comment -

        Committed. Thanks Kevin!


          People

          • Assignee:
            yseeley@gmail.com Yonik Seeley
          • Reporter:
            kwatters Kevin Watters
          • Votes:
            3
          • Watchers:
            13
