Lucene - Core
LUCENE-3759

Support joining in a distributed environment.

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/join
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Add two more methods in JoinUtil to support joining in a distributed manner.

      • A method to retrieve all "from" values.
      • A method to create a TermsQuery based on a set of "from" terms.

      With these two methods, distributed joining can be supported by following these steps:

      1. Retrieve the "from" values from each shard.
      2. Merge the retrieved "from" values.
      3. Create a TermsQuery based on the merged "from" terms and send this query to all shards.
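
The three steps above can be sketched in plain Java. This is only an illustration under loud assumptions: shards are modelled as in-memory lists of field maps, and the class `DistributedJoinSketch` with its methods `collectFromValues`, `mergeFromValues`, and `matchTerms` is hypothetical. None of these names are JoinUtil or Lucene API, and `matchTerms` merely stands in for running a TermsQuery on a shard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: plain collections stand in for shards; nothing here is Lucene API.
final class DistributedJoinSketch {

    // Step 1: collect the distinct values of the "from" field found on one shard.
    static Set<String> collectFromValues(List<Map<String, String>> shard, String fromField) {
        Set<String> values = new TreeSet<>();
        for (Map<String, String> doc : shard) {
            String value = doc.get(fromField);
            if (value != null) {
                values.add(value);
            }
        }
        return values;
    }

    // Step 2: merge the per-shard value sets into one de-duplicated set.
    static Set<String> mergeFromValues(List<Set<String>> perShard) {
        Set<String> merged = new TreeSet<>();
        for (Set<String> values : perShard) {
            merged.addAll(values);
        }
        return merged;
    }

    // Step 3 (stand-in for a terms query): keep the documents on one shard
    // whose "to" field value is contained in the merged term set.
    static List<Map<String, String>> matchTerms(
            List<Map<String, String>> shard, String toField, Set<String> terms) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : shard) {
            String value = doc.get(toField);
            if (value != null && terms.contains(value)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Two toy shards: child docs carry parent_id, parent docs carry only id.
        List<Map<String, String>> shard0 = List.of(
                Map.of("id", "c1", "parent_id", "p1"),
                Map.of("id", "p2"));
        List<Map<String, String>> shard1 = List.of(
                Map.of("id", "c2", "parent_id", "p2"),
                Map.of("id", "p1"),
                Map.of("id", "p3"));

        Set<String> merged = mergeFromValues(List.of(
                collectFromValues(shard0, "parent_id"),
                collectFromValues(shard1, "parent_id")));

        // Every shard receives the same merged term set.
        System.out.println("merged terms: " + merged);
        System.out.println("shard0 hits: " + matchTerms(shard0, "id", merged));
        System.out.println("shard1 hits: " + matchTerms(shard1, "id", merged));
    }
}
```

Note that the merged term set is shipped to every shard, so its size is the main cost driver in this scheme; the sketch makes no attempt to bound or compress it.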

        Activity

        Jason Rutherglen added a comment -

        +1 Nice, distributed join will be super useful.

        Alex Liu added a comment -

        Is there any performance concern?

        Colin Bartolome added a comment -

        This definitely affects Solr 4.1 and would be very helpful. I might not be able to run with shards without being able to use join queries.

        Colin Bartolome added a comment -

        Would implementing this as a TermsQuery open us up to TooManyClauses exceptions?

        Jerry Russell added a comment -

        Is there any progress on this? This seems like a very important feature that is missing from SOLR at this point.

        Erick Erickson added a comment -

        no patches == no progress.

        Joe Szymanski added a comment -

        Is anyone currently working on this? I want this feature badly enough that I plan on implementing it, but I don't want to duplicate work.

        Scott Blum added a comment -

        Joe Szymanski, did you start on this?

        Everyone else, I would love to work on this, but I'll need some high-level guidance. It's an area I haven't worked in before.

        Scott Blum added a comment -

        Stupid question, but is this obsoleted by https://issues.apache.org/jira/browse/SOLR-7584, or is this dealing with something different?

        Scott Blum added a comment -

        Ping? Anyone still care or know anything about this issue?

        Jerry Russell added a comment -

        I am still waiting for it, but only if it can perform reasonably well.

        Scott Blum added a comment -

        Request for feedback / comments: https://github.com/fullstorydev/lucene-solr/commits/scottb/fulljoin

        Basically, it's a drop-in replacement for JoinQParserPlugin, except instead of curating the "from" terms from the local index, it does a collection-wide facet query to generate the term list.

        Ashish Datta added a comment -

        Hi guys,
        Is this issue being addressed in a future release?
        For Solr/Lucene to shard horizontally and still give a unified view to queries that need to access joined data, I think this will be a BIG hit!
        I saw a similar thing in the Mongo system, where a 'queryrouter' did the same job of sending parallel query requests to multiple servers with individual shards and returned a consistent result. Though the two tools are entirely different, if the data/facet distribution and shard keying are known, this does not seem insurmountable in Lucene.
        I would be really interested in, and eager to provide, a use case from an actual production scenario where the lack of this feature is causing some grief and increasing the query code needed to compensate for it.

        Erick Erickson added a comment -

        Scott is asking a pertinent question I think. I really do wonder how much of the use-case here will be satisfied by both the Streaming Aggregations (5.x) and ParallelSQL (6.0).

        I'd really like to have the use-case laid out and show that at least most of the use-cases are served by distributed joins and not the ParallelSQL capabilities before putting too much effort here.

        Ashish Datta added a comment -

        Hello Erick,
        I would be glad to present a case for this. Let me know if it helps. If it does not sound like a useful use-case, perhaps I could use some other tool.
        Here's a quick overview of the use-case:
        The requirement I have is in analytics. Search results need to be exact and we're basically 'counting' things precisely, not approximating. The number of facets is not large, but the number of their combinations is (hence the strong case for Solr).
        The number of distinct data containers (collections) is small, but their sizes are large, and neither denormalizing nor keeping the data on a single server is a feasible option.
        Joins are therefore becoming inevitable as the data grows and starts to need many servers to store it, because of size constraints and computing efficiency.
        Right now, the only option I have is to use a glue language to collect the 'from' terms from the many shards across servers, send queries with these terms to the 'to' collection's shards on several servers again, apply rules to aggregate the results centrally, manage timeouts and other artificial issues created by this data division, and send the aggregated data on for visualisation or other processing.
        As you can see, the charm and pull of Lucene's speed is dampened by the unnecessary data complexity and the dependence on programming in a glue language, recording the number and types of shards on each server, and sending queries to the right targets. Redundancy/failover is another pain to handle, besides managing a growing number of servers.

        Everything I have described is already possible and available in Solr, except that it does not work in a distributed manner.

        Solr is a beautiful tool that could easily do everything I need if my data did not have to be distributed across machines, as it does in my case!
        If I denormalize this kind of data, I might end up making it 3-4x its size, which obviously I don't want to do.
        If Solr managed to take away this pain, it would be the ideal scalable solution for all search and analytics applications that have multiple large data sets with limits to denormalization.
        In my case, I know the data very well and have a good grip on the combinations of facets needed to configure a distributed system if it just allowed joins with true sharding.

        I really think that adding this will bring lots of distributed computing use-cases into the ambit of Solr. There's no telling how much effort it will save people like me, so that not everybody has to devise their own distributed computing management scheme when a common one could solve it for all.

        Let me know if this sounds like a reasonable use-case. Besides my own use-case, I'm sure there are a lot of people who probably don't use Solr because of this missing feature.

        PS: Sorry for getting carried away and for the long mail ;-(


          People

          • Assignee:
            Unassigned
          • Reporter:
            Martijn van Groningen
          • Votes:
            20
          • Watchers:
            33
