Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.2.1
    • Fix Version/s: Trunk
    • Component/s: search
    • Labels:
      None

      Description

      This contrib provides a place where different join implementations can be contributed to Solr. This contrib currently includes 3 join implementations. The initial patch was generated from the Solr 4.3 tag. Because of changes in the FieldCache API this patch will only build with Solr 4.2 or above.

      HashSetJoinQParserPlugin aka hjoin

      The hjoin provides a join implementation that filters results in one core based on the results of a search in another core. This is similar in functionality to the JoinQParserPlugin but the implementation differs in a couple of important ways.

      The first way is that the hjoin is designed to work with int and long join keys only. So, in order to use hjoin, int or long join keys must be included in both the to and from core.

      The second difference is that the hjoin builds memory structures that are used to quickly connect the join keys. So, the hjoin will need more memory than the JoinQParserPlugin to perform the join.

      The main advantage of the hjoin is that it can scale to join millions of keys between cores and provide sub-second response time. The hjoin should work well with up to two million results from the fromIndex and tens of millions of results from the main query.

      The hjoin supports the following features:

      1) Both lucene query and PostFilter implementations. A "cost" > 99 will turn on the PostFilter. The PostFilter will typically outperform the Lucene query when the main query results have been narrowed down.

      2) With the lucene query implementation there is an option to build the filter with threads. This can greatly improve the performance of the query if the main query index is very large. The "threads" parameter turns on threading. For example, threads=6 will use 6 threads to build the filter. This will set up a fixed thread pool with six threads to handle all hjoin requests. Once the thread pool is created the hjoin will always use it to build the filter. Threading does not come into play with the PostFilter.

      3) The size local parameter can be used to set the initial size of the hashset used to perform the join. If this is set above the number of results from the fromIndex then you can avoid hashset resizing, which improves performance.

      4) Nested filter queries. The local parameter "fq" can be used to nest a filter query within the join. The nested fq will filter the results of the join query. This can point to another join to support nested joins.

      5) Full caching support for the lucene query implementation. The filterCache and queryResultCache should work properly even with deep nesting of joins. Only the queryResultCache comes into play with the PostFilter implementation because PostFilters are not cacheable in the filterCache.
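To make the "threads" option in item 2 concrete, here is a minimal Java sketch of building a filter with a shared fixed-size thread pool. It is not the contrib's code: the class name `ThreadedFilterSketch`, the `buildFilter` method, and the use of a plain `boolean[]` and `Set<Long>` are all assumptions chosen for illustration. The real plugin works against Lucene segment readers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; none of these names come from the patch.
final class ThreadedFilterSketch {
    // One fixed pool shared by all requests, mirroring threads=6.
    // Daemon threads keep the sketch from blocking JVM shutdown.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(6, r -> {
        Thread t = new Thread(r, "join-filter");
        t.setDaemon(true);
        return t;
    });

    /** Marks each main-query doc whose "to" key is present in the "from" key set. */
    static boolean[] buildFilter(Set<Long> fromKeys, long[] toKeys, int slices) {
        if (slices < 1) {
            throw new IllegalArgumentException("slices must be >= 1");
        }
        boolean[] matches = new boolean[toKeys.length];
        int step = Math.max(1, (toKeys.length + slices - 1) / slices);
        List<Future<?>> pending = new ArrayList<>();
        for (int start = 0; start < toKeys.length; start += step) {
            final int from = start;
            final int to = Math.min(toKeys.length, start + step);
            // Each task writes a disjoint slice of the array, so no locking is needed.
            pending.add(POOL.submit(() -> {
                for (int doc = from; doc < to; doc++) {
                    matches[doc] = fromKeys.contains(toKeys[doc]);
                }
            }));
        }
        for (Future<?> f : pending) {
            try {
                f.get(); // wait for the slice; also makes its writes visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            }
        }
        return matches;
    }
}
```

The key set is only read while the tasks run, which is why an unsynchronized set is acceptable in this sketch.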

      The syntax of the hjoin is similar to the JoinQParserPlugin except that the plugin is referenced by the string "hjoin" rather than "join".

      fq={!hjoin fromIndex=collection2 from=id_i to=id_i threads=6 fq=$qq}user:customer1&qq=group:5

      The example filter query above will search the fromIndex (collection2) for "user:customer1", applying the local fq parameter to filter the results. The lucene filter query will be built using 6 threads. This query will generate a list of values from the "from" field that will be used to filter the main query. Only records from the main query where the "to" field is present in the "from" list will be included in the results.
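The semantics of the example query can be sketched in plain Java as two steps: collect the "from" keys that survive the query and the nested fq, then keep only main-query documents whose "to" key is in that set. Everything here is assumed for illustration (the `FromDoc` class, the `user` and `group` fields as plain Java fields, document ids as array positions); the plugin itself operates on Lucene indexes, not on lists.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the join semantics; not the plugin's classes.
final class HashSetJoinSketch {
    /** A stand-in for one document in the fromIndex. */
    static final class FromDoc {
        final long key;     // value of the "from" field
        final String user;
        final int group;

        FromDoc(long key, String user, int group) {
            this.key = key;
            this.user = user;
            this.group = group;
        }
    }

    /** Step 1: run the query (user) and the nested fq (group), collecting join keys. */
    static Set<Long> collectFromKeys(List<FromDoc> fromIndex, String user, int group) {
        Set<Long> keys = new HashSet<>();
        for (FromDoc doc : fromIndex) {
            if (doc.user.equals(user) && doc.group == group) {
                keys.add(doc.key);
            }
        }
        return keys;
    }

    /** Step 2: keep main-query docs (by position) whose "to" key is in the set. */
    static List<Integer> filterMain(long[] toKeys, Set<Long> fromKeys) {
        List<Integer> matches = new ArrayList<>();
        for (int doc = 0; doc < toKeys.length; doc++) {
            if (fromKeys.contains(toKeys[doc])) {
                matches.add(doc);
            }
        }
        return matches;
    }
}
```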

      The solrconfig.xml in the main query core must contain the reference to the hjoin.

      <queryParser name="hjoin" class="org.apache.solr.joins.HashSetJoinQParserPlugin"/>

      And the join contrib lib jars must be registered in the solrconfig.xml.

      <lib dir="../../../contrib/joins/lib" regex=".*\.jar" />

      After issuing the "ant dist" command from inside the solr directory the joins contrib jar will appear in the solr/dist directory. Place the solr-joins-4.*-.jar in the WEB-INF/lib directory of the Solr web application. This will ensure that the top level Solr classloader loads these classes rather than the core's classloader.

      BitSetJoinQParserPlugin aka bjoin

      The bjoin behaves exactly like the hjoin but uses a BitSet instead of a HashSet to perform the underlying join. Because of this the bjoin is much faster and can provide sub-second response times on result sets of tens of millions of records from the fromIndex and hundreds of millions of records from the main query.

      But there are limitations to how the bjoin can be used. The bjoin treats the join keys as addresses in a BitSet and uses the Lucene OpenBitSet implementation which performs very well but is not sparse. So the BitSet memory is dictated by the size of the join keys. For example a bitset with a max join key of 200,000,000 will need 25 MB of memory. For this reason the BitSet join does not support long join keys. In order to keep memory usage down the join keys should also be packed at the low end, for example from 1 to 50,000,000.
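The memory arithmetic and the lookup can be sketched with the JDK's `java.util.BitSet`. This is an assumption made to keep the example self-contained: the description says the contrib uses Lucene's OpenBitSet, and the class and method names below are hypothetical. A dense bit set needs one bit per possible key, so a max key of 200,000,000 costs roughly 200,000,000 / 8 bytes, which matches the 25 MB figure above.

```java
import java.util.BitSet;

// Hypothetical illustration; the contrib uses Lucene's OpenBitSet instead.
final class BitSetJoinSketch {
    /** Bytes needed to address keys 0..maxJoinKey with one bit per key. */
    static long denseBytes(long maxJoinKey) {
        return maxJoinKey / 8 + 1;
    }

    /** Sets the bit at each "from" join key. */
    static BitSet collect(int[] fromKeys) {
        BitSet bits = new BitSet();
        for (int key : fromKeys) {
            if (key < 0) {
                throw new IllegalArgumentException("negative join key: " + key);
            }
            bits.set(key);
        }
        return bits;
    }

    /** A main-query doc matches when the bit at its "to" key is set. */
    static boolean matches(BitSet fromBits, int toKey) {
        return toKey >= 0 && fromBits.get(toKey);
    }
}
```

Because the cost depends on the largest key rather than the number of keys, a handful of very large keys is as expensive as a full range, which is why packing keys at the low end matters.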

      Below is a sample bjoin:

      fq={!bjoin fromIndex=collection2 from=id_i to=id_i threads=6 fq=$qq}user:customer1&qq=group:5

      To register the bjoin the solrconfig.xml in the main query core must contain the reference to the bjoin.

      <queryParser name="bjoin" class="org.apache.solr.joins.BitSetJoinQParserPlugin"/>

      ValueSourceJoinParserPlugin aka vjoin

      The third implementation is the ValueSourceJoinParserPlugin aka "vjoin". This implements a ValueSource function query that can return a value from a second core based on join keys and a limiting query. The limiting query can be used to select a specific subset of data from the join core. This allows customer specific relevance data to be stored in a separate core and then joined in the main query.

      The vjoin is called using the "vjoin" function query. For example:

      bf=vjoin(fromCore, fromKey, fromVal, toKey, query)

      This example shows "vjoin" being called by the edismax boost function parameter. This example will return the "fromVal" from the "fromCore". The "fromKey" and "toKey" are used to link the records from the main query to the records in the "fromCore". The "query" is used to select a specific set of records to join with in fromCore.
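Conceptually, the vjoin behaves like a keyed lookup: the limiting query selects rows in the fromCore, their (fromKey, fromVal) pairs are loaded into memory, and each main-query document asks for the value under its toKey. The Java sketch below shows only that idea. The class name, the boxed `HashMap`, and the default of 0 for a missing key are assumptions made for illustration and are not taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the value lookup; not the plugin's classes.
final class ValueJoinSketch {
    private final Map<Long, Float> valueByKey = new HashMap<>();

    /** Called once per fromCore document selected by the limiting query. */
    void addFromDoc(long fromKey, float fromVal) {
        valueByKey.put(fromKey, fromVal);
    }

    /** Called once per scored main-query document; 0 is an assumed default. */
    float valueFor(long toKey) {
        Float value = valueByKey.get(toKey);
        return value == null ? 0f : value;
    }
}
```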

      Currently the fromKey and toKey must be longs but this will change in future versions. Like the pjoin, the "join" SolrCache is used to hold the join memory structures.

      To configure the vjoin you must register the ValueSource plugin in the solrconfig.xml as follows:

      <valueSourceParser name="vjoin" class="org.apache.solr.joins.ValueSourceJoinParserPlugin" />

      Attachments

      1. SOLR-4797-hjoin-multivaluekeys-trunk.patch
        47 kB
        Kranti Parisa
      2. SOLR-4797-hjoin-multivaluekeys-nestedJoins.patch
        47 kB
        Kranti Parisa
      3. SOLR-4787-with-testcase-fix.patch
        117 kB
        Arul Kalaipandian
      4. SOLR-4787-pjoin-long-keys.patch
        24 kB
        Kranti Parisa
      5. SOLR-4787-deadlock-fix.patch
        8 kB
        Steven Bower
      6. SOLR-4787.patch
        23 kB
        Joel Bernstein
      7. SOLR-4787.patch
        22 kB
        Joel Bernstein
      8. SOLR-4787.patch
        23 kB
        Joel Bernstein
      9. SOLR-4787.patch
        24 kB
        Joel Bernstein
      10. SOLR-4787.patch
        25 kB
        Joel Bernstein
      11. SOLR-4787.patch
        35 kB
        Joel Bernstein
      12. SOLR-4787.patch
        35 kB
        Joel Bernstein
      13. SOLR-4787.patch
        25 kB
        Joel Bernstein
      14. SOLR-4787.patch
        26 kB
        Joel Bernstein
      15. SOLR-4787.patch
        23 kB
        Joel Bernstein
      16. SOLR-4787.patch
        49 kB
        Joel Bernstein
      17. SOLR-4787.patch
        65 kB
        Joel Bernstein
      18. SOLR-4787.patch
        75 kB
        Joel Bernstein
      19. SOLR-4787.patch
        78 kB
        Joel Bernstein

        Activity

        Joel Bernstein added a comment -

        Initial pjoin and vjoin contrib.

        TODO: Tests need to be created and the vjoin has some insanity issues with the FieldCache that will eventually be solved by using on-disk DocValues.

        Jack Krupansky added a comment -

        Is there any particular reason that only integer keys can be used for the join key, as opposed to, say, string keys?

        Can the implementation be readily adapted to string join keys?

        Joel Bernstein added a comment -

        The integer keys are faster to join and take up less memory in the in-memory join structures. So, string keys won't scale nearly as well. It may be possible to make them work, but it might scale about the same as the JoinQParserPlugin. Possibly other high performance string joins can be contributed as well.

        David Smiley added a comment -

        Nice Joel! I've done a custom join query recently but it's a bit different than either of yours. I read your pjoin code in particular and it looks very good, mostly. Your BSearch class is the only thing that made me frown. Instead of putting each name-value pair into their own key class (which isn't GC friendly), I suggest you take a look at Lucene's SorterTemplate which will allow you to collect your key & value integers directly into an array each, and then sort in-place when done. I like your idea on caching the join; I should do that with mine.

        Joel Bernstein added a comment -

        Thanks David! Yeah, agreed the BSearch class is not ideal. I'll have a look at the SorterTemplate and get the integers sorted in place.

        Adrien Grand added a comment -

        Hi Joel. SorterTemplate has just been refactored into org.apache.lucene.util.Sorter (LUCENE-4946). You can have a look at Passage.sort() (https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/highlighter/src/java/org/apache/lucene/search/postingshighlight/Passage.java) to see how to use it to sort parallel arrays.
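For readers unfamiliar with the parallel-array idea discussed here, the following Java sketch sorts a key array in place while applying the same swaps to a value array, then looks up a value by binary search. It deliberately does not use Lucene's SorterTemplate or Sorter classes. It is a hand-rolled quicksort with hypothetical names, meant only to show why no per-pair objects are needed.

```java
import java.util.Arrays;

// Hypothetical illustration; not Lucene's Sorter API and not the BSearch class.
final class ParallelSortSketch {
    /** Sorts keys ascending in place, moving values in lockstep. */
    static void sort(int[] keys, int[] values) {
        if (keys.length != values.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        quicksort(keys, values, 0, keys.length - 1);
    }

    /** Binary search on the sorted keys; returns the paired value or a default. */
    static int lookup(int[] sortedKeys, int[] values, int key, int missing) {
        int i = Arrays.binarySearch(sortedKeys, key);
        return i >= 0 ? values[i] : missing;
    }

    private static void quicksort(int[] k, int[] v, int lo, int hi) {
        while (lo < hi) {
            int p = partition(k, v, lo, hi);
            // Recurse into the smaller side to bound stack depth.
            if (p - lo < hi - p) {
                quicksort(k, v, lo, p - 1);
                lo = p + 1;
            } else {
                quicksort(k, v, p + 1, hi);
                hi = p - 1;
            }
        }
    }

    private static int partition(int[] k, int[] v, int lo, int hi) {
        swap(k, v, lo + (hi - lo) / 2, hi); // use the middle element as pivot
        int pivot = k[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (k[j] < pivot) {
                swap(k, v, i, j);
                i++;
            }
        }
        swap(k, v, i, hi);
        return i;
    }

    private static void swap(int[] k, int[] v, int a, int b) {
        int t = k[a]; k[a] = k[b]; k[b] = t;
        t = v[a]; v[a] = v[b]; v[b] = t;
    }
}
```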

        Joel Bernstein added a comment -

        Hi Adrien, thanks for the information. I'll take a look at the Sorter today.

        Joel Bernstein added a comment - - edited

        Changed the BSearch class to use the SorterTemplate rather than Collections.sort. Much more efficient in-place sorting. SorterTemplate builds with Solr 4.2.1. Will need to get this working with trunk as well using the new Sorter class.

        Thanks David and Adrien for tips on this.

        Found major bug in my original logic for how segment level readers were being used between the join cores and fixed that as well.

        Joel Bernstein added a comment -

        vjoin now uses DocValues if available.

        Kevin Watters added a comment -

        Hey Joel,
        It was good to meet you at the conference last week. We talked a little bit about my GraphQuery operator. The use case of a 1 level graph traversal can accomplish a post filter join request. The caveat is that you won't know which record was joined to, only that it did satisfy the join requirement. I could contribute it here, or perhaps we could create a Graph Contrib ticket?
        Thanks,
        -Kevin

        Joel Bernstein added a comment -

        Hi Kevin,

        Great to meet you as well. Very interested in your GraphQuery. Probably best to create your own ticket and then we can link the tickets. You could do a Graph contrib and we can decide later on if we want a single join contrib.

        Thanks,
        Joel

        Joel Bernstein added a comment -

        Fixed bug with vjoin where it was including deleted docs in the join. Now vjoin only uses live docs.

        Joel Bernstein added a comment -

        Added vjoin2 and broke out several inner classes into their own classes.

        Joel Bernstein added a comment -

        Removed confusing comments from vjoin2 that were copied over from vjoin.

        Kranti Parisa added a comment -

        Hi Joel, idea looks really great. Is it too costly to implement this with "long" instead of "integers"? with longs we can support bigger numbers which could be part of the "key" fields.

        please share your ideas.

        Joel Bernstein added a comment -

        Longs will double the memory overhead but performance will be the same.

        Which join are you interested in?

        Kranti Parisa added a comment -

        Yes, I think RAM is not that costly these days. As long as performance won't be impacted too much, longs would give greater flexibility when the "keys" needs to hold big numbers.

        I am looking at JoinValueSourceParserPlugin2, this can fetch scores/values from the "fromCore" right?

        Joel Bernstein added a comment -

        Performance with longs should be as good as with ints.

        Yes, JoinValueSourceParserPlugin2 fetches values from the fromCore. This is actually the join I'm most interested in as well.

        I'll be revisiting this ticket soon to add tests, long support and probably write a blog about the JoinValueSourceParserPlugin2. In the meantime let me know if you need more information to get it running.

        Kranti Parisa added a comment -

        Great, thanks. I am sure it will help many use cases. I am trying to fetch the "id" field from the "secondCore" using the join between "parentId_secondCore" and "id_firstCore"

        I want to give it a try for the long support. do we want to have an additional param which tell us what data structures to use? and by default it uses int? what's the best way?

        Joel Bernstein added a comment -

        We can get this information from the schema by looking at the data types from the to and from field.

        Just reviewed the code and you'll have one big hurdle with getting longs to work.

        The BSearch class uses the Lucene SorterTemplate to sort a parallel array. The SorterTemplate does not have long support. So if you want to use the same approach you'll have to find another way to sort the parallel array.

        Do you have to have long support, or is it just a nice to have?

        Kranti Parisa added a comment -

        Getting info from the schema is a good approach.

        About long support, my "key" fields in the data set are longs.

        Joel Bernstein added a comment -

        If they are actual longs (greater than 2147483647) and not just defined that way in the schema, then you'll need long support.

        Otherwise we can cast them to ints in memory and use them.
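The distinction between "actual longs" and longs that merely carry int-sized values can be checked mechanically. The Java sketch below is only an illustration with hypothetical names: it tests whether a key fits in an int and narrows an array of keys, failing loudly instead of silently truncating if any key is out of range.

```java
// Hypothetical illustration; not part of the patch.
final class KeyNarrowingSketch {
    /** True when the long value can be stored in an int without loss. */
    static boolean fitsInInt(long key) {
        return key >= Integer.MIN_VALUE && key <= Integer.MAX_VALUE;
    }

    /** Narrows every key to int; Math.toIntExact throws ArithmeticException on overflow. */
    static int[] narrow(long[] keys) {
        int[] out = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            out[i] = Math.toIntExact(keys[i]);
        }
        return out;
    }
}
```

A plain `(int)` cast would wrap around for keys above 2147483647 and could make unrelated documents appear to join, which is why the sketch uses the checked conversion.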

        Kranti Parisa added a comment -

        yes, the numbers are greater than 2147483647!

        Joel Bernstein added a comment -

        I'd like to switch this to a hash join rather than using the binary search anyway. For longs it would be great to use a HashMap that works with primitive keys, like Trove. Trove is LGPL I believe so I don't think we can use it though.

        I'll look around and see if I can find another library that does what Trove does.

        Let me know if you know of another one or you've got an implementation lying around.
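As a rough picture of what a primitive-keyed map buys over `HashMap<Long, Integer>`, here is a small open-addressing long-to-int map in Java. It is a teaching sketch with hypothetical names and a fixed capacity, not a substitute for the libraries discussed in this thread: keys and values live in flat arrays, so no `Long` or `Integer` objects are allocated per entry.

```java
// Hypothetical illustration of a primitive-keyed hash map; fixed capacity, no removal.
final class LongIntMapSketch {
    private final long[] keys;
    private final int[] values;
    private final boolean[] used;
    private final int mask;
    private int size;

    LongIntMapSketch(int expectedKeys) {
        if (expectedKeys < 0 || expectedKeys > (1 << 28)) {
            throw new IllegalArgumentException("unsupported size: " + expectedKeys);
        }
        int capacity = 16;
        while (capacity < 2 * expectedKeys) {
            capacity <<= 1; // power of two, so the load factor stays at or below 0.5
        }
        keys = new long[capacity];
        values = new int[capacity];
        used = new boolean[capacity];
        mask = capacity - 1;
    }

    /** Finds the slot holding the key, or the empty slot where it would go. */
    private int slot(long key) {
        long mixed = key * 0x9E3779B97F4A7C15L; // spread the bits before masking
        int i = (int) (mixed >>> 32) & mask;
        while (used[i] && keys[i] != key) {
            i = (i + 1) & mask; // linear probing
        }
        return i;
    }

    void put(long key, int value) {
        int i = slot(key);
        if (!used[i]) {
            if (size >= keys.length / 2) {
                throw new IllegalStateException("map is full; this sketch does not resize");
            }
            used[i] = true;
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    int get(long key, int missing) {
        int i = slot(key);
        return used[i] ? values[i] : missing;
    }

    int size() {
        return size;
    }
}
```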

        Joel Bernstein added a comment -

        Colt looks promising and it's under the Cern license which is very permissive. I'll test it out.

        Kranti Parisa added a comment -

        Even I have been using Trove lib. Along with Colt, the following looks interesting too
        http://javolution.org/core-java/target/apidocs/javolution/util/FastMap.html
        https://code.google.com/p/guava-libraries/

        David Smiley added a comment -

        I suggest either FastUtil, or the similar HPPC (by Dawid Weiss here at the ASF).

        For a single class it may make sense to copy it in source form. That kinda makes me cringe but for just one source file and for something that is externally tested and unlikely to have an unknown bug, I think it's fine.

        Dawid Weiss added a comment -

        Pull a class or two in source code form from fastutil or from HPPC. These are nearly identical these days, fastutil has support for Java collections interfaces (HPPC has its own API not stemming from JUC). Both of these are equally fast.

        Dawid Weiss added a comment -

        Oh, one more thing: Colt is no longer maintained and there were a number of bugs in it. These have been fixed when Colt was ported to Apache Mahout; those classes are now part of Mahout Math.

        I'd still recommend using Fastutil or Hppc since these will be faster (by an inch but always).

        Joel Bernstein added a comment -

        Thanks for the recommendations, FastUtil looks great. I'm going to switch the JoinValueSourceParserPlugin2 over to use a hash join on long keys.

        Kranti Parisa added a comment -

        great, thanks Joel.

        Joel Bernstein added a comment - - edited

        New patch.

        JoinValueSourceParserPlugin2 has been renamed the ValueSourceJoinParserPlugin. It also now supports long join keys only. This will be changed soon to support both int and long join keys.

        A README.txt has been added which explains the setup.

        Many other changes as well that will be reflected in the ticket description.

        Joel Bernstein added a comment - - edited

        Kranti,

        The vjoin has two performance hotspots:

        1) The creation of the HashMap for the hashjoin. My testing shows that it can load 2-3 million key/pairs in around 200 milliseconds.

        2) The hash key lookup each time the vjoin is called. This will be called for each document that is scored in the result set. This should scale to support result sets into the millions. I tested with 4,000,000 results and had excellent performance.

        Kranti Parisa added a comment -

        Joel,

        Thanks for the information. I shall update the plugin and test the same.

        Joel Bernstein added a comment - - edited

        Switched from fastutil to hppc for the primitive map used for the hashjoin.

        Also, further performance testing has shown that the performance on loading the LongInt hash map was much better than I initially thought. The vjoin will scale comfortably into the millions of keys as well.

        Kranti Parisa added a comment -

        I am trying to apply this patch on branch_4x code (http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x), but I am getting the following error. Any idea?

        wget https://issues.apache.org/jira/secure/attachment/12587067/SOLR-4787.patch -O - | patch -p0 --dry-run

        --2013-06-23 17:35:32--  https://issues.apache.org/jira/secure/attachment/12587067/SOLR-4787.patch
        Resolving issues.apache.org... 140.211.11.121
        Connecting to issues.apache.org|140.211.11.121|:443... connected.
        HTTP request sent, awaiting response... 200 OK
        Length: 26641 (26K) [text/x-patch]
        Saving to: `STDOUT'

        100%[===========================================================>] 26,641 143K/s in 0.2s

        2013-06-23 17:35:32 (143 KB/s) - written to stdout [26641/26641]

        patching file solr/example/solr/collection1/conf/solrconfig.xml
        Hunk #1 FAILED at 81.
        Hunk #2 succeeded at 515 (offset -2 lines).
        Hunk #3 succeeded at 564 (offset -2 lines).
        Hunk #4 succeeded at 1535 (offset 6 lines).
        Hunk #5 succeeded at 1548 (offset 6 lines).
        Hunk #6 succeeded at 1788 (offset 6 lines).
        Hunk #7 succeeded at 1803 (offset 6 lines).
        1 out of 7 hunks FAILED -- saving rejects to file solr/example/solr/collection1/conf/solrconfig.xml.rej
        patching file solr/example/exampledocs/mem.xml
        patching file solr/contrib/joins/ivy.xml
        patching file solr/contrib/joins/src/java/org/apache/solr/joins/CacheSet.java
        patching file solr/contrib/joins/src/java/org/apache/solr/joins/SegmentBitSetCollector.java
        patching file solr/contrib/joins/src/java/org/apache/solr/joins/PostFilterJoinQParserPlugin.java
        patching file solr/contrib/joins/src/java/org/apache/solr/joins/ValueSourceJoinParserPlugin.java
        patching file solr/contrib/joins/README.txt
        patching file solr/contrib/joins/build.xml

        Joel Bernstein added a comment -

        Kranti,

        I'm going to remove the solrconfig.xml changes so the patch applies cleanly to the 4.x branch. I'll make sure the README.txt explains the changes that need to be made in solrconfig.xml. I should be able to put this up tomorrow.

        Kranti Parisa added a comment -

        cool, thanks.

        Steven Bower added a comment -

        Attached is a patch (it applies on top of the latest base patch) that fixes a deadlock that occurs when the fromCore is the same as the "current" core. I think the searcher gets decref'd twice in this case, which causes the thread leak detection in a unit test I wrote to hang waiting for this thread to die.

        Steven Bower added a comment -

        New file with proper paths and a unit test, which if run with the original PostFilterJoinQParserPlugin will produce the deadlock

        Joel Bernstein added a comment -

        Kranti,

        The latest patch will apply cleanly on Solr 4x. The README.txt file has the configuration steps that are needed to get the vjoin working.

        The pjoin is not covered in the README.txt but it will be shortly.

        Joel Bernstein added a comment -

        Steven,

        Thanks for the patch. I'll add this to the main patch this week.

        I'm currently working on a third, more scalable, join implementation.

        If anyone knows of a good sparse bitset implementation please let me know.

        Joel

        Kranti Parisa added a comment -

        Joel,

        I am able to get the fields from the other core, pretty cool. (There are minor gaps in the README.txt and the ant build scripts for including the joins and hppc jar files.)

        It seems we are limited to returning only INTs. Can we support LONGs, i.e. can we change the implementation to use LongLongOpenHashMap?

        • Kranti
        Kranti Parisa added a comment - - edited

        if we have the following case

        PARENT CORE
        ============
        ID Field
        ----------
        p1
        p2
        p3

        CHILD CORE
        ============
        ID Field | Parent ID field | ValueToFetch Field
        ---------------------------------------------------------------------
        c1 | p1 | 1234
        c2 | p1 | 3456

        Then ValueToFetch will be the last one (in this case: 3456).

        Maybe we should think about doing the following:

        1. Allow sort parameter to the vjoin function

        2. consider the Sort when executing the query

        3. for each PARENT ID, pick up the first ValueToFetch and ignore the rest (as we are specifying the sort preference, we are saying to collect the top document)

        4. sort param could have multiple values like general solr sort

        so vjoin will look like
        vjoin(joinCore, foreignKey, foreignVal, primaryKey, query, $vSort)

        &vSort=field1 desc, field2 asc

        5. if no sort param is specified, then the current implementation applies (picking the value from the last document)

        Joel, your ideas?

        Joel Bernstein added a comment -

        Kranti,

        Glad it's working for you. The one-to-many join is tricky as it is in a relational database.

        The sort idea has some implementation problems. Currently the query on the fromCore only collects the BitSet containing the matching docs. So the normal sorting collectors aren't used here. Putting them in play will cause scalability issues.

        One thing we could do is either take the MIN or MAX value. There would be a performance hit here as well but not nearly as much as the sort approach. The syntax could be

        vjoin(fromCore, fromKey, fromValue, toKey, query, MIN|MAX)

        I have no problems switching to the LongLongOpenHashMap.

        The next thing I planned to do on this ticket is support both integers and longs.

        Joel
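
        A minimal Java sketch of the MIN|MAX idea above, for illustration only. It is not code from the patch: the class MinMaxJoinValues, the Pick enum and the collapse method are hypothetical names, and plain arrays stand in for the values that would be read from the fromCore. It shows how a one-to-many join can be reduced to one value per join key without sorting.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; none of these names come from the SOLR-4787 patch.
final class MinMaxJoinValues {
    enum Pick { MIN, MAX }

    /**
     * Collapses (joinKey, value) pairs from the "from" side so that each
     * join key maps to exactly one value: the smallest or the largest seen.
     * keys[i] and values[i] belong to the same "from" document.
     */
    static Map<Long, Long> collapse(long[] keys, long[] values, Pick pick) {
        if (keys.length != values.length) {
            throw new IllegalArgumentException("keys and values must have the same length");
        }
        Map<Long, Long> result = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            if (pick == Pick.MIN) {
                result.merge(keys[i], values[i], (a, b) -> Math.min(a, b));
            } else {
                result.merge(keys[i], values[i], (a, b) -> Math.max(a, b));
            }
        }
        return result;
    }
}
```

        With the child rows from the earlier example (two rows for the same parent, values 1234 and 3456), Pick.MAX keeps 3456 and Pick.MIN keeps 1234, independent of document order.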

        Kranti Parisa added a comment -

        Joel,

        Yes, the sort implementation would be costly; I did review the code.
        I think having the option for MIN/MAX would help in a few cases, and if we pass it as null then the behavior is the same as the current implementation.

        Having LongLongOpenHashMap would really help.

        -
        Kranti

        Kranti Parisa added a comment - - edited

        Joel,

        I wanted to try implementing pjoin for the following use case.

        masterCore = 1M keys (id field, long)
        childCore = 5M documents (with parentid field, long, whose values are equal to the values of the id field in the masterCore)

        The syntax looks like this:
        http://localhost:8180/solr/masterCore/select?q=title:a&fq=({!pjoin%20fromIndex=childCore%20from=parentid%20to=id%20v=$childQ})&childQ=(fieldOne:somevalue AND fieldTwo:[1 TO 100])

        I am getting a SyntaxError, but the same syntax works with the normal "join". Any ideas?

        Also, it seems pjoin currently supports only int keys. Can you please update pjoin to allow long keys?

        -
        Kranti

        Kranti Parisa added a comment - - edited

        I reviewed the PostFilterJoinQParserPlugin code and found that it expects "fromCore" instead of "fromIndex" (as used in the normal join).

        After switching to "fromCore" I now get the expected error related to INT keys, as my keys are Long.


        • Kranti
        Kranti Parisa added a comment - - edited

        Attached the patch file (SOLR-4787-pjoin-long-keys.patch) for PostFilterJoinQParserPlugin to support LONG keys

        Joel Bernstein added a comment -

        Kranti,

        Let me know how the pjoin is performing for you. I'm going to be testing out some different data structures for the pjoin to see if I can get better performance.

        Kranti Parisa added a comment -

        Joel,

        Initial performance results look like this:
        (Restarted solr - hence no caches at the beginning)

        • with no cache: pjoin is 2-3 times faster than join
        • with cache: pjoin is 3-4 times slower than join

        I agree with your idea: we should try other data structures and maybe take a look at the caching strategy used in pjoin.

        Are the queries already running in parallel to find the intersection?

        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Joel Bernstein added a comment -

        Kranti,

        Odd that the pjoin cache is making things slower. I'll do some testing and see if I can turn up the same results.

        The join query runs first and builds a data structure in memory that is used to post filter the main query. The main query then runs and the post filter is applied.

        I'm exploring another scenario that will perform 5x faster than the current pjoin. But the tradeoff is a longer warm-up time when a new searcher is opened.

        Do you have real-time indexing requirements, or can you live with some warm-up time?
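
        The two-phase flow described in this comment (run the join query first and build an in-memory structure, then use it to post filter the main query) can be sketched in plain Java as follows. This is a simplified illustration under stated assumptions, not the patch's code: HashSetPostFilterSketch and its methods are hypothetical names, and arrays stand in for the per-document join keys that a real implementation would read from the index.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; none of these names come from the SOLR-4787 patch.
final class HashSetPostFilterSketch {

    /** Phase 1: remember the join key of every doc matched by the "from" query. */
    static Set<Long> collectFromKeys(long[] fromKeysOfMatchingDocs) {
        Set<Long> keys = new HashSet<>();
        for (long key : fromKeysOfMatchingDocs) {
            keys.add(key);
        }
        return keys;
    }

    /**
     * Phase 2: walk the main query's matches in order and keep a doc only
     * if its "to" key was seen on the "from" side.
     * toKeyByDoc[doc] is the single-valued "to" key of that document.
     */
    static List<Integer> postFilter(long[] toKeyByDoc, int[] mainQueryDocs, Set<Long> fromKeys) {
        List<Integer> kept = new ArrayList<>();
        for (int doc : mainQueryDocs) {
            if (fromKeys.contains(toKeyByDoc[doc])) {
                kept.add(doc);
            }
        }
        return kept;
    }
}
```

        The cost of phase 2 is proportional to the number of main query matches, which is consistent with the earlier remark that the PostFilter does best once the main query results have been narrowed down.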

        Kranti Parisa added a comment -

        Joel,

        Thanks for the details. Yes, we do some real-time indexing. Say, every 30 minutes we get deltas. How much warm-up time are we looking at for 5M docs?

        Also, if we have more than one pjoin in the fq, each pointing to its own core, can those pjoins be executed in parallel to find the intersection, which would then be applied as a filter for the main query?

        Steven Bower added a comment -

        When using the pJoin is relevance still applied to the results on the left side of the join?

        Joel Bernstein added a comment -

        Yes, the main query still has relevance applied. The filter join query does not impact relevance though.

        There is going to be a large set of code changes to this ticket, probably next week. The implementation of the pjoin is changing and another more scalable join will be added as well.

        The vjoin is going to remain the same for the time being.

        Kranti Parisa added a comment -

        Joel:

        That's cool, thanks. Did you ever think of supporting an "fq" parameter for the join, like how we support "v" for the query? The idea is to use the filter cache on the child core while executing the join query.

        David Smiley added a comment - - edited

        Excellent suggestion Kranti Parisa on supporting filter queries on the from-side of the join. I implemented a custom join query and it has a from-side filter query feature. It can help a great deal with performance. It wasn't that hard to add, either.

        Kranti Parisa added a comment -

        That's awesome! Yes, for sure it would be a big deal for performance, especially because it will allow us to cache the majority of the join query, which could be common to all the cases. Once it's in the filter cache, we can run additional clauses through normal queries super fast!

        I was thinking to add a new local-param to support "fq". If you already have something, do you want to share?

        David Smiley added a comment -

        I don't have the rights to it but I'll share the most pertinent line of code:

        SolrIndexSearcher.ProcessedFilter processedFilter =
                    searcher.getProcessedFilter(null, filters);
        

        See, Solr does most of the work with that one line call; everything else, such as parsing the queries is easy/common stuff.

        Joel Bernstein added a comment -

        I'll put up what I've been working on tomorrow. One of the joins I'll be adding supports the "fq" on the from side of join. These joins also support both PostFilter and traditional filter query joins.

        Joel Bernstein added a comment - - edited

        Yeah, that's the exact approach I took David.

        This was added to support nested joins but I see how the caching could really speed up the whole join.

        Kranti Parisa added a comment -

        Great, thanks!

        Kranti Parisa added a comment - - edited

        Nested Joins! That's exactly what I am trying and thought about adding fq to the solr joins.

        Using local-param in the first join:

         {!join fromIndex=a from=f1 to=f2 v=$joinQ}&joinQ=(field:123 AND _query_={another join})

        So here "another join" could be passed as an fq, and it should return results faster. The query above would then look like:

         {!join fromIndex=a from=f1 to=f2 v=$joinQ fq=$joinFQ}&joinQ=(field:123)&joinFQ={another join}
        Joel Bernstein added a comment -

        That's exactly the syntax. I'm just working out the caching details and then I'll put up the code.

        Getting the queryResultCache and FilterCache to play nicely with nested joins is tricky.

        Kranti Parisa added a comment -

        Cool, will you be able to put the code up here sometime tomorrow? I want to apply that patch and see how it performs.

        Joel Bernstein added a comment -

        Yes. I'm very close now. I'll also need to write some quick docs because these joins have a lot more functionality.

        Kranti Parisa added a comment -

        Awesome!

        Joel Bernstein added a comment -

        New patch.

        Joel Bernstein added a comment -

        Any recommendations for a good sparse bitset implementation?

        David Smiley added a comment -

        See these exciting additions to Lucene 4.5:

        • LUCENE-5084: Added new Elias-Fano encoder, decoder and DocIdSet
          implementations. (Paul Elschot via Adrien Grand)
        • LUCENE-5081: Added WAH8DocIdSet, an in-memory doc id set implementation based
          on word-aligned hybrid encoding. (Adrien Grand)
        Joel Bernstein added a comment -

        These are great additions. Not sure I can apply them here though because I'm not setting the bits in order. I'm going to need a random access sparse implementation. I've seen some but they are LGPL.
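
        To make the requirement concrete, here is a minimal Java sketch of a random-access sparse bitset: bits can be set in any order, and memory is only allocated for the regions that are actually touched. It is an assumption-laden illustration, not the structure used in the patch and not a Lucene class. SparseBitSetSketch, BLOCK_BITS and allocatedBlocks are hypothetical names, and the 4096-bit block size is an arbitrary choice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not part of Lucene/Solr or the SOLR-4787 patch.
final class SparseBitSetSketch {
    private static final int BLOCK_BITS = 4096;          // bits per block (arbitrary)
    private static final int WORDS = BLOCK_BITS / 64;    // 64-bit words per block

    // Only blocks that contain at least one set bit are allocated.
    private final Map<Integer, long[]> blocks = new HashMap<>();

    /** Sets the bit for docId; calls may arrive in any order. */
    void set(int docId) {
        if (docId < 0) {
            throw new IllegalArgumentException("docId must be non-negative");
        }
        long[] block = blocks.computeIfAbsent(docId / BLOCK_BITS, k -> new long[WORDS]);
        int offset = docId % BLOCK_BITS;
        block[offset >>> 6] |= 1L << (offset & 63);
    }

    /** Returns true if the bit for docId was set earlier. */
    boolean get(int docId) {
        if (docId < 0) {
            return false;
        }
        long[] block = blocks.get(docId / BLOCK_BITS);
        if (block == null) {
            return false;
        }
        int offset = docId % BLOCK_BITS;
        return (block[offset >>> 6] & (1L << (offset & 63))) != 0;
    }

    /** Number of blocks currently allocated, as a rough memory indicator. */
    int allocatedBlocks() {
        return blocks.size();
    }
}
```

        The hash map lookup makes each access slower than a flat bitset, so a layout like this would only pay off when the set bits are clustered in a small fraction of the doc id space.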

        Kranti Parisa added a comment -

        Joel,

        Seems there is something wrong with nested hjoin.

        Example:

        /masterCore/select?q=*:*&fq=({!hjoin fromIndex=ACore from=parentid to=id v=$aQ fq=$BJoinQ})&aQ=(f1:false)&BJoinQ=({!join fromIndex=BCore from=bid to=aid}tag:abc)

        The above query gives me 25558 results.

        When I use hjoin for both joins, as follows, it gives me 1 document:

        /masterCore/select?q=*:*&fq=({!hjoin fromIndex=ACore from=parentid to=id v=$aQ fq=$BJoinQ})&aQ=(f1:false)&BJoinQ=({!hjoin fromIndex=BCore from=bid to=aid}tag:abc)

        am I missing anything?

        Joel Bernstein added a comment -

        Are all join keys single value? Currently the hjoin and bjoin only support single value join keys.

        Kranti Parisa added a comment -

        The values in the "to" fields are single values, but the "from" fields are multi-valued. Does this mean the implementation needs significant changes to support multi-valued fields? I will take a look at the code today.

        Joel Bernstein added a comment - - edited

        Right now there aren't really efficient memory structures to perform the joins on multi-value fields. Our best bet right now would be the SORTED_SET docValues described at http://wiki.apache.org/solr/DocValues. But this is not really designed for integers or longs.

        I think the best way to handle this is to fully normalize the data so that the join keys are single valued. Basically model the data the way you would in a relational database then use the nested joins to join the normalized indexes together.

        Kranti Parisa added a comment -

        I have 5 million documents (this might increase in the future), each having 10 values in the parentid field. So if we normalize, the index would grow to 50 million documents, which would slow down indexing as well as search. I don't mind trying String keys with DocValues (SORTED_SET) if that works.

        Joel Bernstein added a comment -

        The current implementation doesn't support the multi-value fields though. So it will need to be implemented.

        Kranti Parisa added a comment -

        Joel, I modified the JoinQParserPlugin (default one) to allow FQs. It seems to be working fine, I will need to test more for caches/performance. Do you have any updates for supporting multi-valued keys with hjoin or bjoin?

        Joel Bernstein added a comment - - edited

        Kranti, the bjoin now supports multi-value fields. I'll work on getting the patch up here today.

        Kranti Parisa added a comment -

        That's cool. Once it is up, I will run some performance tests and post my findings. So it also supports FQs for nested joins and uses filter caches, right?

        Joel Bernstein added a comment -

        Kranti, new patch is up with the bjoin that supports multi-value fields. It supports nested fq and filter caching as well.

        I won't be able to work on the hjoin for a while, so feel free to port the multi-value field support to hjoin.

        Kranti Parisa added a comment -

        Yes, will first test the bjoin for multi-valued fields and then try to extend hjoin for multi-value fields.

        Kranti Parisa added a comment - - edited

        Is something missing in the patch? I am seeing a ByteArray compilation problem. Also, does bjoin need any specific field type configuration in schema.xml?

        Kranti Parisa added a comment - - edited

        I have implemented multi-value keys for hjoin using a new field, UnIvertedLongField. Sanity checks look good. I also tested with FQs (nested joins). I will run some performance tests and prepare the patch sometime tomorrow.

        Kranti Parisa added a comment -

        Patch for hjoin to support multi-value keys, both ints and longs. I created this patch against TRUNK (Solr 5.0).

        Peter Keegan added a comment -

        I'm seeing the ByteArray compilation problem, too. Where would I find this class?

        Joel Bernstein added a comment -

        Yes, I noticed the latest patch is referring to ByteArray, which isn't present. I'm going to be putting up a new patch shortly to resolve this. It will also include the latest work done on the BitSet join.

        Joel Bernstein added a comment -

        This patch resolves a compile issue in the last patch and has the latest work that was done for the bjoin. The hjoin work that Kranti has done has not yet been included.

        Peter Keegan added a comment -

        Thanks, just tried the latest patch. For identical queries in my test index, 'bjoin' is twice as fast as 'hjoin' for both small and large inner set sizes.

        Kranti Parisa added a comment -

        Yes, but you might see different results (especially in memory use) when you have long keys. If you don't have memory restrictions then yes, "bjoin" should perform better.

        Kranti Parisa added a comment - - edited

        I have recently extended the hjoin further to support multiple FQs separated by commas (,):

        /masterCore/select?q=*:*&fq=({!hjoin fromIndex=ACore from=parentid to=id v=$aQ fq=$BJoinQ,$AlocalFQ})&aQ=(f1:false)&BJoinQ=({!hjoin fromIndex=BCore from=bid to=aid}tag:abc)&AlocalFQ=(fieldName:value)
        

        This will allow using the filter caches for multiple nested queries while using the hjoin, much as Solr supports multiple FQ params within the same request.

        Any feedback on the syntax? Do comma-separated FQs (e.g. fq=$BJoinQ,$AlocalFQ) sound OK?
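
To make the proposed syntax concrete, here is a rough client-side sketch in Python of how a comma-separated fq value such as $BJoinQ,$AlocalFQ could be split into parameter references and looked up in the request parameters. This is not the patch's code: resolve_fq_refs is a hypothetical helper, and the plain split on commas is an assumption that would break if a filter itself contained a comma.

```python
def resolve_fq_refs(fq_value, params):
    """Split an 'fq' local-param value on commas and dereference each $name."""
    resolved = []
    for ref in fq_value.split(","):
        ref = ref.strip()
        if not ref:
            continue  # tolerate stray commas
        if ref.startswith("$"):
            name = ref[1:]
            if name not in params:
                raise KeyError("undefined parameter reference: " + ref)
            resolved.append(params[name])
        else:
            resolved.append(ref)  # a literal filter, used as-is
    return resolved


# Request parameters taken from the example URL in the comment above.
params = {
    "aQ": "(f1:false)",
    "BJoinQ": "({!hjoin fromIndex=BCore from=bid to=aid}tag:abc)",
    "AlocalFQ": "(fieldName:value)",
}

filters = resolve_fq_refs("$BJoinQ,$AlocalFQ", params)
for f in filters:
    print(f)
```

Each resolved filter is a separate query string, which is what would let each one be cached independently, as with top-level fq parameters.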

        Joel Bernstein added a comment -

        Resolved a memory leak when the bjoin is used with cache autowarming.

        Upayavira added a comment -

        Happy to be ignored, but wouldn't {!bitsetjoin} and {!hashjoin} be more descriptive and therefore more useful? It would mean people would get a more intuitive sense of what this is doing before they have to resort to documentation.

        David Smiley added a comment -

        +1 to {!bitsetjoin} and {!hashjoin}

        Alexander S. added a comment -

        Which release supports {!join} with the fq parameter? I tried 4.5.1, but fq does not seem to have any effect.

        Alexander S. added a comment -

        Just tried 4.7.0 and it does not work either.

        Joel Bernstein added a comment - - edited

        Hi Alexander,

        This ticket has not been committed. There are two joins described on the list of QParserPlugins here:

        https://cwiki.apache.org/confluence/display/solr/Other+Parsers

        Joel

        Alexander S. added a comment - - edited

        Hi Joel, thanks. It seems I need to perform a nested join inside a single collection, but I need fq inside the join, as shown here: https://issues.apache.org/jira/browse/SOLR-4787?focusedCommentId=13750854&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13750854

        I have a single collection with a field, type, that determines the kind of document. There are 3 types of documents: Profile, Site, and SiteSource.
        When searching for Profiles I have to look in SiteSource content, so I need something like this:

        q = {!join from=owner_id_im to=id_i fq=$joinFilter1 v=$joinQuery1} # Profile → Site join
        joinQuery1 = {!join from=site_id_i to=id_i fq=$joinFilter2 v=$joinQuery2} # Site → SiteSource join
        joinQuery2 = {!edismax}my_keywords
        joinFilter1 = "type:Site"
        joinFilter2 = "type:SiteSource"
        

        Right now this works only partially: fq inside {!join} is ignored.
        When can this patch be expected to be merged? Also, will it work the way I've described, or have I misunderstood it?

        Thank you,
        Alex
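
For readers who want to send the nested request above from a client, here is a minimal Python sketch that assembles the parameters and URL-encodes them with the standard library. It only builds the query string; it does not make stock {!join} honor fq. The helper name nested_join_params is hypothetical, and the filter values are written without the surrounding quotes on the assumption that the quotes in the comment are only notation.

```python
from urllib.parse import urlencode, parse_qs


def nested_join_params(keywords):
    """Build the Profile -> Site -> SiteSource nested join parameters."""
    return {
        # Profile -> Site join
        "q": "{!join from=owner_id_im to=id_i fq=$joinFilter1 v=$joinQuery1}",
        # Site -> SiteSource join
        "joinQuery1": "{!join from=site_id_i to=id_i fq=$joinFilter2 v=$joinQuery2}",
        "joinQuery2": "{!edismax}" + keywords,
        "joinFilter1": "type:Site",
        "joinFilter2": "type:SiteSource",
    }


params = nested_join_params("my_keywords")
# urlencode escapes '{', '}', '$', '!' and '=' so the local params survive transport.
query_string = urlencode(params)
print(query_string)
```

Encoding matters here because characters such as $ and { are easy to mangle when the URL is assembled by hand.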

        Kranti Parisa added a comment -

        Alex,

        I will try to create a patch for nested joins today and post it here. Which version of Solr are you using?

        Alexander S. added a comment -

        Hi, 4.4 and 4.7

        Kranti Parisa added a comment - - edited

        Alex,

        You may try the patch ( https://issues.apache.org/jira/secure/attachment/12632860/SOLR-4797-hjoin-multivaluekeys-nestedJoins.patch ) for nested joins.

        Alexander S. added a comment -

        Thank you, Kranti Parisa. I am far from Java development; how can I apply this patch and build Solr for Linux? I applied the patch, which created a new folder "joins" in solr/contrib, then installed ivy and launched "ant compile", but got this error:

        common.compile-core:
        [mkdir] Created dir: /home/heaven/Desktop/solr-4.7.0/solr/build/contrib/solr-joins/classes/java
        [javac] Compiling 3 source files to /home/heaven/Desktop/solr-4.7.0/solr/build/contrib/solr-joins/classes/java
        [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
        [javac] /home/heaven/Desktop/solr-4.7.0/solr/contrib/joins/src/java/org/apache/solr/joins/HashSetJoinQParserPlugin.java:883: error: reached end of file while parsing
        [javac] return this.delegate.acceptsDocsOutOfOrder();
        [javac] ^
        [javac] /home/heaven/Desktop/solr-4.7.0/solr/contrib/joins/src/java/org/apache/solr/joins/HashSetJoinQParserPlugin.java:884: error: reached end of file while parsing
        [javac] 2 errors
        [javac] 1 warning

        BUILD FAILED
        /home/heaven/Desktop/solr-4.7.0/build.xml:106: The following error occurred while executing this line:
        /home/heaven/Desktop/solr-4.7.0/solr/common-build.xml:458: The following error occurred while executing this line:
        /home/heaven/Desktop/solr-4.7.0/solr/common-build.xml:449: The following error occurred while executing this line:
        /home/heaven/Desktop/solr-4.7.0/lucene/common-build.xml:471: The following error occurred while executing this line:
        /home/heaven/Desktop/solr-4.7.0/lucene/common-build.xml:1736: Compile failed; see the compiler error output for details.

        Total time: 8 minutes 55 seconds

        Alexander S. added a comment -

        Never mind, there were 3 missing "}" at the end of HashSetJoinQParserPlugin.java. The build was successful; testing now.
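
A quick way to spot this kind of truncated patch before running a full build is to count braces. The Python sketch below is a naive check, assuming the file has no braces inside string literals or comments; brace_balance is a hypothetical helper and the sample text is made up for illustration.

```python
def brace_balance(source):
    """Return the number of '{' minus the number of '}' in the text.

    Naive on purpose: braces inside strings or comments are counted too.
    """
    depth = 0
    for ch in source:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
    return depth


# A made-up fragment that ends before its blocks are closed.
sample = "class A {\n  void f() {\n    if (x) {\n      return;\n"
missing = brace_balance(sample)
print("missing closing braces:", missing)  # prints: missing closing braces: 3
```

A positive result on the patched source file would point to the same "reached end of file while parsing" error that javac reported.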

        Alexander S. added a comment -

        Kranti,

        Do I need to update anything in my solr config/schema? I've just tried the patched version and it still ignores the fq parameter. I was using solr 4.7.0.

        Thanks,
        Alex

        Kranti Parisa added a comment -

        Alex,

        Are you using HashSetJoin? Did you configure it in solrconfig.xml?

        Alexander S. added a comment -

        Hi, I am using the simple join, this way:

        {!join from=profile_ids_im to=id_i fq=$joinFilter1 v=$joinQuery1}

        Kranti Parisa added a comment -

        Nested joins (fqs) are implemented in HashSetJoin, so after applying the patch you will need to configure it in solrconfig.xml:

        <queryParser name="hjoin" class="org.apache.solr.search.joins.HashSetJoinQParserPlugin"/>

        and use

        {!hjoin from=profile_ids_im to=id_i fq=$joinFilter1 v=$joinQuery1}

        So you are trying to do a self join on the same core?

        Alexander S. added a comment -

        Ok, thx, I'll try with hjoin. And yes, I am trying to do it on the same core.

        Alexander S. added a comment -

        Getting this error:

        RSolr::Error::Http - 500 Internal Server Error
        Error:     {msg=SolrCore 'crm-dev' is not available due to init failure: Error loading class 'org.apache.solr.search.joins.HashSetJoinQParserPlugin',trace=org.apache.solr.common.SolrException: SolrCore 'crm-dev' is not available due to init failure: Error loading class 'org.apache.solr.search.joins.HashSetJoinQParserPlugin'
        	at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:827)
        	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:309)
        	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
        	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
        	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
        	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
        	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
        	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
        	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
        
        Kranti Parisa added a comment -

        Can you post the query?

        Alexander S. added a comment -

        Any query fails; it seems I am doing something wrong (perhaps the patch was applied incorrectly). I see this error:

        SolrCore Initialization Failures
        crm-dev: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.search.joins.HashSetJoinQParserPlugin'

        when trying to access the web interface.
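
One thing that may be worth checking for an "Error loading class" failure is whether the built jar actually contains the class under the package named in solrconfig.xml. The build log earlier in this thread shows the source file under org/apache/solr/joins/, while the configuration above names org.apache.solr.search.joins; whether that difference is the cause here is not confirmed. The Python sketch below uses only the standard zipfile module; jar_has_class is a hypothetical helper, and the in-memory jar is made up so that the example runs without a real build.

```python
import io
import zipfile


def jar_has_class(jar_file, class_name):
    """Return True if the jar (a path or file object) holds class_name's .class entry."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_file) as jar:
        return entry in jar.namelist()


# Made-up jar with a single entry, standing in for the contrib jar from the build.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as jar:
    jar.writestr("org/apache/solr/joins/HashSetJoinQParserPlugin.class", b"")

in_config = jar_has_class(buf, "org.apache.solr.search.joins.HashSetJoinQParserPlugin")
in_build = jar_has_class(buf, "org.apache.solr.joins.HashSetJoinQParserPlugin")
print("class name from solrconfig.xml found:", in_config)
print("class name from build path found:", in_build)
```

With a real build, jar_file would be the path to the contrib jar, and that jar would also need to be on the core's classpath.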

        Gopal Patwa added a comment - - edited

        I am trying to use this patch (27/Jan/14 12:26) with hjoin. The type mismatch issue was resolved; it was my mistake, as my join ids had different types.

        Is it possible to collect data from the hjoin collection (i.e. fromIndex) and append it to the main query result? In my use case I need to use hjoin and also show fields from fromIndex.

        Kranti Parisa added a comment -

        Gopal, you can't get the values using joins. You will need to make a second call with the result (potentially sorted and paginated on firstCore). Using FQs in the first join call, you can hit the caches in the second call. If you need more details, describe your use case.

        Gopal Patwa added a comment -

        Thanks Kranti, here is my use case:

        Event Collection:
        eventId=1
        title=Lady Gaga
        date=06/03/2014

        EventTicketStats Collection
        eventId=1
        minPrice=200
        minQuantity=5

        When a user searches for "lady gaga" on the Event documents using hjoin with EventTicketStats, the result should include the min price and quantity data from the join core.

        Final Result for Event Collection:
        eventId=1
        title=Lady Gaga
        date=06/03/2014
        minPrice=200
        minQuantity=5

        The user also has the option to filter results by price and quantity, e.g. show events with minPrice < 100.
        The reason we keep EventStats in a separate document is that our ticket data changes every 5 seconds, while Event data changes about twice a day.

        I thought about using updatable numeric DocValues after denormalizing the Event document with min price and quantity fields, but Solr does not support that feature yet. So I need to rely on using a join.

        Kranti Parisa added a comment -

        So for any query you might return one or more EVENTS matching the title search terms plus filters.

        Say you have 30 events matching the given criteria but your pagination is 1-10, so you would be displaying the top 10 most relevant EVENTS. This would be the docList of your first query. From the ResponseWriter you would then need to make a call to the TICKETS core, using the original filters plus the 10 event ids, and execute that request (you might need to use LocalSolrQueryRequest, pre-processed filters, etc. to hit the caches of the first query), and collect the field info you need for each EVENT.

        From the joins implementation point of view, there is no way to fetch the values or scores from the secondCore; it would be very costly to do that. You would need to write some custom ResponseWriters etc. that do this, especially considering your requirement of maintaining EVENTS and TICKETS separately. There is also a new feature, Collapse and Expand results, but I am not sure about using it for your use case.
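
The two-call approach described above can be sketched on the client side in Python, which is a simpler stand-in for the custom ResponseWriter route and not the same thing. The sketch assumes the first call has already returned one page of Event documents and the second call has returned the matching EventTicketStats documents; ids_filter and merge_by_key are hypothetical helpers, and the sample data comes from the use case earlier in this thread.

```python
def ids_filter(field, docs):
    """Build a filter such as 'eventId:(1 2 3)' from the current page of docs."""
    ids = [str(d[field]) for d in docs]
    return "%s:(%s)" % (field, " ".join(ids))


def merge_by_key(field, primary_docs, secondary_docs, extra_fields):
    """Copy extra_fields from secondary docs onto primary docs with the same key."""
    by_id = {d[field]: d for d in secondary_docs}
    merged = []
    for doc in primary_docs:
        out = dict(doc)  # leave the input untouched
        match = by_id.get(doc[field], {})
        for name in extra_fields:
            if name in match:
                out[name] = match[name]
        merged.append(out)
    return merged


# Page of results from the first (Event) query.
events = [{"eventId": 1, "title": "Lady Gaga", "date": "06/03/2014"}]
# Filter to send with the second (EventTicketStats) query.
fq = ids_filter("eventId", events)
# Documents the second query is assumed to return.
stats = [{"eventId": 1, "minPrice": 200, "minQuantity": 5}]

merged = merge_by_key("eventId", events, stats, ["minPrice", "minQuantity"])
print(fq)
print(merged)
```

Because only the ids on the current page go into the second request, its cost depends on the page size rather than on the total number of matches.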

        Alexander S. added a comment -

        Kranti Parisa

        Did you try applying this patch to 4.7.0? I downloaded it from here: http://www.apache.org/dyn/closer.cgi/lucene/solr/4.7.0 and then ran the following steps:

        • ant compile
        • ant ivy-bootstrap
        • ant dist
        Then I created a package for my Linux distribution, but no luck: Solr fails to initialize with
        <queryParser name="hjoin" class="org.apache.solr.search.joins.HashSetJoinQParserPlugin"/>
        Kranti Parisa added a comment -

        Alex, I will try that tonight or tomorrow and post my findings.

        Arul Kalaipandian added a comment -

        Last week we tried the patch (SOLR-4787) in our test system, and the performance of hjoin is considerably better than the standard join.

        However, we found the following issues:

        1) With 'int' join fields, bjoin throws ArrayIndexOutOfBoundsException

        bjoin throws ArrayIndexOutOfBoundsException
        
        Caused by: org.apache.solr.client.solrj.SolrServerException: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: -1
                at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:155)
                ... 48 more
        Caused by: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: -1
                at org.apache.solr.joins.BitSetJoinQParserPlugin$BitSetJoinQuery.createWeight(BitSetJoinQParserPlugin.java:282)
                at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:664)
                at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
                at org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:1122)
                at org.apache.solr.search.SolrIndexSearcher.getPositiveDocSet(SolrIndexSearcher.java:825)
                at org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:942)
                at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1399)
                at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1366)
                at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:457)
                at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:410)
                at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
                at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
                at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
                at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150)
                ... 48 more
        Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
                at org.apache.lucene.util.OpenBitSet.get(OpenBitSet.java:174)
                at org.apache.solr.joins.BitSetJoinQParserPlugin$BitSetJoinQuery.createWeight(BitSetJoinQParserPlugin.java:273)
                ... 61 more
        

        2) Test cases with both 'bjoin' and 'hjoin' fail with thread leaks.

        Both hjoin & bjoin (with or without localparam 'threads')
                        Thread[id=29, name=commitScheduler-7-thread-1, state=TIMED_WAITING, group=TGRP-VolatileQueryTest]
                        at sun.misc.Unsafe.park(Native Method)
                        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
                        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
                        at java.util.concurrent.DelayQueue.take(DelayQueue.java:164)
                        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:609)
                        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:602)
                        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
                        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
                        at java.lang.Thread.run(Thread.java:662)
        

        3) 'bjoin' throws NumberFormatException for 'long' join fields.
        It would be nice to validate the field's type before executing the join query.

        Exception with 'long' join fields
        Caused by: java.lang.NumberFormatException: Invalid shift value in prefixCoded bytes (is encoded value really an INT?)
        	at org.apache.lucene.util.NumericUtils.getPrefixCodedIntShift(NumericUtils.java:210)
        	at org.apache.lucene.util.NumericUtils$2.accept(NumericUtils.java:493)
        	at org.apache.lucene.index.FilteredTermsEnum.next(FilteredTermsEnum.java:241)
        	at org.apache.lucene.search.FieldCacheImpl$Uninvert.uninvert(FieldCacheImpl.java:308)
        	at org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:653)
        	at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:212)
        	at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:571)
        	at org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:619)
        	at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:212)
        	at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:571)
        	at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:546)
        	at org.apache.solr.joins.MaxInt.getMax(MaxInt.java:98)
        	at org.apache.solr.joins.BitSetJoinQParserPlugin$BitSetJoinQuery.runJoin(BitSetJoinQParserPlugin.java:405)
        	... 31 more
        

        4) Make 'fromIndex' optional, as in the standard 'join'.

        Caused by: java.lang.NullPointerException
                at org.apache.solr.joins.HashSetJoinQParserPlugin$HashSetJoinQuery.hashCode(HashSetJoinQParserPlugin.java:133)
                at org.apache.solr.search.QueryResultKey.<init>(QueryResultKey.java:50)
                at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1274)
                at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:457)
                at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:410)
                at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
                at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
                at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
                at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150)
                ... 48 more
        

        Index details:
        5 shards with 12 million documents each (11 million docs + 1 million ACLs).
        Both docs and ACLs are in the same core.
        Tested with Solr 4.2.1
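
The field type validation suggested in item 3 could look roughly like the Python sketch below. It is an illustration of the idea only, not code from the patch: the schema is represented as a plain dict from field name to type name, check_join_fields is a hypothetical helper, and the set of supported types is passed in because it differs between the join implementations.

```python
def check_join_fields(schema, from_field, to_field, supported):
    """Return a list of problems; an empty list means the join keys look usable."""
    problems = []
    for name in (from_field, to_field):
        if name not in schema:
            problems.append("unknown field: " + name)
        elif schema[name] not in supported:
            problems.append("%s has unsupported type %s" % (name, schema[name]))
    if not problems and schema[from_field] != schema[to_field]:
        problems.append(
            "type mismatch: %s is %s but %s is %s"
            % (from_field, schema[from_field], to_field, schema[to_field])
        )
    return problems


# Hypothetical schema: field name -> type name.
schema = {"acl_id": "long", "doc_id": "long", "owner": "int", "name": "string"}

# Assumption for the demo: a join that only reads int keys.
long_keys = check_join_fields(schema, "acl_id", "doc_id", {"int"})
# Assumption for the demo: a join that accepts int and long keys.
mixed_keys = check_join_fields(schema, "owner", "doc_id", {"int", "long"})
print(long_keys)
print(mixed_keys)
```

Running a check like this before the join would turn the NumberFormatException above into a clear error message about the field type.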

        Alexander S. added a comment - - edited

        @Kranti Parisa, hi, any luck with this?

        Kranti Parisa added a comment -

        Arul, thanks for posting the findings.

        I don't think LONG fields are supported by bjoin.

        Arul Kalaipandian added a comment - - edited

        New patch (SOLR-4787-with-testcase-fix.patch, for Solr 4.2.1) with the following fixes and improvements:

        • Test case thread leaks: the SolrCore (fromcore) is now released in a 'finally' block.
        • 'fromIndex' is now optional, as in the standard 'join', i.e. we can join across cores or within a single core ('self-join').
        • BitSetJoinQParserPlugin: field validation added (the from and to fields must both be of type 'int').
        • Basic test cases added.
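
        The thread-leak fix in the first bullet comes down to releasing the from-core reference on every exit path. The sketch below illustrates that pattern only; it is not code from the patch. RefCountedCore and JoinReleaseSketch are hypothetical stand-ins written for this example (they are not Solr classes), and the returned value 42 is a placeholder for a join result.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: shows a core reference being released in a finally block. */
final class JoinReleaseSketch {

    /** Runs a simulated join against the from-core and always gives the reference back. */
    static int runJoin(RefCountedCore core, boolean failMidway) {
        RefCountedCore fromCore = core.acquire();
        try {
            if (failMidway) {
                throw new IllegalStateException("join failed on core " + fromCore.name());
            }
            return 42; // placeholder for the number of matching documents
        } finally {
            fromCore.close(); // runs on both the normal and the exceptional path
        }
    }

    public static void main(String[] args) {
        RefCountedCore core = new RefCountedCore("fromcore");
        runJoin(core, false);
        try {
            runJoin(core, true);
        } catch (IllegalStateException expected) {
            // the join failed, but the reference was still released
        }
        System.out.println("open references: " + core.openReferences());
    }
}

/** Hypothetical stand-in for a reference-counted core handle; not Solr's SolrCore. */
final class RefCountedCore implements AutoCloseable {
    private final String name;
    private final AtomicInteger refCount = new AtomicInteger();

    RefCountedCore(String name) {
        this.name = name;
    }

    /** Takes a reference; every call must be paired with close(). */
    RefCountedCore acquire() {
        refCount.incrementAndGet();
        return this;
    }

    /** Gives the reference back. */
    @Override
    public void close() {
        refCount.decrementAndGet();
    }

    int openReferences() {
        return refCount.get();
    }

    String name() {
        return name;
    }
}
```

        Without the finally block, the failing call would leave one reference open, which is the kind of leftover resource a test framework's leak check reports.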
        Uwe Schindler added a comment -

        Move issue to Solr 4.9.

        Alexander S. added a comment -

        It seems join doesn't work as expected, please have a look: http://lucene.472066.n3.nabble.com/Search-results-inconsistency-when-using-joins-td4149810.html

        Kranti Parisa added a comment -

        Alexander S., did you apply this patch to test the joins with fq?
        If you tried it with the default Solr join, note that fq is not a supported parameter there.

        Bill Bell added a comment -

        This seems like a no-brainer. Can we commit this into 5.xxx?

        Bill Bell added a comment -

        To be consistent, can we add an fq parameter?

        Based on post by Yonik:

        The join qparser has no "fq" parameter, so that is ignored.

        -Yonik
        http://heliosearch.org - native code faceting, facet functions,
        sub-facets, off-heap data


          People

          • Assignee:
            Unassigned
            Reporter:
            Joel Bernstein
          • Votes:
            8
            Watchers:
            24
