Solr / SOLR-7954

ArrayIndexOutOfBoundsException from distributed HLL serialization logic when using stats.field={!cardinality=1.0} in a distributed query

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.2.1
    • Fix Version/s: 5.4, 6.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      SolrCloud 4 node cluster.
      Ubuntu 12.04
      OS Type 64 bit

      Description

      User reports indicate that using stats.field={!cardinality=1.0}foo on a field that has extremely high cardinality on a single shard (example: 150K unique values) can lead to "ArrayIndexOutOfBoundsException: 3" on the shard during serialization of the HLL values.

      Using "cardinality=0.9" (or lower) doesn't produce the same symptoms, suggesting the problem is specific to large log2m and regwidth values.
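
      For illustration, a distributed request along the following lines exercises this code path (a hedged SolrJ 5.x sketch; the zkHost, collection, and field names are placeholders, not taken from the report):

      import org.apache.solr.client.solrj.SolrQuery;
      import org.apache.solr.client.solrj.impl.CloudSolrClient;
      import org.apache.solr.client.solrj.response.QueryResponse;

      public class CardinalityQuerySketch {
          public static void main(String[] args) throws Exception {
              CloudSolrClient client = new CloudSolrClient("<ZKHOST>:<ZKPORT>");
              client.setDefaultCollection("<COLLECTION>");

              SolrQuery q = new SolrQuery("*:*");
              q.setRows(0);
              q.set("stats", true);
              // Each shard builds a full-precision HyperLogLog for "foo" and serializes
              // it back to the coordinating node; that serialization step is where the
              // ArrayIndexOutOfBoundsException is reported.
              q.set("stats.field", "{!cardinality=1.0}foo");

              QueryResponse rsp = client.query(q);
              System.out.println(rsp.getFieldStatsInfo().get("foo"));
              client.close();
          }
      }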

      1. SOLR-7954.patch
        12 kB
        Hoss Man
      2. SOLR-7954.patch
        4 kB
        Hoss Man
      3. SOLR-7954.patch
        2 kB
        Hoss Man


          Activity

          Modassar Ather added a comment -

          The schema used is as follows.

          <?xml version="1.0" encoding="UTF-8" ?>
          <schema name="collection" version="1.5">

          <types>
          <fieldType name="string" class="solr.StrField" sortMissingLast="true" stored="false" omitNorms="true"/>
          <fieldType name="string_dv" class="solr.StrField" sortMissingLast="true" stored="false" indexed="false" docValues="true"/>
          <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0" stored="false"/>
          </types>

          <fields>
          <field name="field1" type="string" stored="true" />
          <field name="field" type="string_dv" multiValued="true" />
          <field name="version" type="long" stored="true" />
          <field name="colid" type="string" stored="true" />
          </fields>
          <uniqueKey>colid</uniqueKey>
          </schema>

          Modassar Ather added a comment - - edited

          The following test method can be used to add data with which the exception can be reproduced. Please make the necessary changes:
          change <ZKHOST>:<ZKPORT> to point to an available ZooKeeper host and <COLECTION> to the available collection.

          public void index() throws SolrServerException, IOException {
              CloudSolrClient s = new CloudSolrClient("<ZKHOST>:<ZKPORT>");
              int count = 0;
              s.setDefaultCollection("<COLECTION>");
              List<SolrInputDocument> documents = new ArrayList<>();
              for (int i = 1; i <= 1000000; i++) {
                  SolrInputDocument doc = new SolrInputDocument();
                  doc.addField("field1", i);
                  doc.addField("colid", "val!" + i + "!-" + "ref" + i);
                  doc.addField("field", "DATA" + (12345 + i));
                  documents.add(doc);
                  // send a batch of 10000 documents at a time
                  if ((documents.size() % 10000) == 0) {
                      count = count + 10000;
                      s.add(documents);
                      System.out.println(System.currentTimeMillis() + " - Indexed document # " + NumberFormat.getInstance().format(count));
                      documents = new ArrayList<>();
                  }
              }

              System.out.println("Committing.....................................");
              s.commit(true, true);
              System.out.println("Optimizing.....................................");
              s.optimize(true, true, 1);
              s.close();
              System.out.println("Done.....................................");
          }
          
          Modassar Ather added a comment -

          I tested the following schema with the same data in field and field2. Both reproduced the problem. Then I tried to find whether it is the number of values in the cardinality computation that is causing the issue. With 100000 to 120000 documents both fields returned a cardinality, but after increasing to around 150000 documents the exception occurred.

          <?xml version="1.0" encoding="UTF-8" ?>
          <schema name="collection" version="1.5">
          
          <types>
          <fieldType name="string" class="solr.StrField" sortMissingLast="true" stored="false" omitNorms="true"/>
          <fieldType name="string_dv" class="solr.StrField" sortMissingLast="true" stored="false" indexed="false" docValues="true"/>
          <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0" stored="false"/>
          </types>
          
          <fields>
          <field name="field"         type="string_dv"   multiValued="true" />
          <field name="field1"       type="string"         stored="true" />
          <field name="field2"       type="string"         multiValued="true" />
          <field name="version"    type="long"           stored="true" />
          <field name="colid"        type="string"          stored="true" />
          </fields>
          <uniqueKey>colid</uniqueKey>
          </schema>
          

          The following is the method used to index the data.

          public void index() throws SolrServerException, IOException {
              CloudSolrClient s = new CloudSolrClient("<ZKHOST>:<ZKPORT>");
              int count = 0;
              s.setDefaultCollection("<COLECTION>");
              List<SolrInputDocument> documents = new ArrayList<>();
              for (int i = 1; i <= 1000000; i++) {
                  SolrInputDocument doc = new SolrInputDocument();
                  doc.addField("field1", i);
                  doc.addField("colid", "val!" + i + "!-" + "ref" + i);
                  doc.addField("field", "DATA" + (12345 + i));
                  doc.addField("field2", "DATA" + (12345 + i));
                  documents.add(doc);
                  // send a batch of 10000 documents at a time
                  if ((documents.size() % 10000) == 0) {
                      count = count + 10000;
                      s.add(documents);
                      System.out.println(System.currentTimeMillis() + " - Indexed document # " + NumberFormat.getInstance().format(count));
                      documents = new ArrayList<>();
                  }
              }

              System.out.println("Committing.....................................");
              s.commit(true, true);
              System.out.println("Optimizing.....................................");
              s.optimize(true, true, 1);
              s.close();
              System.out.println("Done.....................................");
          }
          
          Hoss Man added a comment -

          I tested the following schema with the same data in field and field2. Both reproduced the problem.

          OK, good – that means the problem does not actually depend on whether docValues are used – which was the most confusing and surprising part of your initial bug report.

          Then I tried to find whether it is the number of values in the cardinality computation that is causing the issue. With 100000 to 120000 documents both fields returned a cardinality, but after increasing to around 150000 documents the exception occurred.

          OK, so somewhere around 150K docs is the sweet spot.


          Reviewing the code you posted, I noticed a few things:

          1) every doc gets a unique value in the field you are computing stats on
          2) your query matches all docs
          3) because of how your uniqueKey is defined using composite routing keys ("!") every doc will wind up in the same shard.

          The combination of all of these means that ultimately what's causing problems is:

          • building an HLL data structure using the max possible log2m & regwidth options (that's what cardinality=1.0 does)
          • adding ~150K unique(ish) hash values to the HLL
          • serializing the HLL to bytes (which is what happens in a distributed query so the per-shard results can be sent to the coordinating node)

          Based on that, I was able to create a unit test that demonstrates the same underlying ArrayIndexOutOfBoundsException, which I'll attach shortly – I still haven't dug in enough to understand the cause.

          (NOTE: since Solr 5.2.1, we've forked the HLL and imported it directly into the org.apache.solr.util.hll package, but the basic structure/functionality of the various classes is still the same)
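
          For reference, the scenario above can be sketched at the HLL level roughly as follows. This is only a sketch: it assumes the forked org.apache.solr.util.hll.HLL keeps the upstream java-hll API (an HLL(log2m, regwidth) constructor plus addRaw(long) and toBytes()), the parameter values are assumptions rather than the exact ones Solr derives from cardinality=1.0, and a large heap is needed for the full representation:

          import java.util.Random;
          import org.apache.solr.util.hll.HLL;

          public class HllSerializationSketch {
              public static void main(String[] args) {
                  // Large precision options in the spirit of cardinality=1.0 (assumed values).
                  HLL hll = new HLL(30, 8); // log2m, regwidth

                  // Add many distinct "hashed" values, as the stats code would per shard.
                  Random rnd = new Random(42);
                  for (int i = 0; i < 150_000; i++) {
                      hll.addRaw(rnd.nextLong());
                  }

                  // Serializing to bytes is what a shard does to ship its HLL to the
                  // coordinating node; before the fix, this step could throw
                  // ArrayIndexOutOfBoundsException when the byte-length math overflowed.
                  byte[] bytes = hll.toBytes();
                  System.out.println("serialized " + bytes.length + " bytes");
              }
          }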

          Hoss Man added a comment -

          Patch demonstrating the underlying problem – note that because this builds up some pretty large data structures and byte arrays, you need to give it an increased heap to run...

          $ ant test -Dtestcase=BigHllSerializationTest -Dtests.heapsize=2g
          ...
             [junit4] ERROR   0.27s | BigHllSerializationTest.testSerialization <<<
             [junit4]    > Throwable #1: java.lang.ArrayIndexOutOfBoundsException: 3
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([26FE9A72EDF24F3:56B788EF098D196F]:0)
             [junit4]    > 	at org.apache.solr.util.hll.BigEndianAscendingWordSerializer.writeWord(BigEndianAscendingWordSerializer.java:151)
             [junit4]    > 	at org.apache.solr.util.hll.BitVector.getRegisterContents(BitVector.java:244)
             [junit4]    > 	at org.apache.solr.util.hll.HLL.toBytes(HLL.java:908)
             [junit4]    > 	at org.apache.solr.util.hll.HLL.toBytes(HLL.java:859)
             [junit4]    > 	at org.apache.solr.util.hll.BigHllSerializationTest.testSerialization(BigHllSerializationTest.java:41)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
          
          Hoss Man added a comment -

          Cleaned up the summary & description now that we have a better idea of the root symptoms/cause.

          Modassar Ather added a comment -

          To add to the summary and description.

          I changed the

          doc.addField("colid", "val!"+i+"!-"+"ref"+i);

          to

          doc.addField("colid", "val"+i+"!-"+"ref"+i);

          The documents got distributed to all the nodes. I indexed 1 million documents and was able to reproduce the issue. All the shards had around 200000 documents each.
          Later I indexed 400000 documents on which I could not reproduce it. All the shards had around 100000 documents each.
          There are 4 shards with no replica on my test environment.

          Hoss Man added a comment -

          Tracked down the root problem to integer overflow – the HLL code was multiplying 2 large integers (without casting them individually to longs first) and then assigning the (already overflowed) value to a long.

          The attached patch includes the fix, but I want to work on randomizing the test some more to make sure there aren't similar bugs in other code paths.
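
          As a minimal illustration of that overflow pattern (illustrative variable names only, not the actual HLL source):

          public class IntOverflowSketch {
              public static void main(String[] args) {
                  int registerCount = 1 << 30; // e.g. 2^log2m registers for a large log2m
                  int bitsPerRegister = 6;     // register width in bits

                  // int * int overflows before the result is widened to long
                  long broken = registerCount * bitsPerRegister;
                  // casting one operand first promotes the multiplication to long arithmetic
                  long fixed = (long) registerCount * bitsPerRegister;

                  System.out.println(broken); // -2147483648 (already overflowed)
                  System.out.println(fixed);  // 6442450944
              }
          }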

          Hoss Man added a comment -

          Later I indexed 400000 documents on which I could not reproduce it. All the shards had around 100000 documents each.

          There are 4 shards with no replica on my test environment.

          Modassar: as I tried to explain in my earlier comments, the number of shards / documents doesn't really affect the issue – the root problem has to do with the number of unique values in a single shard which are added to the underlying HyperLogLog data structure and then serialized. Doing more testing where you tweak the routing or doc counts may find different bugs, but for this specific bug the core problem is in the HLL serialization code, as affected by the precision options (which are set based on the "cardinality" local param) and the number of unique (hashed) values in each HLL.

          Hoss Man added a comment -

          Reviewing the code & tests more in depth, I realized a few things...

          1. Actually, having a large # of unique values isn't needed to trigger this in the low-level HLL code – you just need to be using the FULL representation with large enough values of log2m and regwidth (which is why at the Solr API level you have to use cardinality=1.0 AND have a lot of unique values – we default to using the sparse representation and only promote to the full representation once a lot of values are added).
          2. The original HLL code's HLLSerializationTest actually had a test that would have caught this bug, but it was hamstrung with this lovely comment...
            // NOTE: log2m<=16 was chosen as the max log2m parameter so that the test
            //       completes in a reasonable amount of time. Not much is gained by
            //       testing larger values - there are no more known serialization
            //       related edge cases that appear as log2m gets even larger.
            // NOTE: This test completed successfully with log2m<=MAXIMUM_LOG2M_PARAM
            //       on 2014-01-30.
            

          Awesome.

          I refactored HLLSerializationTest a bit so we still have the same Nightly test coverage as before, but also some new Monster tests for exercising some random permutations of options for large sized HLLs (with only a few values) as well as some random permutations of HLLs (of various sizes) with lots of values in them. (so my previous BigHllSerializationTest is no longer needed)


          I think this is ready to commit & backport; I'll move forward tomorrow unless there are any concerns.

          ASF subversion and git services added a comment -

          Commit 1697969 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1697969 ]

          SOLR-7954: Fixed an integer overflow bug in the HyperLogLog code used by the 'cardinality' option of stats.field to prevent ArrayIndexOutOfBoundsException in a distributed search when a large precision is selected and a large number of values exist in each shard

          ASF subversion and git services added a comment -

          Commit 1697977 from hossman@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1697977 ]

          SOLR-7954: Fixed an integer overflow bug in the HyperLogLog code used by the 'cardinality' option of stats.field to prevent ArrayIndexOutOfBoundsException in a distributed search when a large precision is selected and a large number of values exist in each shard (merge r1697969)

          Hoss Man added a comment -

          Thanks for reporting this Modassar!

          Hoss Man added a comment -

          Upstream bug report: https://github.com/aggregateknowledge/java-hll/issues/17

          Modassar Ather added a comment -
          q=fl1:net*&facet.field=fl&facet.limit=50&stats=true&stats.field={!cardinality=1.0}fl

          The above query is returning a cardinality of around 15 million and is taking around 4 minutes. Similar response times are seen with other queries that yield high cardinality. Kindly note that cardinality=1.0 is the desired goal.
          In the above example, fl1 is a text field whereas fl is a docValues-enabled, non-stored, non-indexed field.
          Kindly let me know if such a response time is expected or if I am missing something about this feature in my query.


            People

            • Assignee:
              Hoss Man
              Reporter:
              Modassar Ather
            • Votes:
              0
              Watchers:
              2
