Lucene - Core
  1. Lucene - Core
  2. LUCENE-5048

NegativeArraySizeException caused by ff.addFields

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.2, 4.3
    • Fix Version/s: 4.4, Trunk
    • Component/s: modules/facet
    • Labels:
      None
    • Environment:

      Not sure if this is applicable, but it's Gentoo Linux with java 1.7 on a dual intel xeon lga2011 with supermicro server motherboard and 256GB of ram

    • Lucene Fields:
      New, Patch Available

      Description

      I have a Server/Client software that I have created which has a server process that accepts connections from clients that transmit data about local connection information. This data is than buffered and a ThreadPoolExecutor runs to take the data and put it into a lucene index as well as a facet index. This works perfect for the lucene index, but the facet index randomly generates a NegativeArraySizeException. I cannot find any reason why the exception would be caused because lines with the same type of data do not throw it, then all of a sudden the exception is thrown, typically 4 of them in a row. I talked with mikemccand on IRC and he requested I submit this issue.

      After some discussion, he seems to think it's because some of the values I am using are rather large.

      Here is the exception...
      java.lang.NegativeArraySizeException
      at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:64)
      at java.lang.StringBuilder.<init>(StringBuilder.java:97)
      at org.apache.lucene.facet.taxonomy.writercache.cl2o.CharBlockArray.subSequence(CharBlockArray.java:164)
      at org.apache.lucene.facet.taxonomy.writercache.cl2o.CategoryPathUtils.hashCodeOfSerialized(CategoryPathUtils.java:50)
      at org.apache.lucene.facet.taxonomy.writercache.cl2o.CompactLabelToOrdinal.stringHashCode(CompactLabelToOrdinal.java:294)
      at org.apache.lucene.facet.taxonomy.writercache.cl2o.CompactLabelToOrdinal.grow(CompactLabelToOrdinal.java:184)
      at org.apache.lucene.facet.taxonomy.writercache.cl2o.CompactLabelToOrdinal.addLabel(CompactLabelToOrdinal.java:116)
      at org.apache.lucene.facet.taxonomy.writercache.cl2o.Cl2oTaxonomyWriterCache.put(Cl2oTaxonomyWriterCache.java:84)
      at org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter.addToCache(DirectoryTaxonomyWriter.java:592)
      at org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter.addCategoryDocument(DirectoryTaxonomyWriter.java:551)
      at org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter.internalAddCategory(DirectoryTaxonomyWriter.java:501)
      at org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter.internalAddCategory(DirectoryTaxonomyWriter.java:494)
      at org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter.addCategory(DirectoryTaxonomyWriter.java:468)
      at org.apache.lucene.facet.index.FacetFields.addFields(FacetFields.java:175)
      at net.domain.NetstatIndexer.IndexJob.run(IndexJob.java:73)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:722)

      Here is an example data entry which appears when the exception occurs...
      Location: nj
      LocalIP: 10.1.200.187
      RemoteIP: 41.161.197.166
      LocalPorts: [443]
      Connections: 1
      Times: [120]
      Timestamp: 2013-06-09T12:51:00.000-07:00
      States: ["Established"]

      And here is the the code stripped down to provide an example of how I am handling the facet/doc code.

      doc.add(new TextField("Location", ehost[0], Field.Store.YES));
      cats.add(new CategoryPath("Location", doc.get("Location")));
      doc.add(new TextField("LocalIP", (String) stat.get("LocalIP"), Field.Store.YES));
      cats.add(new CategoryPath("LocalIP", doc.get("LocalIP")));
      doc.add(new TextField("RemoteIP", (String) stat.get("RemoteIP"), Field.Store.YES));
      cats.add(new CategoryPath("RemoteIP", doc.get("RemoteIP")));
      doc.add(new TextField("LocalPorts", StringUtils.join(stat.get("LocalPorts"), ","), Field.Store.YES));
      cats.add(new CategoryPath("LocalPorts", doc.get("LocalPorts")));
      doc.add(new TextField("RemotePorts", StringUtils.join(stat.get("RemotePorts"), ","), Field.Store.YES));
      cats.add(new CategoryPath("RemotePorts", doc.get("RemotePorts")));
      doc.add(new LongField("Connections", (Long) stat.get("Connections"), Field.Store.YES));
      cats.add(new CategoryPath("Connections", doc.get("Connections")));
      doc.add(new TextField("Times", StringUtils.join(stat.get("Times"), ","), Field.Store.YES));
      cats.add(new CategoryPath("Times", doc.get("Times")));
      doc.add(new TextField("Timestamp", (String) stat.get("Timestamp"), Field.Store.YES));
      cats.add(new CategoryPath("Timestamp", doc.get("Timestamp")));
      doc.add(new TextField("States", StringUtils.join(stat.get("States"), ","), Field.Store.YES));
      cats.add(new CategoryPath("States", doc.get("States")));
      System.out.println("Location: "doc.get("Location")" LocalIP: "doc.get("LocalIP")" RemoteIP: "doc.get("RemoteIP")" LocalPorts: "doc.get("LocalPorts")" Connections: "doc.get("Connections")" Times: "doc.get("Times")" Timestamp: "doc.get("Timestamp")" States: "+doc.get("States"));
      if (cats.size()!=0)

      { FacetFields ff = new FacetFields(Main.twriter); ff.addFields(doc, cats); // <-- Exception occurs here randomly }
      1. LUCENE-5048.patch
        2 kB
        Michael McCandless
      2. LUCENE-5048.patch
        7 kB
        Shai Erera
      3. LUCENE-5048.patch
        9 kB
        Shai Erera
      4. LUCENE-5048.patch
        11 kB
        Shai Erera

        Activity

        Hide
        Michael McCandless added a comment -

        Here's a test showing the issue.

        What makes it tricky is it's not until CompactLabelToOrdinal goes to rehash that the previously successfully indexed too-large value causes the exception, even though the current doc being indexed is fine.

        The problem is the casts to (short) in CategoryPathUtils, e.g.:

            int length = (short) charBlockArray.charAt(offset++);
        

        I think those should be cast to (int) instead? (Test passes if I fix all 3). Does anyone know why they are using (short) now?

        But, separately, indexing such immense (> 32 KB) facet labels really doesn't make much sense? Ie, most likely it's an accident in the app ... I think somewhere "higher up" we should catch this and throw IllegalArgumentException.

        Show
        Michael McCandless added a comment - Here's a test showing the issue. What makes it tricky is it's not until CompactLabelToOrdinal goes to rehash that the previously successfully indexed too-large value causes the exception, even though the current doc being indexed is fine. The problem is the casts to (short) in CategoryPathUtils, e.g.: int length = (short) charBlockArray.charAt(offset++); I think those should be cast to (int) instead? (Test passes if I fix all 3). Does anyone know why they are using (short) now? But, separately, indexing such immense (> 32 KB) facet labels really doesn't make much sense? Ie, most likely it's an accident in the app ... I think somewhere "higher up" we should catch this and throw IllegalArgumentException.
        Hide
        Colton Jamieson added a comment -

        I modified my code so that it declared the categorypaths separately instead of as a string joined together, which makes sense to do anyway...

        JSONArray states = (JSONArray) stat.get("States");
        for (Object state : states)

        { cats.add(new CategoryPath("States", (String) state)); }

        Since doing this I have not see the issue, so it does appear Michael was correct, this was caused by having categorypaths with values to large. Thanks!

        Show
        Colton Jamieson added a comment - I modified my code so that it declared the categorypaths separately instead of as a string joined together, which makes sense to do anyway... JSONArray states = (JSONArray) stat.get("States"); for (Object state : states) { cats.add(new CategoryPath("States", (String) state)); } Since doing this I have not see the issue, so it does appear Michael was correct, this was caused by having categorypaths with values to large. Thanks!
        Hide
        Shai Erera added a comment -

        Good catch. The test doesn't always reproduce though, it repro with this seed: -Dtests.seed=5229F4C07527089D.
        I modified it to fail on even less categories, just had to initialize a Cl2oTaxoWriterCache with smaller initial capacity.

        Anyway, the problem is the cast to short. In Java, short (and int) are signed. So what happens is:

        • The test generates a random unicode string of length 32767, but its length() is 65534 which is 0xFFFE.
        • Cast that to short, you get -2.
        • Cast it to char, you get 0xFFFE (since char is the only primitive that's unsigned)
        • And of course casting to int is unnecessary.

        The code serializes the length of each component in a char[], as a char. During serialization, it casts to char, since it appends it to a char[]. During deserialization, it wrongly casts to short (should cast to char, which you then get a warning on unnecessary cast). I guess this never showed up since nobody yet tried to index components of length 16K+ .

        So to fix this we indeed need to remove the cast to short. But also, there's a bigger bug here – since the length is assumed to be less than 65536, if you index a very large category twice (or a substring of it), I think there will be an issue. So I added to test adding same category again, after re-hash, and voila, it was added again, with a different ordinal. Therefore I added a check in CategoryPath which prevents the creation of CPs with very large components (> 65535 chars). I'll post a modified patch soon.

        Show
        Shai Erera added a comment - Good catch. The test doesn't always reproduce though, it repro with this seed: -Dtests.seed=5229F4C07527089D . I modified it to fail on even less categories, just had to initialize a Cl2oTaxoWriterCache with smaller initial capacity. Anyway, the problem is the cast to short. In Java, short (and int) are signed. So what happens is: The test generates a random unicode string of length 32767, but its length() is 65534 which is 0xFFFE. Cast that to short, you get -2. Cast it to char, you get 0xFFFE (since char is the only primitive that's unsigned) And of course casting to int is unnecessary. The code serializes the length of each component in a char[], as a char. During serialization, it casts to char, since it appends it to a char[]. During deserialization, it wrongly casts to short (should cast to char, which you then get a warning on unnecessary cast). I guess this never showed up since nobody yet tried to index components of length 16K+ . So to fix this we indeed need to remove the cast to short. But also, there's a bigger bug here – since the length is assumed to be less than 65536, if you index a very large category twice (or a substring of it), I think there will be an issue. So I added to test adding same category again, after re-hash, and voila, it was added again, with a different ordinal. Therefore I added a check in CategoryPath which prevents the creation of CPs with very large components (> 65535 chars). I'll post a modified patch soon.
        Hide
        Shai Erera added a comment -

        Patch addresses the issues mentioned in the previous comment.

        Show
        Shai Erera added a comment - Patch addresses the issues mentioned in the previous comment.
        Hide
        Michael McCandless added a comment -

        Thanks Shai!

        Maybe we should set the limit to match indexer's limit (slightly less than 32 KB I think)? Both limits are insanely large .... I wonder if they should be much smaller e.g. 256 bytes.

        Second, can you fix testHugeLabel to actually index exactly the max length label?

        Show
        Michael McCandless added a comment - Thanks Shai! Maybe we should set the limit to match indexer's limit (slightly less than 32 KB I think)? Both limits are insanely large .... I wonder if they should be much smaller e.g. 256 bytes. Second, can you fix testHugeLabel to actually index exactly the max length label?
        Hide
        Shai Erera added a comment -

        Thanks Mike. Patch indexes the facets, but has a troubling nocommit. If you run the test with -Dtests.seed=919CD019C8A94C60, it fails to DD on the term. I print the $facet terms and it's not there!

        I played with the length of the term (exhaustive binary search ) and figured that if it's set to 10,920 (something) it passes but 10,921 (something) it fails. Tracked it down to TermsHashPerField which silently (!?) drops long terms (exceeding 32766 bytes).

        The problem is, I don't know what limit is safe to use. Following the code, seems that 32766/4? The test passes with different seeds and different lengths... But that's not documented anywhere (not the limit, nor the silent dropping of terms), at least no documentation that I found.

        Now, notice how I wrote 10921 (something) above – that's another trick _TestUtil.randomRealistic plays on me. Sometimes if you ask it to generate a 100 'length' string, the result String.length() says 100 and sometimes 200 . I'll repro that and discuss separately.

        So what's the best course of action? Expose the DWPT.MAX_TERM_LENGTH_UTF8 and then limit a CP component to X/4? But that doesn't solve the other problem, that a DD-term can still be silently dropped. So seems like we should limit a CP total encoded length (w/ the separator) to MAX/4?

        Unlike document content terms, I think it's bad if we silently drop DD terms. I prefer that we fail fast on that. We can do that in DrillDownStream - if the termAttribute.length is > MAX/4, refused to add that category?

        Show
        Shai Erera added a comment - Thanks Mike. Patch indexes the facets, but has a troubling nocommit. If you run the test with -Dtests.seed=919CD019C8A94C60, it fails to DD on the term. I print the $facet terms and it's not there! I played with the length of the term (exhaustive binary search ) and figured that if it's set to 10,920 (something) it passes but 10,921 (something) it fails. Tracked it down to TermsHashPerField which silently (!?) drops long terms (exceeding 32766 bytes). The problem is, I don't know what limit is safe to use. Following the code, seems that 32766/4? The test passes with different seeds and different lengths... But that's not documented anywhere (not the limit, nor the silent dropping of terms), at least no documentation that I found. Now, notice how I wrote 10921 (something) above – that's another trick _TestUtil.randomRealistic plays on me. Sometimes if you ask it to generate a 100 'length' string, the result String.length() says 100 and sometimes 200 . I'll repro that and discuss separately. So what's the best course of action? Expose the DWPT.MAX_TERM_LENGTH_UTF8 and then limit a CP component to X/4? But that doesn't solve the other problem, that a DD-term can still be silently dropped. So seems like we should limit a CP total encoded length (w/ the separator) to MAX/4? Unlike document content terms, I think it's bad if we silently drop DD terms. I prefer that we fail fast on that. We can do that in DrillDownStream - if the termAttribute.length is > MAX/4, refused to add that category?
        Hide
        Michael McCandless added a comment -

        We can do that in DrillDownStream - if the termAttribute.length is > MAX/4, refused to add that category?

        +1

        Show
        Michael McCandless added a comment - We can do that in DrillDownStream - if the termAttribute.length is > MAX/4, refused to add that category? +1
        Hide
        Shai Erera added a comment -

        I added CP.MAX_CATEGORY_PATH_LENGTH = (BYTE_BLOCK_SIZE - 2) / 4 - so now you cannot create such a long CP at all (ctors verify that).

        I also added _TestUtil.randomSimpleString(r,min,max) and modified the test to use it instead of randomRealistic – since we limit by characters, not codepoints, using randomRealisticUnicodeString sometimes failed on too large components, which is not what the test checks.

        I think now it's ready!

        Show
        Shai Erera added a comment - I added CP.MAX_CATEGORY_PATH_LENGTH = (BYTE_BLOCK_SIZE - 2) / 4 - so now you cannot create such a long CP at all (ctors verify that). I also added _TestUtil.randomSimpleString(r,min,max) and modified the test to use it instead of randomRealistic – since we limit by characters, not codepoints, using randomRealisticUnicodeString sometimes failed on too large components, which is not what the test checks. I think now it's ready!
        Hide
        Michael McCandless added a comment -

        +1

        One minor thing: you no longer have to check for \u001f: randomSimpleString never returns that char.

        Show
        Michael McCandless added a comment - +1 One minor thing: you no longer have to check for \u001f: randomSimpleString never returns that char.
        Hide
        Shai Erera added a comment -

        I agree. Technically it's even wrong to use that char specifically - if at all we should use the constant from FacetIndexingParams. But I'll remove.

        Show
        Shai Erera added a comment - I agree. Technically it's even wrong to use that char specifically - if at all we should use the constant from FacetIndexingParams. But I'll remove.
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] shaie
        http://svn.apache.org/viewvc?view=revision&revision=1491562

        LUCENE-5048: limit CategoryPath total length

        Show
        Commit Tag Bot added a comment - [trunk commit] shaie http://svn.apache.org/viewvc?view=revision&revision=1491562 LUCENE-5048 : limit CategoryPath total length
        Hide
        Shai Erera added a comment -

        Committed to trunk and 4x. Thanks Colton for reporting!

        Show
        Shai Erera added a comment - Committed to trunk and 4x. Thanks Colton for reporting!
        Hide
        Rob Audenaerde added a comment -

        Wow. Thanks I happened to trigger this bug yesterday, and found it very hard to reproduce. (I do have a testset which triggers this bug in about 4 seconds; if it has value I'll attach it to this issue).

        Show
        Rob Audenaerde added a comment - Wow. Thanks I happened to trigger this bug yesterday, and found it very hard to reproduce. (I do have a testset which triggers this bug in about 4 seconds; if it has value I'll attach it to this issue).
        Hide
        Shai Erera added a comment -

        Thanks Rob. The bug is already fixed (will be released in 4.4) and I was able to create a very short testcase (which adds only few categories) to reproduce it.

        Show
        Shai Erera added a comment - Thanks Rob. The bug is already fixed (will be released in 4.4) and I was able to create a very short testcase (which adds only few categories) to reproduce it.
        Hide
        Steve Rowe added a comment -

        Bulk close resolved 4.4 issues

        Show
        Steve Rowe added a comment - Bulk close resolved 4.4 issues

          People

          • Assignee:
            Shai Erera
            Reporter:
            Colton Jamieson
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development