Lucene - Core
  1. Lucene - Core
  2. LUCENE-4724

TaxonomyReader drops empty string component from CategoryPath

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2, Trunk
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      I ran the new PrintTaxonomyStats on a Wikipedia facets index, and it hit an AIOOBE because there was a child of the /categories path that had only one component ... this was created because I had added new CategoryPath("categories", "") during indexing.

      I think TaxoReader should preserve and return that empty string from .getPath?

      1. LUCENE-4724.patch
        5 kB
        Shai Erera
      2. LUCENE-4724.patch
        4 kB
        Michael McCandless
      3. LUCENE-4724.patch
        1 kB
        Michael McCandless

        Activity

        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Open Open Resolved Resolved
        23h 34m 1 Shai Erera 28/Jan/13 11:37
        Resolved Resolved Closed Closed
        101d 22h 56m 1 Uwe Schindler 10/May/13 11:33
        Uwe Schindler made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Shai Erera
        http://svn.apache.org/viewvc?view=revision&revision=1439355

        LUCENE-4724: disallow empty or null strings as components

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1439355 LUCENE-4724 : disallow empty or null strings as components
        Shai Erera made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Lucene Fields New [ 10121 ] New,Patch Available [ 10121, 10120 ]
        Assignee Shai Erera [ shaie ]
        Resolution Fixed [ 1 ]
        Hide
        Shai Erera added a comment -

        Committed to trunk and 4x.

        Show
        Shai Erera added a comment - Committed to trunk and 4x.
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Shai Erera
        http://svn.apache.org/viewvc?view=revision&revision=1439350

        LUCENE-4724: disallow empty or null strings as components

        Show
        Commit Tag Bot added a comment - [trunk commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1439350 LUCENE-4724 : disallow empty or null strings as components
        Hide
        Michael McCandless added a comment -

        +1, thanks Shai!

        I think it's OK if we don't catch whitespace only components ...

        Show
        Michael McCandless added a comment - +1, thanks Shai! I think it's OK if we don't catch whitespace only components ...
        Hide
        Shai Erera added a comment -

        Also, note that the patch does not forbid passing strings like " " (i.e., not empty, but following trim() will be). I figured that this whole use case is not that common, and calling trim() is not that cheap, for every well-behaved app. Yet, I do want this to be a hard exception, not an assert.

        Show
        Shai Erera added a comment - Also, note that the patch does not forbid passing strings like " " (i.e., not empty, but following trim() will be). I figured that this whole use case is not that common, and calling trim() is not that cheap, for every well-behaved app. Yet, I do want this to be a hard exception, not an assert.
        Shai Erera made changes -
        Attachment LUCENE-4724.patch [ 12566746 ]
        Hide
        Shai Erera added a comment -

        Patch forbids empty or null components in CategoryPath ctors, plus matching test in TestCategoryPath. Mike, the test you added seems unneeded now, as one cannot create such a CP anymore.

        Note though that there was a test in TestCP which asserted an empty path component in the middle of a string. I think it's not an interesting use case, and still should not be allowed.

        I also had to modify another test which randomly generated CategoryPath strings (picking characters at random).

        Show
        Shai Erera added a comment - Patch forbids empty or null components in CategoryPath ctors, plus matching test in TestCategoryPath. Mike, the test you added seems unneeded now, as one cannot create such a CP anymore. Note though that there was a test in TestCP which asserted an empty path component in the middle of a string. I think it's not an interesting use case, and still should not be allowed. I also had to modify another test which randomly generated CategoryPath strings (picking characters at random).
        Hide
        Shai Erera added a comment -

        OK I agree: let's disallow empty string at indexing time.

        And null (for the vararg ctor).

        I think this means the CP ctor that takes String... varargs should throw an exception if any component is the empty string?

        I think that both public ctors should throw IllegalArgException if they encounter empty or null strings.

        And you shouldn't worry about what's out there, because previously, CP converted all strings to a char[] and I'm sure all these cases were handled. And probably TestCategoryPath had tests that covered all that. But in the whole cleanup, I must have removed them by mistake...

        Would you mind to add a testEmptyNullComponents to TestCP? I'm ok with IllegalArgEx for trailing delimiters.

        Show
        Shai Erera added a comment - OK I agree: let's disallow empty string at indexing time. And null (for the vararg ctor). I think this means the CP ctor that takes String... varargs should throw an exception if any component is the empty string? I think that both public ctors should throw IllegalArgException if they encounter empty or null strings. And you shouldn't worry about what's out there, because previously, CP converted all strings to a char[] and I'm sure all these cases were handled. And probably TestCategoryPath had tests that covered all that. But in the whole cleanup, I must have removed them by mistake... Would you mind to add a testEmptyNullComponents to TestCP? I'm ok with IllegalArgEx for trailing delimiters.
        Hide
        Michael McCandless added a comment -

        OK I agree: let's disallow empty string at indexing time.

        I think this means the CP ctor that takes String... varargs should throw an exception if any component is the empty string?

        Not sure what (if anything?) to do about indices "out there" that already have empty string ... I'm not sure these ever causes problems except to PrintTaxonomyStats ... so I could just add some robustness to that one tool.

        However, I don't really like being "tolerant" to trailing delimiter, multiple delimiters in a row, etc. (like filesystems are): I would prefer that we are strict and accept only one form. That ambiguity can only cause problems/confusion.

        Show
        Michael McCandless added a comment - OK I agree: let's disallow empty string at indexing time. I think this means the CP ctor that takes String... varargs should throw an exception if any component is the empty string? Not sure what (if anything?) to do about indices "out there" that already have empty string ... I'm not sure these ever causes problems except to PrintTaxonomyStats ... so I could just add some robustness to that one tool. However, I don't really like being "tolerant" to trailing delimiter, multiple delimiters in a row, etc. (like filesystems are): I would prefer that we are strict and accept only one form. That ambiguity can only cause problems/confusion.
        Hide
        Shai Erera added a comment -

        Good catch. But I'm not sure about the fix. I.e. new CP("test","") could also be handled to be equivalent to new CP("test").
        Also, it's not clear if new CP("test///foo",'/') should generate a CP that's similar to test/foo.
        But for sure, new CP("test/", '/') should be equivalent to new CP("test"), right?

        I think that for simplicity, we should eliminate all empty strings, from both ctors. So:

        • new CP("test") == new CP("test", '/') --> "test"
        • new CP("test", "") == new CP("test/", '/') --> "test"
        • new CP("test", "", "foo") == new CP("test//foo", '/') --> "test/foo"

        This is very similar to FS hierarchies I think. And it's easy to document and handle.

        Also, does the test really test the bug? I.e. the only fix is in the CP ctor which takes a String and delimiter, yet the test seems to be using the vararg String... ctor?

        Show
        Shai Erera added a comment - Good catch. But I'm not sure about the fix. I.e. new CP("test","") could also be handled to be equivalent to new CP("test"). Also, it's not clear if new CP("test///foo",'/') should generate a CP that's similar to test/foo. But for sure, new CP("test/", '/') should be equivalent to new CP("test"), right? I think that for simplicity, we should eliminate all empty strings, from both ctors. So: new CP("test") == new CP("test", '/') --> "test" new CP("test", "") == new CP("test/", '/') --> "test" new CP("test", "", "foo") == new CP("test//foo", '/') --> "test/foo" This is very similar to FS hierarchies I think. And it's easy to document and handle. Also, does the test really test the bug? I.e. the only fix is in the CP ctor which takes a String and delimiter, yet the test seems to be using the vararg String... ctor?
        Michael McCandless made changes -
        Attachment LUCENE-4724.patch [ 12566665 ]
        Hide
        Michael McCandless added a comment -

        New patch w/ fix.

        The problem was String.split: if you end with a delimiter, or have multiple delimiters in a row, then you lose the empty strings ...

        Show
        Michael McCandless added a comment - New patch w/ fix. The problem was String.split: if you end with a delimiter, or have multiple delimiters in a row, then you lose the empty strings ...
        Michael McCandless made changes -
        Field Original Value New Value
        Attachment LUCENE-4724.patch [ 12566664 ]
        Hide
        Michael McCandless added a comment -

        Test case showing the issue ...

        Show
        Michael McCandless added a comment - Test case showing the issue ...
        Michael McCandless created issue -

          People

          • Assignee:
            Shai Erera
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development