Lucene - Core
  1. Lucene - Core
  2. LUCENE-4724

TaxonomyReader drops empty string component from CategoryPath

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2, Trunk
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      I ran the new PrintTaxonomyStats on a Wikipedia facets index, and it hit an AIOOBE because there was a child of the /categories path that had only one component ... this was created because I had added new CategoryPath("categories", "") during indexing.

      I think TaxoReader should preserve and return that empty string from .getPath?

      1. LUCENE-4724.patch
        1 kB
        Michael McCandless
      2. LUCENE-4724.patch
        4 kB
        Michael McCandless
      3. LUCENE-4724.patch
        5 kB
        Shai Erera

        Activity

        Hide
        Michael McCandless added a comment -

        Test case showing the issue ...

        Show
        Michael McCandless added a comment - Test case showing the issue ...
        Hide
        Michael McCandless added a comment -

        New patch w/ fix.

        The problem was String.split: if you end with a delimiter, or have multiple delimiters in a row, then you lose the empty strings ...

        Show
        Michael McCandless added a comment - New patch w/ fix. The problem was String.split: if you end with a delimiter, or have multiple delimiters in a row, then you lose the empty strings ...
        Hide
        Shai Erera added a comment -

        Good catch. But I'm not sure about the fix. I.e. new CP("test","") could also be handled to be equivalent to new CP("test").
        Also, it's not clear if new CP("test///foo",'/') should generate a CP that's similar to test/foo.
        But for sure, new CP("test/", '/') should be equivalent to new CP("test"), right?

        I think that for simplicity, we should eliminate all empty strings, from both ctors. So:

        • new CP("test") == new CP("test", '/') --> "test"
        • new CP("test", "") == new CP("test/", '/') --> "test"
        • new CP("test", "", "foo") == new CP("test//foo", '/') --> "test/foo"

        This is very similar to FS hierarchies I think. And it's easy to document and handle.

        Also, does the test really test the bug? I.e. the only fix is in the CP ctor which takes a String and delimiter, yet the test seems to be using the vararg String... ctor?

        Show
        Shai Erera added a comment - Good catch. But I'm not sure about the fix. I.e. new CP("test","") could also be handled to be equivalent to new CP("test"). Also, it's not clear if new CP("test///foo",'/') should generate a CP that's similar to test/foo. But for sure, new CP("test/", '/') should be equivalent to new CP("test"), right? I think that for simplicity, we should eliminate all empty strings, from both ctors. So: new CP("test") == new CP("test", '/') --> "test" new CP("test", "") == new CP("test/", '/') --> "test" new CP("test", "", "foo") == new CP("test//foo", '/') --> "test/foo" This is very similar to FS hierarchies I think. And it's easy to document and handle. Also, does the test really test the bug? I.e. the only fix is in the CP ctor which takes a String and delimiter, yet the test seems to be using the vararg String... ctor?
        Hide
        Michael McCandless added a comment -

        OK I agree: let's disallow empty string at indexing time.

        I think this means the CP ctor that takes String... varargs should throw an exception if any component is the empty string?

        Not sure what (if anything?) to do about indices "out there" that already have empty string ... I'm not sure these ever causes problems except to PrintTaxonomyStats ... so I could just add some robustness to that one tool.

        However, I don't really like being "tolerant" to trailing delimiter, multiple delimiters in a row, etc. (like filesystems are): I would prefer that we are strict and accept only one form. That ambiguity can only cause problems/confusion.

        Show
        Michael McCandless added a comment - OK I agree: let's disallow empty string at indexing time. I think this means the CP ctor that takes String... varargs should throw an exception if any component is the empty string? Not sure what (if anything?) to do about indices "out there" that already have empty string ... I'm not sure these ever causes problems except to PrintTaxonomyStats ... so I could just add some robustness to that one tool. However, I don't really like being "tolerant" to trailing delimiter, multiple delimiters in a row, etc. (like filesystems are): I would prefer that we are strict and accept only one form. That ambiguity can only cause problems/confusion.
        Hide
        Shai Erera added a comment -

        OK I agree: let's disallow empty string at indexing time.

        And null (for the vararg ctor).

        I think this means the CP ctor that takes String... varargs should throw an exception if any component is the empty string?

        I think that both public ctors should throw IllegalArgException if they encounter empty or null strings.

        And you shouldn't worry about what's out there, because previously, CP converted all strings to a char[] and I'm sure all these cases were handled. And probably TestCategoryPath had tests that covered all that. But in the whole cleanup, I must have removed them by mistake...

        Would you mind to add a testEmptyNullComponents to TestCP? I'm ok with IllegalArgEx for trailing delimiters.

        Show
        Shai Erera added a comment - OK I agree: let's disallow empty string at indexing time. And null (for the vararg ctor). I think this means the CP ctor that takes String... varargs should throw an exception if any component is the empty string? I think that both public ctors should throw IllegalArgException if they encounter empty or null strings. And you shouldn't worry about what's out there, because previously, CP converted all strings to a char[] and I'm sure all these cases were handled. And probably TestCategoryPath had tests that covered all that. But in the whole cleanup, I must have removed them by mistake... Would you mind to add a testEmptyNullComponents to TestCP? I'm ok with IllegalArgEx for trailing delimiters.
        Hide
        Shai Erera added a comment -

        Patch forbids empty or null components in CategoryPath ctors, plus matching test in TestCategoryPath. Mike, the test you added seems unneeded now, as one cannot create such a CP anymore.

        Note though that there was a test in TestCP which asserted an empty path component in the middle of a string. I think it's not an interesting use case, and still should not be allowed.

        I also had to modify another test which randomly generated CategoryPath strings (picking characters at random).

        Show
        Shai Erera added a comment - Patch forbids empty or null components in CategoryPath ctors, plus matching test in TestCategoryPath. Mike, the test you added seems unneeded now, as one cannot create such a CP anymore. Note though that there was a test in TestCP which asserted an empty path component in the middle of a string. I think it's not an interesting use case, and still should not be allowed. I also had to modify another test which randomly generated CategoryPath strings (picking characters at random).
        Hide
        Shai Erera added a comment -

        Also, note that the patch does not forbid passing strings like " " (i.e., not empty, but following trim() will be). I figured that this whole use case is not that common, and calling trim() is not that cheap, for every well-behaved app. Yet, I do want this to be a hard exception, not an assert.

        Show
        Shai Erera added a comment - Also, note that the patch does not forbid passing strings like " " (i.e., not empty, but following trim() will be). I figured that this whole use case is not that common, and calling trim() is not that cheap, for every well-behaved app. Yet, I do want this to be a hard exception, not an assert.
        Hide
        Michael McCandless added a comment -

        +1, thanks Shai!

        I think it's OK if we don't catch whitespace only components ...

        Show
        Michael McCandless added a comment - +1, thanks Shai! I think it's OK if we don't catch whitespace only components ...
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Shai Erera
        http://svn.apache.org/viewvc?view=revision&revision=1439350

        LUCENE-4724: disallow empty or null strings as components

        Show
        Commit Tag Bot added a comment - [trunk commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1439350 LUCENE-4724 : disallow empty or null strings as components
        Hide
        Shai Erera added a comment -

        Committed to trunk and 4x.

        Show
        Shai Erera added a comment - Committed to trunk and 4x.
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Shai Erera
        http://svn.apache.org/viewvc?view=revision&revision=1439355

        LUCENE-4724: disallow empty or null strings as components

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1439355 LUCENE-4724 : disallow empty or null strings as components
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.

          People

          • Assignee:
            Shai Erera
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development