Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1, 6.0
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      CategoryPath is supposed to be a simple object which holds a category path's components, and offers some utility methods that can be used during indexing and search.

      Currently, it exposes lots of methods which aren't used, unless by tests - I want to get rid of them. Also, the internal implementation manages 3 char[] for holding the path components, while I think it would have been simpler if it maintained a String[]. I'd like to explore that option too (the input is anyway String, so why copy char[]?).

      Ultimately, I'd like CategoryPath to be immutable. I was able to get rid most of the mutable methods. The ones that remain will probably go away when I move from char[] to String[]. Immuntability is important because in various places in the code we convert a CategoryPath back and forth to String, with TODOs to stop doing that if CP was immutable.

      Will attach a patch that covers the first step - get rid of unneeded methods and beginning to make it immutable.

      Perhaps this can be done in multiple commits?

      1. LUCENE-4659.patch
        153 kB
        Shai Erera
      2. LUCENE-4659.patch
        153 kB
        Shai Erera
      3. LUCENE-4659.patch
        52 kB
        Shai Erera

        Activity

        Hide
        Shai Erera added a comment -

        Initial cleanup (already cuts down CategoryPath by ~200 lines). More to follow.

        Show
        Shai Erera added a comment - Initial cleanup (already cuts down CategoryPath by ~200 lines). More to follow.
        Hide
        Michael McCandless added a comment -

        +1

        I agree CP should be immutable.

        Show
        Michael McCandless added a comment - +1 I agree CP should be immutable.
        Hide
        Gilad Barkai added a comment -

        I think I wished for CP to become immutable as one of my birthday wishes (while taking out the candles).. these do come true!
        That's a very promising start, way to go!

        CP's internals of char[] was for performance and reusability, I thought it was used internally in the CL2O cache - but now I see that it does not? hmm..
        Other methods which could be killed, and are in the way of immutability: .setFromSerializable() and .add() which are only used in tests.

        Than, the only method which breaks immutability is .trim(), and it could be managed with a C'tor.

        Show
        Gilad Barkai added a comment - I think I wished for CP to become immutable as one of my birthday wishes (while taking out the candles).. these do come true! That's a very promising start, way to go! CP's internals of char[] was for performance and reusability, I thought it was used internally in the CL2O cache - but now I see that it does not? hmm.. Other methods which could be killed, and are in the way of immutability: .setFromSerializable() and .add() which are only used in tests. Than, the only method which breaks immutability is .trim() , and it could be managed with a C'tor.
        Hide
        Shai Erera added a comment -

        I've actually made more progress, and now CP is completely immutable and manages inside a String[] with start/end offsets. I think that I just need an end offset, but nm that for now.

        What fails for me is CompactLabelToOrdinalTest, I'm not sure why. From the debugging sessions I've done so far, it may be related to hash computation, even though I reviewed the changes and I think that I was consistent. Anyway, want to debug it more before I'll post another patch.

        I've kept trim() because it's used in DrillDownStream, but it returns a new CategoryPath. And I've kept add() because it is convenient for tests, but it also returns a new CategoryPath.

        Hopefully my next update will include the fix for this test. All other tests pass.

        Show
        Shai Erera added a comment - I've actually made more progress, and now CP is completely immutable and manages inside a String[] with start/end offsets. I think that I just need an end offset, but nm that for now. What fails for me is CompactLabelToOrdinalTest, I'm not sure why. From the debugging sessions I've done so far, it may be related to hash computation, even though I reviewed the changes and I think that I was consistent. Anyway, want to debug it more before I'll post another patch. I've kept trim() because it's used in DrillDownStream, but it returns a new CategoryPath. And I've kept add() because it is convenient for tests, but it also returns a new CategoryPath. Hopefully my next update will include the fix for this test. All other tests pass.
        Hide
        Shai Erera added a comment -

        More cleanup. Here's a summary of all the changes:

        • CategoryPath is now immutable. As such it does not implement Cloneable and clone() is removed.
        • CategoryPath.add is removed - used only by tests, and they were converted to use alternative methods, e.g. subpath().
        • CategoryPath.subpath returns a sub-path of the CategoryPath (used to traverse parents)
        • CategoryPath exposes two public final 'components' and 'length' members. This removed the length() and getComponent() methods (getComponent was only used in test). I added jdocs to these two members.
        • CategoryPath no longer implements Serializable.
        • CategoryPath.*serialize* methods moved to CategoryPathUtils under the cl2o package, as it's the only one that uses them.
        • All 'prefixLen' variants were removed. Instead one should call cp.subpath(prefix).
          • This allowed removing all prefixLen variants from TaxoWriterCache, and nuke 3 duplicate methods in DirTaxoWriter.
        • Handled a TODO in DirTaxoReader - since now CategoryPath is immutable, it can hold caches to-from CategoryPath instead of String.

        CategoryPath is now 215 lines, instead of ~1100 !

        Show
        Shai Erera added a comment - More cleanup. Here's a summary of all the changes: CategoryPath is now immutable. As such it does not implement Cloneable and clone() is removed. CategoryPath.add is removed - used only by tests, and they were converted to use alternative methods, e.g. subpath(). CategoryPath.subpath returns a sub-path of the CategoryPath (used to traverse parents) CategoryPath exposes two public final 'components' and 'length' members. This removed the length() and getComponent() methods (getComponent was only used in test). I added jdocs to these two members. CategoryPath no longer implements Serializable. CategoryPath.*serialize* methods moved to CategoryPathUtils under the cl2o package, as it's the only one that uses them. All 'prefixLen' variants were removed. Instead one should call cp.subpath(prefix). This allowed removing all prefixLen variants from TaxoWriterCache, and nuke 3 duplicate methods in DirTaxoWriter. Handled a TODO in DirTaxoReader - since now CategoryPath is immutable, it can hold caches to-from CategoryPath instead of String. CategoryPath is now 215 lines, instead of ~1100 !
        Hide
        Gilad Barkai added a comment -

        Patch looks really good.

        I'm not concerned about the new objects for subPath. Actually, since the HashMap s are now against CategoryPath and it's not constantly being translated to/from String I'm looking forward a better performance than before.

        Nice job, and a nasty one that must have been..
        Chapeau à lui!

        A few (minor) comments:

        • copyFullPath() - numCharsCopied is redundant? The return value could have been (idx + component[upto].length() - start)
        • equals() line 143 - perhaps use only one index rather than both index j and i ? Also, CPs are more likely to be different at the end, than in the start (e.g further away from the root than the dimension) - perhaps iterate in reverse (up the tree)?
        Show
        Gilad Barkai added a comment - Patch looks really good. I'm not concerned about the new objects for subPath. Actually, since the HashMap s are now against CategoryPath and it's not constantly being translated to/from String I'm looking forward a better performance than before. Nice job, and a nasty one that must have been.. Chapeau à lui! A few (minor) comments: copyFullPath() - numCharsCopied is redundant? The return value could have been (idx + component [upto] .length() - start) equals() line 143 - perhaps use only one index rather than both index j and i ? Also, CPs are more likely to be different at the end, than in the start (e.g further away from the root than the dimension) - perhaps iterate in reverse (up the tree)?
        Hide
        Shai Erera added a comment -

        copyFullPath() - numCharsCopied is redundant?

        You're right. Fixed

        perhaps use only one index rather than both index j and i?

        Yup. It was a leftover from the previous patch, when I had start offset in CP.

        Also, CPs are more likely to be different at the end, than in the start (e.g further away from the root than the dimension) - perhaps iterate in reverse (up the tree)

        Done.

        Show
        Shai Erera added a comment - copyFullPath() - numCharsCopied is redundant? You're right. Fixed perhaps use only one index rather than both index j and i? Yup. It was a leftover from the previous patch, when I had start offset in CP. Also, CPs are more likely to be different at the end, than in the start (e.g further away from the root than the dimension) - perhaps iterate in reverse (up the tree) Done.
        Hide
        Gilad Barkai added a comment -

        Magnificent, +1 for commit.

        Show
        Gilad Barkai added a comment - Magnificent, +1 for commit.
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Shai Erera
        http://svn.apache.org/viewvc?view=revision&revision=1429570

        LUCENE-4659: Cleanup CategoryPath

        Show
        Commit Tag Bot added a comment - [trunk commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1429570 LUCENE-4659 : Cleanup CategoryPath
        Hide
        Shai Erera added a comment -

        Clarified more CatPath.components jdocs. Committed to trunk and 4x.

        Show
        Shai Erera added a comment - Clarified more CatPath.components jdocs. Committed to trunk and 4x.
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Shai Erera
        http://svn.apache.org/viewvc?view=revision&revision=1429573

        LUCENE-4659: Cleanup CategoryPath

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1429573 LUCENE-4659 : Cleanup CategoryPath

          People

          • Assignee:
            Shai Erera
            Reporter:
            Shai Erera
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development