Lucene - Core
  1. Lucene - Core
  2. LUCENE-4565

Simplify TaxoReader ParentArray/ChildrenArrays

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1, 5.0
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      TaxoReader exposes two structures which provide information about a categories parent/childs/siblings: ParentArray and ChildrenArrays. ChildrenArrays are derived (i.e. created) from ParentArray.

      I propose to consolidate all that into one API ParentInfo, or CategoryTreeInfo (a better name?) which will provide the same information, only from one object. So instead of making these calls:

      int[] parents = taxoReader.getParentArray();
      int[] youngestChilds = taxoReader.getChildrenArrays().getYoungestChildArray();
      int[] olderSiblings = taxoReader.getChildrenArrays().getOlderSiblingArray();
      

      one would make these calls:

      int[] parents = taxoReader.getParentInfo().parents();
      int[] youngestChilds = taxoReader.getParentInfo().youngestChilds();
      int[] olderSiblings = taxoReader.getParentInfo().olderSiblings();
      

      Not a big change, just consolidate more code into one logical place. All of these arrays will continue to be lazily allocated.

        Activity

        Hide
        Shai Erera added a comment -

        I was thinking about a name for this ParentInfo (I don't like 'info' much), and here are a couple of suggestions: Family, TaxonomyTree, Ancestree and ParallelTaxonomyArray (thanks Robert for the analogy to ParallelPostingsArray !).

        So far I like ParallelTaxonomyArray, but if anyone has a better suggestion, please speak up !

        Show
        Shai Erera added a comment - I was thinking about a name for this ParentInfo (I don't like 'info' much), and here are a couple of suggestions: Family, TaxonomyTree, Ancestree and ParallelTaxonomyArray (thanks Robert for the analogy to ParallelPostingsArray !). So far I like ParallelTaxonomyArray, but if anyone has a better suggestion, please speak up !
        Hide
        Gilad Barkai added a comment -

        FamilyTree?
        Genealogy?
        How about Stemma ?

        Show
        Gilad Barkai added a comment - FamilyTree? Genealogy? How about Stemma ?
        Hide
        Shai Erera added a comment -

        Patch replaces ParentArray and ChildrenArrays by ParallelTaxonomyArrays. Most of the changes are due to this API change, but nothing conceptually changed. You still get 3 arrays for traversing the taxonomy tree, only now from one object.

        All tests pass.

        Gilad, I chose ParallelTaxoArrays because in the end this object returns you arrays and nothing else. The ParallelArrays name already exists in Lucene (ParallelPostingArrays), and so I've followed the same convention.

        If one day we'll change this object to provide service methods for getting parent, child and sibling, then a better name (like FamilyTree) might be more adequate.

        Show
        Shai Erera added a comment - Patch replaces ParentArray and ChildrenArrays by ParallelTaxonomyArrays. Most of the changes are due to this API change, but nothing conceptually changed. You still get 3 arrays for traversing the taxonomy tree, only now from one object. All tests pass. Gilad, I chose ParallelTaxoArrays because in the end this object returns you arrays and nothing else. The ParallelArrays name already exists in Lucene (ParallelPostingArrays), and so I've followed the same convention. If one day we'll change this object to provide service methods for getting parent , child and sibling , then a better name (like FamilyTree) might be more adequate.
        Hide
        Gilad Barkai added a comment -

        Patch's OK, some comments:

        • The notion of youngestChild and olderSibling is important. I would not have removed the 'older/youngest' parts. Especially not from the TaxoArray's API. But would rather have them everywhere.
        • In DirectoryTaxonomyWriter, the loop over the termEnums had been changed to "reuse" the TermsEnum and DocsEnum - only they are not reused. I got confused trying to find the reusability. Is there an internal java expert optimization for such cases?
        Show
        Gilad Barkai added a comment - Patch's OK, some comments: The notion of youngestChild and olderSibling is important. I would not have removed the 'older/youngest' parts. Especially not from the TaxoArray's API. But would rather have them everywhere. In DirectoryTaxonomyWriter, the loop over the termEnums had been changed to "reuse" the TermsEnum and DocsEnum - only they are not reused. I got confused trying to find the reusability. Is there an internal java expert optimization for such cases?
        Hide
        Shai Erera added a comment -

        The notion of youngestChild and olderSibling is important. I would not have removed the 'older/youngest' parts. Especially not from the TaxoArray's AP

        I thought that it's over-verbosing to put them in the method names. Rather, I think that good javadocs are what's needed here. It's an API that one reads one time usually.

        In DirectoryTaxonomyWriter, the loop over the termEnums had been changed to "reuse" the TermsEnum and DocsEnum - only they are not reused.

        They are reused. Look at the calls below - according to the javadocs, passing the TermsEnum/DocsEnum to the respective methods reuses them.

        termsEnum = terms.iterator(termsEnum);
        and
        docs = termsEnum.docs(null, docs, 0 /* freqs not required */);
        
        Show
        Shai Erera added a comment - The notion of youngestChild and olderSibling is important. I would not have removed the 'older/youngest' parts. Especially not from the TaxoArray's AP I thought that it's over-verbosing to put them in the method names. Rather, I think that good javadocs are what's needed here. It's an API that one reads one time usually. In DirectoryTaxonomyWriter, the loop over the termEnums had been changed to "reuse" the TermsEnum and DocsEnum - only they are not reused. They are reused. Look at the calls below - according to the javadocs, passing the TermsEnum/DocsEnum to the respective methods reuses them. termsEnum = terms.iterator(termsEnum); and docs = termsEnum.docs( null , docs, 0 /* freqs not required */);
        Hide
        Gilad Barkai added a comment -

        I thought that it's over-verbosing to put them in the method names. Rather, I think that good javadocs are what's needed here. It's an API that one reads one time usually.

        .children() looks like it's getting all children, it does not. .siblings() looks like it returns all siblings, which it does not. I think a good JavaDoc is a blessing, but it's not a penalty in making the code document itself - which it did till now. I see no reason to change that.

        They are reused

        My bad! I need coffee...

        Show
        Gilad Barkai added a comment - I thought that it's over-verbosing to put them in the method names. Rather, I think that good javadocs are what's needed here. It's an API that one reads one time usually. .children() looks like it's getting all children, it does not. .siblings() looks like it returns all siblings, which it does not. I think a good JavaDoc is a blessing, but it's not a penalty in making the code document itself - which it did till now. I see no reason to change that. They are reused My bad! I need coffee...
        Hide
        Shai Erera added a comment -

        I don't think that youngestChild() is clearer than children() + the javadocs. Only contains more characters. Naming is hard, I don't think that the existing names are good. Maybe to those who developed the algorithm and know what it does. But I didn't understand the word 'youngest' until I saw an example on paper.

        Therefore whatever we call this method, I think that someone will need to read the javadocs (unless we call it childrenThatWereAddedLastToTheTaxonomy). I'll proceed with the short names. They're not wrong, they do return one child and one sibling (and one parent) of every category. Which child/sibling? Read the javadocs.

        Maybe tomorrow someone will come up w/ a better structure, where children[i] would actually mean oldestChild (and youngestSibling)? I don't know. So why commit to it in the API?

        Show
        Shai Erera added a comment - I don't think that youngestChild() is clearer than children() + the javadocs. Only contains more characters. Naming is hard, I don't think that the existing names are good. Maybe to those who developed the algorithm and know what it does. But I didn't understand the word 'youngest' until I saw an example on paper. Therefore whatever we call this method, I think that someone will need to read the javadocs (unless we call it childrenThatWereAddedLastToTheTaxonomy). I'll proceed with the short names. They're not wrong, they do return one child and one sibling (and one parent) of every category. Which child/sibling? Read the javadocs. Maybe tomorrow someone will come up w/ a better structure, where children [i] would actually mean oldestChild (and youngestSibling)? I don't know. So why commit to it in the API?
        Hide
        Shai Erera added a comment - - edited

        In fact, now that I think of it, one should not really care which child/sibling we return - that's implementation details. All that matters is how to traverse the taxonomy tree. You start from ROOT (ordinal=0), ask for children[0], and then depends if you want to do DFS/BFS you call siblings[children[0]] or children[children[0]] .. what does it matter if the child is the youngest/oldest?

        Show
        Shai Erera added a comment - - edited In fact, now that I think of it, one should not really care which child/sibling we return - that's implementation details. All that matters is how to traverse the taxonomy tree. You start from ROOT (ordinal=0), ask for children [0] , and then depends if you want to do DFS/BFS you call siblings[children [0] ] or children[children [0] ] .. what does it matter if the child is the youngest/oldest?
        Hide
        Shai Erera added a comment -

        I've decided to leave the API with the short names + javadocs. I thought about removing any mentions about youngest/oldest because that's implementation detail and might change one day. But I didn't.

        What should really matter to one is how to use the information in these 3 arrays to traverse the tree, so I added some explanation in the classes javadocs.

        Committed the changes to trunk and 4x.

        Show
        Shai Erera added a comment - I've decided to leave the API with the short names + javadocs. I thought about removing any mentions about youngest/oldest because that's implementation detail and might change one day. But I didn't. What should really matter to one is how to use the information in these 3 arrays to traverse the tree, so I added some explanation in the classes javadocs. Committed the changes to trunk and 4x.
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Shai Erera
        http://svn.apache.org/viewvc?view=revision&revision=1417893

        LUCENE-4565: Consolidate ParentArray and ChildrenArrays into ParallelTaxonomyArrays

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1417893 LUCENE-4565 : Consolidate ParentArray and ChildrenArrays into ParallelTaxonomyArrays
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Shai Erera
        http://svn.apache.org/viewvc?view=revision&revision=1417889

        LUCENE-4565: Consolidate ParentArray and ChildrenArrays into ParallelTaxonomyArrays

        Show
        Commit Tag Bot added a comment - [trunk commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1417889 LUCENE-4565 : Consolidate ParentArray and ChildrenArrays into ParallelTaxonomyArrays
        Hide
        Shai Erera added a comment -

        One data point about using a class with methods like getParent(), getChild() etc. I modified TestDirTaxoWriter.testConcurrency to check the parents too. I originally called TaxoReader.getParent() and ran with 100 iterations - took 160 seconds to complete all iters. I then modififed the code to get the parents[] array once, and then do parents[ord] instead, time dropped to 60 seconds.

        This is not a benchmark (TaxoReader.getParent() is not optimized to be inlined), but still a data point to refer to if we'll ever want to consider that.

        Show
        Shai Erera added a comment - One data point about using a class with methods like getParent(), getChild() etc. I modified TestDirTaxoWriter.testConcurrency to check the parents too. I originally called TaxoReader.getParent() and ran with 100 iterations - took 160 seconds to complete all iters. I then modififed the code to get the parents[] array once, and then do parents [ord] instead, time dropped to 60 seconds. This is not a benchmark (TaxoReader.getParent() is not optimized to be inlined), but still a data point to refer to if we'll ever want to consider that.
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Shai Erera
        http://svn.apache.org/viewvc?view=revision&revision=1417893

        LUCENE-4565: Consolidate ParentArray and ChildrenArrays into ParallelTaxonomyArrays

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1417893 LUCENE-4565 : Consolidate ParentArray and ChildrenArrays into ParallelTaxonomyArrays

          People

          • Assignee:
            Shai Erera
            Reporter:
            Shai Erera
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development