Details

    • Improvement
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • None
    • 4.1, 6.0
    • modules/facet
    • None
    • New, Patch Available

    Description

      TaxoReader exposes two structures which provide information about a categories parent/childs/siblings: ParentArray and ChildrenArrays. ChildrenArrays are derived (i.e. created) from ParentArray.

      I propose to consolidate all that into one API ParentInfo, or CategoryTreeInfo (a better name?) which will provide the same information, only from one object. So instead of making these calls:

      int[] parents = taxoReader.getParentArray();
      int[] youngestChilds = taxoReader.getChildrenArrays().getYoungestChildArray();
      int[] olderSiblings = taxoReader.getChildrenArrays().getOlderSiblingArray();
      

      one would make these calls:

      int[] parents = taxoReader.getParentInfo().parents();
      int[] youngestChilds = taxoReader.getParentInfo().youngestChilds();
      int[] olderSiblings = taxoReader.getParentInfo().olderSiblings();
      

      Not a big change, just consolidate more code into one logical place. All of these arrays will continue to be lazily allocated.

      Attachments

        1. LUCENE-4565.patch
          49 kB
          Shai Erera

        Activity

          shaie Shai Erera added a comment -

          I was thinking about a name for this ParentInfo (I don't like 'info' much), and here are a couple of suggestions: Family, TaxonomyTree, Ancestree and ParallelTaxonomyArray (thanks Robert for the analogy to ParallelPostingsArray !).

          So far I like ParallelTaxonomyArray, but if anyone has a better suggestion, please speak up !

          shaie Shai Erera added a comment - I was thinking about a name for this ParentInfo (I don't like 'info' much), and here are a couple of suggestions: Family, TaxonomyTree, Ancestree and ParallelTaxonomyArray (thanks Robert for the analogy to ParallelPostingsArray !). So far I like ParallelTaxonomyArray, but if anyone has a better suggestion, please speak up !
          gilad Gilad Barkai added a comment -

          FamilyTree?
          Genealogy?
          How about Stemma ?

          gilad Gilad Barkai added a comment - FamilyTree? Genealogy? How about Stemma ?
          shaie Shai Erera added a comment -

          Patch replaces ParentArray and ChildrenArrays by ParallelTaxonomyArrays. Most of the changes are due to this API change, but nothing conceptually changed. You still get 3 arrays for traversing the taxonomy tree, only now from one object.

          All tests pass.

          Gilad, I chose ParallelTaxoArrays because in the end this object returns you arrays and nothing else. The ParallelArrays name already exists in Lucene (ParallelPostingArrays), and so I've followed the same convention.

          If one day we'll change this object to provide service methods for getting parent, child and sibling, then a better name (like FamilyTree) might be more adequate.

          shaie Shai Erera added a comment - Patch replaces ParentArray and ChildrenArrays by ParallelTaxonomyArrays. Most of the changes are due to this API change, but nothing conceptually changed. You still get 3 arrays for traversing the taxonomy tree, only now from one object. All tests pass. Gilad, I chose ParallelTaxoArrays because in the end this object returns you arrays and nothing else. The ParallelArrays name already exists in Lucene (ParallelPostingArrays), and so I've followed the same convention. If one day we'll change this object to provide service methods for getting parent , child and sibling , then a better name (like FamilyTree) might be more adequate.
          gilad Gilad Barkai added a comment -

          Patch's OK, some comments:

          • The notion of youngestChild and olderSibling is important. I would not have removed the 'older/youngest' parts. Especially not from the TaxoArray's API. But would rather have them everywhere.
          • In DirectoryTaxonomyWriter, the loop over the termEnums had been changed to "reuse" the TermsEnum and DocsEnum - only they are not reused. I got confused trying to find the reusability. Is there an internal java expert optimization for such cases?
          gilad Gilad Barkai added a comment - Patch's OK, some comments: The notion of youngestChild and olderSibling is important. I would not have removed the 'older/youngest' parts. Especially not from the TaxoArray's API. But would rather have them everywhere. In DirectoryTaxonomyWriter, the loop over the termEnums had been changed to "reuse" the TermsEnum and DocsEnum - only they are not reused. I got confused trying to find the reusability. Is there an internal java expert optimization for such cases?
          shaie Shai Erera added a comment -

          The notion of youngestChild and olderSibling is important. I would not have removed the 'older/youngest' parts. Especially not from the TaxoArray's AP

          I thought that it's over-verbosing to put them in the method names. Rather, I think that good javadocs are what's needed here. It's an API that one reads one time usually.

          In DirectoryTaxonomyWriter, the loop over the termEnums had been changed to "reuse" the TermsEnum and DocsEnum - only they are not reused.

          They are reused. Look at the calls below - according to the javadocs, passing the TermsEnum/DocsEnum to the respective methods reuses them.

          termsEnum = terms.iterator(termsEnum);
          and
          docs = termsEnum.docs(null, docs, 0 /* freqs not required */);
          
          shaie Shai Erera added a comment - The notion of youngestChild and olderSibling is important. I would not have removed the 'older/youngest' parts. Especially not from the TaxoArray's AP I thought that it's over-verbosing to put them in the method names. Rather, I think that good javadocs are what's needed here. It's an API that one reads one time usually. In DirectoryTaxonomyWriter, the loop over the termEnums had been changed to "reuse" the TermsEnum and DocsEnum - only they are not reused. They are reused. Look at the calls below - according to the javadocs, passing the TermsEnum/DocsEnum to the respective methods reuses them. termsEnum = terms.iterator(termsEnum); and docs = termsEnum.docs( null , docs, 0 /* freqs not required */ );
          gilad Gilad Barkai added a comment -

          I thought that it's over-verbosing to put them in the method names. Rather, I think that good javadocs are what's needed here. It's an API that one reads one time usually.

          .children() looks like it's getting all children, it does not. .siblings() looks like it returns all siblings, which it does not. I think a good JavaDoc is a blessing, but it's not a penalty in making the code document itself - which it did till now. I see no reason to change that.

          They are reused

          My bad! I need coffee...

          gilad Gilad Barkai added a comment - I thought that it's over-verbosing to put them in the method names. Rather, I think that good javadocs are what's needed here. It's an API that one reads one time usually. .children() looks like it's getting all children, it does not. .siblings() looks like it returns all siblings, which it does not. I think a good JavaDoc is a blessing, but it's not a penalty in making the code document itself - which it did till now. I see no reason to change that. They are reused My bad! I need coffee...
          shaie Shai Erera added a comment -

          I don't think that youngestChild() is clearer than children() + the javadocs. Only contains more characters. Naming is hard, I don't think that the existing names are good. Maybe to those who developed the algorithm and know what it does. But I didn't understand the word 'youngest' until I saw an example on paper.

          Therefore whatever we call this method, I think that someone will need to read the javadocs (unless we call it childrenThatWereAddedLastToTheTaxonomy). I'll proceed with the short names. They're not wrong, they do return one child and one sibling (and one parent) of every category. Which child/sibling? Read the javadocs.

          Maybe tomorrow someone will come up w/ a better structure, where children[i] would actually mean oldestChild (and youngestSibling)? I don't know. So why commit to it in the API?

          shaie Shai Erera added a comment - I don't think that youngestChild() is clearer than children() + the javadocs. Only contains more characters. Naming is hard, I don't think that the existing names are good. Maybe to those who developed the algorithm and know what it does. But I didn't understand the word 'youngest' until I saw an example on paper. Therefore whatever we call this method, I think that someone will need to read the javadocs (unless we call it childrenThatWereAddedLastToTheTaxonomy). I'll proceed with the short names. They're not wrong, they do return one child and one sibling (and one parent) of every category. Which child/sibling? Read the javadocs. Maybe tomorrow someone will come up w/ a better structure, where children [i] would actually mean oldestChild (and youngestSibling)? I don't know. So why commit to it in the API?
          shaie Shai Erera added a comment - - edited

          In fact, now that I think of it, one should not really care which child/sibling we return - that's implementation details. All that matters is how to traverse the taxonomy tree. You start from ROOT (ordinal=0), ask for children[0], and then depends if you want to do DFS/BFS you call siblings[children[0]] or children[children[0]] .. what does it matter if the child is the youngest/oldest?

          shaie Shai Erera added a comment - - edited In fact, now that I think of it, one should not really care which child/sibling we return - that's implementation details. All that matters is how to traverse the taxonomy tree. You start from ROOT (ordinal=0), ask for children [0] , and then depends if you want to do DFS/BFS you call siblings[children [0] ] or children[children [0] ] .. what does it matter if the child is the youngest/oldest?
          shaie Shai Erera added a comment -

          I've decided to leave the API with the short names + javadocs. I thought about removing any mentions about youngest/oldest because that's implementation detail and might change one day. But I didn't.

          What should really matter to one is how to use the information in these 3 arrays to traverse the tree, so I added some explanation in the classes javadocs.

          Committed the changes to trunk and 4x.

          shaie Shai Erera added a comment - I've decided to leave the API with the short names + javadocs. I thought about removing any mentions about youngest/oldest because that's implementation detail and might change one day. But I didn't. What should really matter to one is how to use the information in these 3 arrays to traverse the tree, so I added some explanation in the classes javadocs. Committed the changes to trunk and 4x.
          commit-tag-bot Commit Tag Bot added a comment -

          [branch_4x commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1417893

          LUCENE-4565: Consolidate ParentArray and ChildrenArrays into ParallelTaxonomyArrays

          commit-tag-bot Commit Tag Bot added a comment - [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1417893 LUCENE-4565 : Consolidate ParentArray and ChildrenArrays into ParallelTaxonomyArrays
          commit-tag-bot Commit Tag Bot added a comment -

          [trunk commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1417889

          LUCENE-4565: Consolidate ParentArray and ChildrenArrays into ParallelTaxonomyArrays

          commit-tag-bot Commit Tag Bot added a comment - [trunk commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1417889 LUCENE-4565 : Consolidate ParentArray and ChildrenArrays into ParallelTaxonomyArrays
          shaie Shai Erera added a comment -

          One data point about using a class with methods like getParent(), getChild() etc. I modified TestDirTaxoWriter.testConcurrency to check the parents too. I originally called TaxoReader.getParent() and ran with 100 iterations - took 160 seconds to complete all iters. I then modififed the code to get the parents[] array once, and then do parents[ord] instead, time dropped to 60 seconds.

          This is not a benchmark (TaxoReader.getParent() is not optimized to be inlined), but still a data point to refer to if we'll ever want to consider that.

          shaie Shai Erera added a comment - One data point about using a class with methods like getParent(), getChild() etc. I modified TestDirTaxoWriter.testConcurrency to check the parents too. I originally called TaxoReader.getParent() and ran with 100 iterations - took 160 seconds to complete all iters. I then modififed the code to get the parents[] array once, and then do parents [ord] instead, time dropped to 60 seconds. This is not a benchmark (TaxoReader.getParent() is not optimized to be inlined), but still a data point to refer to if we'll ever want to consider that.
          commit-tag-bot Commit Tag Bot added a comment -

          [branch_4x commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1417893

          LUCENE-4565: Consolidate ParentArray and ChildrenArrays into ParallelTaxonomyArrays

          commit-tag-bot Commit Tag Bot added a comment - [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1417893 LUCENE-4565 : Consolidate ParentArray and ChildrenArrays into ParallelTaxonomyArrays
          tomoko Tomoko Uchida added a comment -

          This issue was moved to GitHub issue: #5631.

          tomoko Tomoko Uchida added a comment - This issue was moved to GitHub issue: #5631 .

          People

            shaie Shai Erera
            shaie Shai Erera
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: