Lucene - Core / LUCENE-4703

Add basic tool to print some summary stats about your taxonomy index

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2, Trunk
    • Component/s: modules/facet
    • Labels: None
    • Lucene Fields: New

      Description

      I built a Wikipedia index w/ 9 dimensions, but I don't know how many ords each dimension contributes, how many immediate children there are under each dim, etc.

      Attachments

      1. LUCENE-4703.patch (3 kB, Michael McCandless)
      2. LUCENE-4703.patch (4 kB, Michael McCandless)

        Activity

        Michael McCandless added a comment -

        Patch, it seems to work ... it produces output like this:

        2508576 categories.
        Dimension 0: refCount; 495 immediate children; 496 categories
        Dimension 1: subSubSectionCount; 304 immediate children; 305 categories
        Dimension 2: subSectionCount; 277 immediate children; 278 categories
        Dimension 3: sectionCount; 282 immediate children; 283 categories
        Dimension 4: imageCount; 332 immediate children; 333 categories
        Dimension 5: categories; 2339724 immediate children; 2339725 categories
        Dimension 6: characterCount; 4 immediate children; 1086 categories
        Dimension 7: username; 162789 immediate children; 162790 categories
        Dimension 8: Date; 12 immediate children; 3279 categories
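
The numbers above are consistent with each other: the per-dimension category counts sum to 2508575, and adding one for the taxonomy root gives the reported total of 2508576. A minimal sketch of how such per-dimension counts could be derived is shown below. It does not use the Lucene API or reproduce the patch; it assumes the taxonomy is available as a plain parent array where ordinal 0 is the root and every parent has a smaller ordinal than its children. The class name `TaxonomyStatsSketch`, the `DimStats` holder and the sample data are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class TaxonomyStatsSketch {

  /** Summary for one dimension (an immediate child of the root). Hypothetical holder class. */
  static final class DimStats {
    final String label;
    final int immediateChildren;
    final int categories; // the dimension itself plus all of its descendants

    DimStats(String label, int immediateChildren, int categories) {
      this.label = label;
      this.immediateChildren = immediateChildren;
      this.categories = categories;
    }

    @Override
    public String toString() {
      return "/" + label + "; " + immediateChildren + " immediate children; "
          + categories + " categories";
    }
  }

  /**
   * parents[ord] is the parent ordinal of ord; ordinal 0 is the root.
   * Assumption of this sketch: a parent always has a smaller ordinal than its children.
   */
  static List<DimStats> summarize(int[] parents, String[] labels) {
    int n = parents.length;
    int[] numChildren = new int[n];
    int[] subtreeSize = new int[n];
    // Walk backwards so every child is folded into its parent before the parent is visited.
    for (int ord = n - 1; ord > 0; ord--) {
      subtreeSize[ord]++;                              // count the category itself
      numChildren[parents[ord]]++;                     // one more immediate child
      subtreeSize[parents[ord]] += subtreeSize[ord];   // roll the subtree up
    }
    List<DimStats> dims = new ArrayList<>();
    for (int ord = 1; ord < n; ord++) {
      if (parents[ord] == 0) {
        dims.add(new DimStats(labels[ord], numChildren[ord], subtreeSize[ord]));
      }
    }
    return dims;
  }

  public static void main(String[] args) {
    // Tiny made-up taxonomy: Date/2012/01, Date/2013, username/mike, username/shai.
    int[] parents = {-1, 0, 1, 2, 1, 0, 5, 5};
    String[] labels = {"", "Date", "2012", "01", "2013", "username", "mike", "shai"};
    System.out.println(parents.length + " categories.");
    for (DimStats d : summarize(parents, labels)) {
      System.out.println(d);
    }
  }
}
```

For the sample data this reports 8 categories in total, with Date at 2 immediate children and 4 categories, and username at 2 immediate children and 3 categories (4 + 3 + 1 for the root = 8).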
        
        Shai Erera added a comment -

        Looks good!

        How about it printed info like this (I think that Dimension 0,1,2 is redundant):

        /refCount; 495 immediate children; 496 categories
        /subSubSectionCount; 304 immediate children; 305 categories
        

        Also, would it be interesting if it printed info recursively, e.g. you'd get a breakdown of characterCount too? Perhaps as a -recursive op?

        Shai Erera added a comment -

        Hmm ... just a crazy idea ... what if TaxoReader had a getStats() method which returned you a tree-like structure? Something like:

        /** Returns a tree-like structure of the taxonomy with statistics such as 
         *  number of immediate children and number of categories overall.
         */
        public TaxonomyNode getStats() throws IOException;
        

        The returned TaxonomyNode would then denote the root and contain members like "label", "numCategories" and "children", where "children" is a List<TaxonomyNode> ...
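
To make the proposal concrete, here is a rough sketch of what such a node type might look like. It is not an existing Lucene class: `TaxonomyNode` is only the name suggested in this comment, and the `build` helper, the parent-array input and the sample labels are assumptions made for illustration. The sketch also assumes ordinal 0 is the root and that parents always have smaller ordinals than their children.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tree node carrying the members suggested above. */
public class TaxonomyNode {
  final String label;
  final List<TaxonomyNode> children = new ArrayList<>();
  int numCategories = 1; // this node plus all of its descendants

  TaxonomyNode(String label) {
    this.label = label;
  }

  /**
   * Builds the tree from a parent array (parents[ord] is the parent of ord).
   * Assumptions of this sketch: the array is non-empty, ordinal 0 is the root,
   * and every parent has a smaller ordinal than its children.
   */
  static TaxonomyNode build(int[] parents, String[] labels) {
    TaxonomyNode[] nodes = new TaxonomyNode[parents.length];
    for (int ord = 0; ord < parents.length; ord++) {
      nodes[ord] = new TaxonomyNode(labels[ord]);
      if (ord > 0) {
        nodes[parents[ord]].children.add(nodes[ord]); // parent already exists
      }
    }
    // Roll subtree sizes up from the highest ordinal to the root.
    for (int ord = parents.length - 1; ord > 0; ord--) {
      nodes[parents[ord]].numCategories += nodes[ord].numCategories;
    }
    return nodes[0];
  }

  public static void main(String[] args) {
    int[] parents = {-1, 0, 1, 2, 1, 0, 5, 5};
    String[] labels = {"", "Date", "2012", "01", "2013", "username", "mike", "shai"};
    TaxonomyNode root = build(parents, labels);
    for (TaxonomyNode dim : root.children) {
      System.out.println("/" + dim.label + "; " + dim.children.size()
          + " immediate children; " + dim.numCategories + " categories");
    }
  }
}
```

With a structure like this, a test could assert directly on `children.size()` and `numCategories` instead of parsing printed output.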

        Michael McCandless added a comment -

        > How about it printed info like this (I think that Dimension 0,1,2 is redundant):

        Good, I'll fix!

        > Also, would it be interesting if it printed info recursively, e.g. you'd get a breakdown of characterCount too? Perhaps as a -recursive op?

        This would be cool ... ok I added a -printTree option!

        > Hmm ... just a crazy idea ... what if TaxoReader had a getStats() method which returned you a tree-like structure?

        That sounds cool ... but maybe wait and do this later? I think this tool is mostly for debugging/diagnostics/optimizing (e.g. trying to decide NO_PARENTS vs ALL_PARENTS).
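
As an illustration of the recursive breakdown discussed here, the sketch below prints every category indented under its parent. It is not the code from the patch and does not use the Lucene API: the class name `PrintTreeSketch`, the output layout, the parent-array input and the sample labels are assumptions. Ordinal 0 is taken to be the root, and labels hold only the last path component of each category.

```java
import java.util.ArrayList;
import java.util.List;

public class PrintTreeSketch {

  /** Renders the whole taxonomy below the root, one category per line. */
  static String render(int[] parents, String[] labels) {
    // Collect the immediate children of every ordinal, in ordinal order.
    List<List<Integer>> children = new ArrayList<>();
    for (int ord = 0; ord < parents.length; ord++) {
      children.add(new ArrayList<>());
    }
    for (int ord = 1; ord < parents.length; ord++) {
      children.get(parents[ord]).add(ord);
    }
    StringBuilder sb = new StringBuilder();
    appendChildren(0, "", children, labels, sb);
    return sb.toString();
  }

  /** Depth-first walk: print each child, then recurse with a deeper indent. */
  private static void appendChildren(int ord, String indent, List<List<Integer>> children,
                                     String[] labels, StringBuilder sb) {
    for (int child : children.get(ord)) {
      sb.append(indent).append('/').append(labels[child]).append('\n');
      appendChildren(child, indent + "  ", children, labels, sb);
    }
  }

  public static void main(String[] args) {
    int[] parents = {-1, 0, 1, 2, 1, 0, 5, 5};
    String[] labels = {"", "Date", "2012", "01", "2013", "username", "mike", "shai"};
    System.out.print(render(parents, labels));
  }
}
```

For a dimension such as characterCount, which has only 4 immediate children but 1086 categories, a view like this would show where the remaining categories sit.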

        Michael McCandless added a comment -

        New patch w/ the changes.

        I also moved it to o.a.l.facet.util (org.apache.lucene.facet.util), and renamed it to PrintTaxonomyStats.

        I think the tool is ready ... but I need to make a test case ...

        Shai Erera added a comment -

        I think it's ready. If you had the TaxonomyNode API, writing the test would be easier. But I'm ok w/ deferring that.
        For the record, what I had in mind is a taxonomy-browse service, which lets you browse the taxonomy via this tree-like structure. But it can be done separately.

        Commit Tag Bot added a comment -

        [trunk commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1436476

        LUCENE-4703: add simple tool to print summary stats of the facet taxonomy index

        Michael McCandless added a comment -

        > For the record, what I had in mind is a taxonomy-browse service, which lets you browse the taxonomy via this tree-like structure. But it can be done separately.

        I think this would be useful ...

        Michael McCandless added a comment -

        Thanks Shai!

        Commit Tag Bot added a comment -

        [branch_4x commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1436477

        LUCENE-4703: add simple tool to print summary stats of the facet taxonomy index

        Uwe Schindler added a comment -

        Closed after release.


          People

          • Assignee:
            Unassigned
            Reporter:
            Michael McCandless