Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      A spin-off from LUCENE-3079. We should define a few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implementing FST-based caches, etc.).

      Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.

      We've also done some preliminary work on extending Benchmark for faceting, so I'll attach it here as well.

      We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

      1. CorpusGenerator.java
        7 kB
        Toke Eskildsen
      2. LUCENE-3262.patch
        53 kB
        Doron Cohen
      3. LUCENE-3262.patch
        50 kB
        Doron Cohen
      4. LUCENE-3262.patch
        46 kB
        Doron Cohen
      5. TestPerformanceHack.java
        14 kB
        Toke Eskildsen

        Activity

        Toke Eskildsen added a comment -

        I've attached a second shot at faceting performance testing. It separates the taxonomy generation into a CorpusGenerator (maybe similar to the RandomTaxonomyWriter that Robert calls for in LUCENE-3264?).

        Proper setup of faceting tweaks for the new faceting module has not been done at all, and it is not something I find myself qualified for.

        Doron Cohen added a comment -

        I am working on a patch for this, much along the lines of the Solr benchmark patch in SOLR-2646.
        Currently the direction is:

        • Add to PerfRunData:
          • Taxonomy Directory
          • Taxonomy Writer
          • Taxonomy Reader
        • Add tasks for manipulating facets and taxonomies:
          • create/open/commit/close Taxonomy Index
          • open/close Taxonomy Reader
          • AddDocWithFacets
        • FacetDocMaker will also build the categories into the document
        • FacetSource will bring back categories to be added to current doc
        • ReadTask will be extended to also support faceted search.
          This is different from the Solr benchmark approach, where a SolrSearchTask is not extending ReadTask but rather extending PerfTask.
          Not sure yet if this is the way to go - still work to be done here.

        Should have a start patch in a day or two.

        Doron Cohen added a comment -

        Patch (3x) with working facets indexing benchmark.
        It follows the outline above, except that:

        • there is no FacetDocMaker - only FacetSource
        • there is no AddDocWithFacet - instead, AddDoc takes an additional config param: with.facet

        'ant run-task -Dtask.alg=conf/facets.alg' will run an algorithm that indexes facets.

        Not ready to commit yet - need some testing and docs. Also, only covers indexing for now, though perhaps search with facets can go in a separate issue.
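        To make the entry point concrete, a minimal facets indexing .alg might look like the sketch below. The with.facet and facet.source property names and the RandomFacetSource class come from this thread; the taxonomy task names (CreateTaxonomyIndex, CloseTaxonomyIndex) and the overall file contents are assumptions for illustration, not the actual conf/facets.alg from the patch:

```
# content and facet sources (facet.source set explicitly, per the review comments)
content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
facet.source=org.apache.lucene.benchmark.byTask.feeds.RandomFacetSource
with.facet=true

ResetSystemErase
CreateIndex
CreateTaxonomyIndex
{ "MAddDocs" AddDoc > : 400
CloseTaxonomyIndex
CloseIndex

RepSumByPref MAddDocs
```

        ResetSystemErase, CreateIndex, CloseIndex, and RepSumByPref are standard benchmark tasks; only the facet-related lines are new here.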

        Shai Erera added a comment -

        Patch looks good! I have a couple of initial comments:

        • facets.alg: since these .alg files often serve as examples, I think it would be good if it declared facet.source (set to random) explicitly.
        • OpenTaxonomyReaderTask: I see that since PerfRunData incRef() the incoming taxonomy, you decRef(). I also see that setIndexReader behaves the same way. But I find it confusing. Personally, since this is not an application, I don't think we should 'hold a reference to IR/LTR just in case the one who set it closes it'. But if we do that, can we at least document on setIR/LTR that this is the case? I can certainly see myself opening IR/LTR, setting on PerfRunData without decRef()/close(). It would not occur to me that I should ...
        • The abstraction of ItemSource is nice. But its javadocs still mention content.source.*. Since we're not committed to backwards compatibility in benchmark, and in the interest of clarity, perhaps we should rename these properties to item.source.*?
        • ItemSource.resetInputs has a @SuppressWarnings("unused") – is it a leftover from when it was private?
        • In PerfRunData ctor you do a Class.forName using the String name of RandomFacetSource. Why not use RandomFacetSource.class.getName()?

        Looks very good. Now with FacetSource we can generate facets per the case we want to test (dense hierarchies, Zipf'ian ...)

        Doron Cohen added a comment -

        Thanks for reviewing Shai!

        • facets.alg: I agree, and added that specific facet source setting. I also modified the algorithm to have several rounds, some with facets and some without, so the effect of adding facets on performance is measured.
        • PFD.readers.incRef() - once the task (e.g. OpenReaderTask) has opened the reader, that task is gone and the reader really is maintained by the PFD; therefore IMO it is important that the PFD maintains its ref count as it does today. I agree that more documentation for this can help, and added something in this spirit to both setReader()'s:
          /**
           * Set the taxonomy reader. Takes ownership of that taxonomy reader, that is,
           * internally performs taxoReader.incRef().
           * @param taxoReader The taxonomy reader to set.
           */
          public synchronized void setTaxonomyReader(TaxonomyReader taxoReader) throws IOException {
        
          }
        
          /**
           * Set the index reader. Takes ownership of that index reader, that is,
           * internally performs indexReader.incRef().
           * @param indexReader The indexReader to set.
           */
          public synchronized void setIndexReader(IndexReader indexReader) throws IOException {
          }
        
        • ItemSource and config props "source.xyz" - I am afraid there might be algorithms out there that would silently stop working, since property names are not type-checked or otherwise validated. So I would rather not alter these property names. Instead, following your comment, I renamed ItemSource to ContentItemsSource, which is still extended by both ContentSource and FacetSource. So it is not that bad now that the property names start with "content." (these properties apply to all content-item sources - that is, both doc-content and facets).
        • ItemSource.resetInputs @SuppressWarnings("unused") - this is for the IOException, which is not really thrown by that code (though it might be thrown by subclasses). Perhaps just a misconfiguration of my IDE? (I assume you have that warning disabled or something?)
        • PerfRunData Class.forName on a String rather than class.getName() - this is in fact the code pattern for all reflection actions in PerfRunData, so for the new code I just repeated whatever was there for the existing code. In fact, I thought about the same thing when adding this code. I think the difference is that if the config specifies another class, the default class would not even be loaded when using strings, but it would be loaded (though not used) when using Class.getName(). It is not a big performance issue, but it does seem "cleaner" to not load classes that are not going to be used. In any case, if we would like to change that, we should revisit all the .forName(String) calls on default impls in the benchmark (and I think there are a few), and that seems to not belong to this issue.
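        As an aside, the loading-vs-initialization difference described above can be seen in a small self-contained demo (LazyInitDemo and DefaultImpl are made-up names, not Lucene code): evaluating a class literal does not run the class's static initializers, while Class.forName with default arguments does.

```java
public class LazyInitDemo {
    static boolean initialized = false;

    static class DefaultImpl {
        // Runs only when the class is initialized, not when it is merely loaded.
        static { LazyInitDemo.initialized = true; }
    }

    public static void main(String[] args) throws Exception {
        // Evaluating the class literal loads the class but does NOT
        // trigger static initialization (JLS 12.4.1).
        String name = DefaultImpl.class.getName();
        boolean afterLiteral = initialized;

        // Class.forName(name) initializes the class by default.
        Class.forName(name);
        boolean afterForName = initialized;

        System.out.println(afterLiteral + " " + afterForName); // prints "false true"
    }
}
```

        This supports the point above: keeping the default class name as a plain String defers even loading it until it is actually needed.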

        Thanks for the review, working on a new patch - there are several copy/paste errors in the code where a CloseTaxonomyReader by mistake sets the PFD IndexReader to null...

        Shai Erera added a comment -

        ItemSource.resetInputs

        I don't have that warning turned on in Eclipse. I disabled it for exactly this reason.

        ItemSource rename

        The new name is ok, and the properties fit it better. BTW, if you wanted the .algs out there to not fail silently, you could add some code to setConfig that checks for these outdated properties and throws a proper exception. But I'm ok with the solution you chose.

        PFD.readers.incRef()

        The javadocs are good. I'd also add: "<b>NOTE:</b> if you no longer need that IndexReader/TaxoReader, you should decRef()/close() it after calling this method". Otherwise, the IR/TR will just stay open...
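        The ownership convention being documented here can be illustrated with a small self-contained sketch (RefCounted and Holder are made-up names, not Lucene classes): the setter incRef()s what it receives, so a caller that no longer needs the object should decRef() it after the call, and the resource stays alive exactly as long as someone holds a reference.

```java
import java.util.concurrent.atomic.AtomicInteger;

class RefCounted {
    final AtomicInteger refs = new AtomicInteger(1); // the creator holds one ref
    boolean closed = false;

    void incRef() { refs.incrementAndGet(); }

    void decRef() {
        if (refs.decrementAndGet() == 0) {
            closed = true; // release underlying resources
        }
    }
}

class Holder {
    private RefCounted current;

    /** Takes ownership: internally performs incRef(), like PerfRunData's setters. */
    synchronized void set(RefCounted r) {
        if (r != null) r.incRef();
        if (current != null) current.decRef(); // release the previously held object
        current = r;
    }
}
```

        With this convention, a caller that opens a RefCounted, passes it to Holder.set(), and then decRef()s its own reference leaves the object open (the holder still has a ref); forgetting the decRef() is exactly the leak the NOTE warns about.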

        Doron Cohen added a comment -

        Updated patch with a test, more javadocs, and a comment as Shai suggested.

        I think this is ready to commit.

        More tests are needed, and search with facets is also missing, but that can go in a separate issue.

        Shai Erera added a comment -

        I think this is ready to commit.

        +1. Perhaps just add a CHANGES entry?

        but that can go in a separate issue

        I think it's better if we resolve it in that issue, and maybe rename the issue to "Facet benchmarking framework". You can still commit the current progress because it is 'whole' - covering the indexing side. I've worked on issues before that had several commits, so this will not be the first one.

        We should also run some benchmark tests, describing clearly the data sets, but this can be done under a separate issue.

        Doron Cohen added a comment -

        changes entry

        Right, I always forget to include it in the patch and add it only afterwards; I should change that...

        Also, I am not comfortable with the use of a config property in AddDocTask to tell it that facets should be added - it seems too implicit to me. So I think it would be better to refactor the doc creation in AddDoc into a method, and add an AddFacetedDocTask that extends AddDoc and overrides the creation of the doc to be added, calling super and then adding the facets to it.

        Doron Cohen added a comment -

        Actually, since the doc is created at setup(), it is sufficient to make the doc protected (it was private). Also, that with.facets property is useful for comparisons, so I kept it (now used only in AddFacetedDocTask) but modified its default to true.

        Shai Erera added a comment -

        Also that with.facets property is useful for comparisons

        What do you mean? Someone can use AddFacetedDocTask w/ and w/o facets? What for? (sorry, but you didn't post a patch, so I cannot see what this is about)

        Doron Cohen added a comment -

        Someone can use AddFacetedDocTask w/ and w/o facets? What for?

        It is useful for specifying the property like this:

        with.facets=facets:true:false
        ...
        { "MAddDocs" AddFacetedDoc > : 400
        

        and then getting in the report something like this:

        ------------> Report sum by Prefix (MAddDocs) and Round (4 about 4 out of 42)
        Operation    round facets   runCnt   recsPerRun        rec/s  elapsedSec
        MAddDocs_400     0   true        1          400       246.61        1.62
        MAddDocs_400 -   1  false -  -   1 -  -  -  400 -   1,801.80 -  -   0.22
        MAddDocs_400     2   true        1          400       412.80        0.97
        MAddDocs_400 -   3  false -  -   1 -  -  -  400 -   2,139.04 -  -   0.19
        
        Doron Cohen added a comment -

        Updated patch according to Shai's comments and with AddFacetedDoc task.

        Shai Erera added a comment -

        Ahh, forgot about iterations. It does indeed look useful then.

        Perhaps mention facet.source in AddFacetedDocTask?

        I'm +1 for committing the current progress, but keep this issue open for the search side (to complete the framework).

        Gilad Barkai added a comment -

        Doron, great patch!

        I ran it and was somewhat surprised at the large overhead of the facet indexing. Digging deeper, I found the number of random facets to be 1-120 per document, with a depth of 1-8. I believe those are overkill requirements. I reduced them to 1-20 per document with a depth of 1-3 and got results I could live with.
        Those numbers are scenario dependent, but I think most cases I have encountered are closer to my proposed numbers. What do you think?

        Also, I changed the alg to consume the entire content source.

        I would suggest renaming max.facet.length (in the alg) & maxFacetLengh (in the code) to max.facet.depth and maxFacetDepth. Depth seems more appropriate.

        Other than that - I'm thrilled to have a working benchmark with facets - thanks!

        Shai Erera added a comment -

        Those number are scenario dependent but I think most cases I encountered are closer to my proposed numbers

        I think we should maybe have a Wiki page or something with several .alg files that test different scenarios. I don't think 1-120 is a scenario we shouldn't test at all. Rather, we should describe, either in a Wiki or a JIRA issue, the performance results for each scenario. And if the results are suspicious for a particular scenario, dig deeper and understand why.

        So given that you know the numbers from above were run with that many facets per document, do the numbers make sense? Or you still think they're high?

        I would suggest renaming max.facet.length (in the alg) & maxFacetLengh (in the code) to max.facet.depth and maxFacetDepth

        +1.

        Doron Cohen added a comment -

        I reduced those to 1-20 per document with depth of 1-3 and got results I could live with.

        I agree; I tried this too now and the comparison is more reasonable.
        Perhaps what the reasonable numbers are (for #facets/doc and their depth) is debatable, but I agree that 200 facets per document is too many.

        Changing the defaults to 20/3 and preparing to commit.

        Doron Cohen added a comment -

        Committed to 3x in r1180637, thanks Gilad!
        Now porting to trunk; it is more involved than anticipated because of contrib/modules differences.
        I managed to make the tests pass and the benchmark alg of choice to run.
        However, I noticed that in 3x that alg - when indexing reuters - added the entire collection, that is 21578 docs, while in trunk it only added about 400 docs. Might be something in my set-up, digging...

        Doron Cohen added a comment -

        After manually removing benchmark/{work,temp}, the reuters collection was correctly extracted and in trunk the alg runs the same as in 3x.
        Committed to trunk: r1180674.
        Keeping issue open to handle faceted search benchmarking.


          People

          • Assignee:
            Doron Cohen
          • Reporter:
            Shai Erera
          • Votes:
            0
          • Watchers:
            3
