Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.4, 4.0-ALPHA
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Faceting is a hugely important feature, available in Solr today but
      not [easily] usable by Lucene-only apps.

      We should fix this, by creating a shared faceting module.

      Ideally, we factor out Solr's faceting impl, and maybe poach/merge
      from other impls (eg Bobo browse).

      Hoss describes some important challenges we'll face in doing this
      (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:

      To look at "faceting" as a concrete example, there are big reasons 
      faceting works so well in Solr: Solr has total control over the 
      index, knows exactly when the index has changed so it can rebuild caches, has a 
      strict schema so it can make sense of field types and 
      pick faceting algos accordingly, has a multi-phase distributed search 
      approach to get exact counts efficiently across multiple shards, etc...
      (and there are still a lot of additional enhancements and improvements 
      that can be made to take even more advantage of the knowledge Solr has because 
      it "owns" the index, that no one has had time to tackle)
      
      

      This is a great list of the things we face in refactoring. It's also
      important because, if Solr needed to be so deeply intertwined with
      caching, schema, etc., other apps that want to facet will have the
      same "needs" and so we really have to address them in creating the
      shared module.

      I think we should get a basic faceting module started, but should not
      cut Solr over at first. We should iterate on the module, fold in
      improvements, etc., and then, once we can fully verify that cutting
      over doesn't hurt Solr (i.e. lose functionality or performance), do
      the cutover.

      1. LUCENE-3079.patch
        53 kB
        Robert Muir
      2. LUCENE-3079_4x.patch
        52 kB
        Robert Muir
      3. LUCENE-3079_4x_broken.patch
        50 kB
        Robert Muir
      4. facet-userguide.pdf
        283 kB
        Shai Erera
      5. TestPerformanceHack.java
        12 kB
        Toke Eskildsen
      6. LUCENE-3079-dev-tools.patch
        4 kB
        Shai Erera
      7. LUCENE-3079.patch
        1.49 MB
        Robert Muir
      8. LUCENE-3079.patch
        1.55 MB
        Shai Erera
      9. LUCENE-3079.patch
        42 kB
        Stefan Trcek

        Issue Links

          Activity

          Stefan Trcek added a comment -

          This patch was generated by git and tested to apply with
          patch -p0 -i LUCENE-3079.patch --dry-run
          Please bear with me if anything goes wrong.

          Review starting points may be

          • FacetSearcherTest.testSimpleFacetWithIndexSearcher() or
          • FacetSearcher.facetCollectSearch()

          Functions.java may be dismissed in favor of Guava.
          If you are willing to keep it I'll strip it down to the required parts.

          ----------------------------------------------------------------------

          The implementation relies on the field cache only: no index
          schema, no cached filters, etc. It supports

          • single valued facets (Facet.java)
          • multi valued facets (Facet.MultiValued.java)
          • facet filters (see FacetSearcher.java)
          • evaluation of facet values that would otherwise be dismissed
            due to other facet filters (Yonik says Solr calls this
            "multi-select faceting"); realized by
            FacetSearcher.fillFacetsForGuiMode()

          Let me explain the last point: For the user a facet query
            (color==green) AND (shape==circle OR shape==square)
          may look like

          Facet color
          [ ] (3) red
          [x] (5) green
          [ ] (7) blue

          Facet shape
          [x] (9) circle
          [ ] (4) line
          [x] (2) square

          The red/blue/line facet values will still display even though the
          corresponding documents are not in the result set. There is also
          support for filtered facet values with zero results, so users
          understand why they do not get results.
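The behavior above can be sketched in plain Java. This is a toy in-memory model with made-up names (MultiSelectSketch, countDimension), not the patch's actual API: when counting a dimension, every other dimension's filter is applied but the dimension's own filter is deliberately dropped, which is why the unchecked red/blue/line values still get counts.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MultiSelectSketch {
    // A document with two facet dimensions, hardcoded for brevity.
    public record Doc(String color, String shape) {}

    private static String value(Doc d, String dim) {
        return dim.equals("color") ? d.color() : d.shape();
    }

    // Count values of `dim` over all docs that pass the selections of every
    // OTHER dimension; the dimension's own filter is deliberately skipped,
    // which is what makes unchecked values still show their counts.
    public static Map<String, Integer> countDimension(
            List<Doc> docs, String dim, Map<String, Set<String>> selections) {
        Map<String, Integer> counts = new HashMap<>();
        for (Doc d : docs) {
            boolean passesOthers = true;
            for (Map.Entry<String, Set<String>> e : selections.entrySet()) {
                if (e.getKey().equals(dim) || e.getValue().isEmpty()) continue;
                if (!e.getValue().contains(value(d, e.getKey()))) {
                    passesOthers = false;
                    break;
                }
            }
            if (passesOthers) counts.merge(value(d, dim), 1, Integer::sum);
        }
        return counts;
    }
}
```

With selections color={green} and shape={circle,square}, the color counts are computed against the shape filter only, so documents with red or blue still contribute to the color facet.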

          Yonik Seeley added a comment -

          if Solr needed to be so deeply intertwined with caching, schema, etc., other apps that want to facet will have the same "needs"

          Sort of an aside, but not really.... specific applications are much easier. A lot more indirection is required in Solr and a schema is needed for pretty much everything. Without the schema, a client would specify "sort=foo desc" and Solr would have no idea how to do that. A specific application just does it because they have the knowledge of what all the fields are. It's why people have gotten along just fine without a schema in Lucene thus far. If you're building another Solr... yes, you need something like a schema.

          Jason Rutherglen added a comment -

          Schemas should probably be a module that makes use of embedding the field types per segment; this is something the faceting module could/should use. I think that is what LUCENE-2308 is aiming for? Though I thought there was another Jira issue created by Simon for this as well.

          Chris Male added a comment -

          I don't think any Facet module needs to be concerned with Schemas. Instead the module can expose an API which asks for the information it needs to make the best choices. Solr can then provide that information based on its Schema, pure Lucene users can do it however they want.

          Jason Rutherglen added a comment -

          I don't think any Facet module needs to be concerned with Schemas

          Right, they should be field type aware.

          Shai Erera added a comment -

          I would like to contribute IBM's faceted search package (been wanting to do that for quite a while). The package supports the following features/capabilities (at a high-level):

          • Taxonomy index – manages trees of 'categories'. You can view examples of taxonomies at, e.g., the Open Directory Project.
            • It's a Lucene index managed alongside the content index.
            • Builds the taxonomy on-the-fly (i.e. as categories are discovered).
            • In general it maps a category hierarchy to ordinals (integers). For example, the category /date/2011/06/24 will create the following entry in the taxonomy index:
              • /date, ordinal=1
              • /date/2011, ordinal=2
              • /date/2011/06, ordinal=3
              • /date/2011/06/24, ordinal=4
          • FacetsDocumentBuilder which receives a list of categories that are associated w/ the document (can be of several dimensions) and:
            • Fetches the ordinals of the category components from the taxonomy index (adding them to it on-the-fly).
            • Indexes them in a (compressed) payload for the document (so for the above category example, 4 payloads will be indexed for the document).
            • FDB can be used to augment a Document with other fields for indexing (it adds its own Field objects).
          • FacetsCollector receives a handle to the taxonomy, a list of facet 'roots' to count and returns the top-K categories for each requested facet:
            • The root can denote any node in the category tree (e.g., 'count all facets under /date/2011')
            • top-K can be returned for the top most K immediate children of root, or any top-K in the sub-tree of root.
          • Counting algorithm (at a high-level):
            • Fetch the payload for every matching document.
            • Increment by 1 the count of every ordinal that was encountered (even for facets that were not requested by the user)
            • After all ordinals are counted, compute the top-K on the ones the user requested
            • Label the result facets
          • Miscellaneous features:
            • Sampling algorithm allows for more efficient facets counting/accumulation, while still returning exact counts for the top-K facets.
            • Complements algorithm allows for more efficient facets counting/accumulation, when the number of results is > 50% of the docs in the index (we keep a total count of facets, count facets on the docs that did not match the query and subtract).
              • Complements can be used to count facets that do not appear in any of the matching documents (of this result set). This does not exist in the package though ... yet.
            • Facets partitioning – if the taxonomy is huge (i.e. millions of categories), it is better to partition them at indexing time, so that search time is faster and consumes less memory. Note that this is required because of the approach of counting all (allocating a count array) and then keeping only the results of interest.
            • Category enhancements allow storing 'metadata' with categories in the index, so that more than simple counting can be implemented:
              • Weighted facets (built on top of enhancements) allow associating a weight w/ each category and using smarter counting techniques at runtime. For example, if facets are generated by an analytics component, the confidence level can be set as the category's weight. If tags are indexed as facets (e.g. for generating a tag cloud for the result set), the number of times the document was tagged with the tag can be set as the tag's weight.
            • The fact that facets are indexed in the payloads of documents allows managing very large taxonomies and indexes without blowing up the RAM at runtime (but incurs some search performance overhead). However, the payloads can be loaded up into RAM (like in FieldCache), in which case runtime becomes much faster.
              • How facets are stored is abstracted by a CountingList API, so we can definitely explore other means of storing the facet ordinals. Actually, the CountingList API allows us to read the ordinals from disk or RAM w/o affecting the rest of the algorithm at all.
            • I did not want to dive too deep on the API here, but the runtime API is very extensible and allows one to use FacetsCollector for the simple cases, or lower-level API to get more control on the process. You can look at: FacetRequest, FacetSearchParams, FacetResult, FacetResultNode, FacetsCollector, FacetsAccumulator, FacetsAggregator for a more extensive set of API to use.
          • The package comes with example code which shows how to use the different features I've mentioned. There are also unit tests ensuring the example code works.
          • The package comes with a very extensive test suite and has been in use by many of our products for a long time, so I can state that it's very stable.
          • Some rough performance numbers:
            • Collection of 1M documents, few hierarchical facets per document (we call it the 'Amazon' case)
            • Query matches 49.8% of the docs (~500K)
            • No sampling / complements were used.
            • Facets read from disk
            • Takes 0.4 seconds to execute the query and count facets.

          It will take me a few days to prepare a patch (lots of code) - will upload it as soon as it's ready.
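The two core mechanisms described above, assigning an ordinal to every prefix of a category path on the fly and counting into a single int[] indexed by ordinal before computing top-K, can be illustrated with a small stand-alone sketch. Class and method names here are mine for illustration, not the contributed package's API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TaxonomySketch {
    // Path prefix -> ordinal, grown on the fly (mimics the taxonomy index).
    private final Map<String, Integer> ordinals = new LinkedHashMap<>();

    // For "date/2011/06/24" returns the ordinals of /date, /date/2011,
    // /date/2011/06 and /date/2011/06/24, creating missing entries on the
    // fly just as the taxonomy index does at indexing time.
    public int[] addCategory(String path) {
        String[] parts = path.split("/");
        int[] result = new int[parts.length];
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            prefix.append('/').append(parts[i]);
            result[i] = ordinals.computeIfAbsent(
                    prefix.toString(), k -> ordinals.size() + 1);
        }
        return result;
    }

    // Count every ordinal seen in the matching docs into one int[] and
    // return the label of the most frequent ordinal (top-1 for brevity;
    // the real algorithm computes top-K and only then labels the results).
    public String topCategory(List<int[]> matchingDocs) {
        int[] counts = new int[ordinals.size() + 1];
        for (int[] doc : matchingDocs) {
            for (int ord : doc) counts[ord]++;
        }
        int best = 1;
        for (int ord = 2; ord < counts.length; ord++) {
            if (counts[ord] > counts[best]) best = ord;
        }
        for (Map.Entry<String, Integer> e : ordinals.entrySet()) {
            if (e.getValue() == best) return e.getKey();
        }
        return null;
    }
}
```

Note how `addCategory("date/2011/06/24")` yields ordinals 1 through 4, matching the /date example in the comment, and how labeling happens only after counting is done.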

          Jan Høydahl added a comment -

          Bravo Shai & IBM!

          Ryan McKinley added a comment -

          Bravo Shai & IBM!

          +1! This sounds awesome, and I hope will prove how modules will help lucene and solr

          Toke Eskildsen added a comment -

          This is quite a different design from the quarter-baked one I've proposed in SOLR-2412 (which is really just a thin wrapper around LUCENE-2369). While maintaining a sidecar index makes the workflow more complicated, I would expect it to be beneficial for re-open speed and scalability.

          Technical note: For hierarchical faceting, I find that it is possible to avoid storing all levels in the hierarchy. By maintaining two numbers for each tag, denoting the tag's level and the level of the previous tag that matches, only the relevant tags need to be indexed (full explanation at https://sbdevel.wordpress.com/2010/10/05/fast-hierarchical-faceting/).
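A rough sketch of one reading of the two-numbers-per-tag idea (this is an illustration of the described bookkeeping, not necessarily the linked post's exact algorithm): for each tag in sorted order, record its depth and the depth of the longest path prefix it shares with the previous tag, so ancestors can be reconstructed during counting without indexing every ancestor path explicitly.

```java
import java.util.List;

public class HierarchySketch {
    // For each tag in sorted order, returns {level, sharedPrefixLevel}:
    // the tag's own depth, and how many leading path components it shares
    // with the previous tag in the list.
    public static int[][] levels(List<String> sortedTags) {
        int[][] out = new int[sortedTags.size()][2];
        for (int i = 0; i < sortedTags.size(); i++) {
            String[] cur = sortedTags.get(i).split("/");
            out[i][0] = cur.length;              // tag level (depth)
            int match = 0;
            if (i > 0) {
                String[] prev = sortedTags.get(i - 1).split("/");
                while (match < Math.min(cur.length, prev.length)
                        && cur[match].equals(prev[match])) {
                    match++;
                }
            }
            out[i][1] = match;                   // shared-prefix level
        }
        return out;
    }
}
```

For the sorted tags A/B/C and A/D this yields {3,0} and {2,1}: the second tag is at depth 2 and shares one level (A) with its predecessor, so /A need not be stored as its own tag.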

          Kudos for contributing solid code. I am looking forward to seeing the patch.

          Shai Erera added a comment -

          Thanks Toke for the pointer. I think it's very interesting. We actually explored, in the past, storing just the leaf category instead of the entire hierarchy in the document. The search response time was much slower than what I reported above (nearly a 2x slowdown). While storing the entire hierarchy indeed consumes more space, it performs better at search time, and we figure that space today is cheap; search apps are usually more interested in faster search response times and are willing to spend more time in the indexing and analysis stages.

          Nevertheless, the link you provided proposes an interesting way to manage the hierarchy, and I think it's worth exploring at some point. It could be that it will perform better than how we managed it when we indexed just the leaf category for each document. We'd also need to see how to update the taxonomy on the go. For example, it describes that for A/B/C you know that its level is 3 (that's easy) and that the previous category/tag that matches (P) is A. But what if at some point A/B is added to a document? What happens to the data indexed for the doc w/ A/B/C, whose previous matching category is now A/B? It's not clear to me, but it could be that I've missed this in the proposal's description.

          I am very close to uploading the patch. Hopefully I'll upload it by the end of my day.

          Toke Eskildsen added a comment -

          SOLR-2412/LUCENE-2369 were created with these trade-offs: (relatively) long startup, low memory, high performance. When the index is (re)opened, the hierarchy is analyzed by iterating the terms (this could be offloaded to index time, but it is still iterate-the-entire-term-list after each change). This does not play well with real-time, but should be a nice fit for large indexes with a low update rate.

          As for speed, my theory is that the sparser hierarchy (only the concrete paths) wins due to less counting, but without another solution to compare against it has so far remained a theory. There are some measurements at https://sbdevel.wordpress.com/2010/10/11/hierarchical-faceting/ but I find that for hierarchical faceting, small changes to test-setups can easily have vast implications on performance, so they are not comparable to your million-document test.

          Shai Erera added a comment -

          Attached patch includes the faceted search module. It currently compiles
          against 3x, so I've put it under lucene/contrib, but after we port it to
          trunk, it should be under modules/.

          There isn't a short way to describe the content of the patch (as you can
          see, it's huge), so instead I'll give a brief overview of some key
          packages:

          • src/examples: contains code of different capabilities of the facets module. I'd start w/ examples/simple.
          • src/test: contains many tests, great place to start too
          • o.a.l.facet.taxonomy contains the taxonomy index management code. There are two interfaces TaxonomyWriter/Reader with a LuceneTW/TR impl
          • o.a.l.facet.index contains the indexing code of the different capabilities (e.g. simple, enhancements, etc.)
          • o.a.l.facet.search contains the respective search code

          Few points:

          • I've put the ASL on all files.
          • Marked all code as @lucene.experimental.
          • After you apply the patch you can run 'ant eclipse' and it will build the facets code in eclipse (no maven integration yet - will need to look into it)
          • Under o.a.l and o.a.l.util there are several utility classes that are not specific to faceted search, but which the facets code uses. I've kept them there so that we can review and decide whether we want to move them to lucene-core at some point.

          TODOs:

          • After it's on trunk, I think we should explore replacing the payloads w/ DocValues.
          • Leverage Lucene's superb random testing framework!
          • There are few TODOs in the code which I think can be addressed following this issue.

          I will open follow-on issues for those.

          Given the amount of code, I am wondering if perhaps we should commit it
          as-is, and do more thorough reviews afterwards. The code does not modify
          any existing code (aside from a tiny change to LuceneTestCase), so I
          think there's no risk in doing so. I am also not sure that it's sane to
          review that amount of code (nearly 40K lines) in patch form. What do you
          think?

          Chris Male added a comment -

          Great contribution Shai. What about putting it into a branch? I think it really does need a thorough review before we put it into trunk.

          Shai Erera added a comment - - edited

          We can put it in a branch for trunk, in case we plan to refactor the code right away (at first I just thought to get it to compile against trunk). I thought that at first people would like to get hands-on experience with it before we discuss changes and refactoring. I mean, this code can really be released with Lucene's next 3x release. And since everything is @lucene.experimental, and it is in its own separate contrib/module, I don't think a branch will ease the review or refactoring process?

          I guess what I'm aiming for is for our users to get this feature soon. And I'm afraid that putting it in a branch will only delay it.

          Robert Muir added a comment -

          just some trivial test modifications so the tests work with an unmodified LuceneTestCase:

          • in some cases, if an assertion failed the test would print the seed... but LTC already does this.
          • in other tests, the test wanted to repeat a random sequence; instead of exposing LTC internals, the test just grabs random.nextLong(), makes a new Random from that value, and later resets it with setSeed().
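          The replay pattern in the second bullet can be sketched standalone (plain java.util.Random; the shared `random` below is just a stand-in for LuceneTestCase's Random, not the actual LTC code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SeedReplayDemo {

    /** Draws 'count' ints from r. */
    static List<Integer> draw(Random r, int count) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            out.add(r.nextInt());
        }
        return out;
    }

    public static void main(String[] args) {
        Random random = new Random();    // stands in for the test's shared Random
        long seed = random.nextLong();   // grab a child seed from it...
        Random r = new Random(seed);     // ...without touching LTC internals

        List<Integer> first = draw(r, 5);
        r.setSeed(seed);                 // reset to replay the exact same sequence
        List<Integer> second = draw(r, 5);

        System.out.println(first.equals(second)); // prints: true
    }
}
```

          Since java.util.Random is fully determined by its seed, the second pass reproduces the first exactly.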
          Robert Muir added a comment -

          I guess what I'm aiming for is for our users to get this feature soon. And I'm afraid that putting it in a branch will only delay it.

          +1

          My suggestion:

          1. commit to branch 3.x with @experimental.
          2. next, do a "fast" port to trunk; this doesn't mean heavy refactoring to take advantage of things like docvalues, just getting it working correctly on trunk's APIs.
          3. finally, close this issue and do improvements as normal, backporting whichever ones are easy and make sense, like any other issue.
          Shai Erera added a comment -

          Thanks Robert for the fix. This indeed looks better than patching LTC!

          Patch for dev-tools only, this time w/ Maven support too. I hope it works well.

          Shai Erera added a comment -

          My suggestion:

          1. commit to branch 3.x with @experimental.
          2. next, do a "fast" port to trunk; this doesn't mean heavy refactoring to take advantage of things like docvalues, just getting it working correctly on trunk's APIs.
          3. finally, close this issue and do improvements as normal, backporting whichever ones are easy and make sense, like any other issue.

          I agree. I'll give it a day or two before I commit, unless everyone agrees it can be committed today, in which case I'll happily press the button.

          Toke Eskildsen added a comment -

          The patch compiles neatly against a 3x stable checkout and the examples worked fine. Unfortunately I could not locate the million document test you mentioned above. Is it part of the patch?

          Shai Erera added a comment -

          Unfortunately I could not locate the million document test you mentioned above. Is it part of the patch?

          It's not included. We wrote a proprietary (not extending Benchmark) test and would like to extend Lucene's benchmark for benchmarking facets. I think it will be interesting to define some facets on the Wikipedia collection, and benchmark it.

          Simon Willnauer added a comment -

          Shai, MASSIVE PATCH! Phew, I don't really want to review this entire thing in a patch, but at first glance this looks very good. Lots of tests, documentation etc., and obviously this has been used in production so it's seen some CPU cycles. I think Robert's proposed way is good!

          +1

          My suggestion:

          commit to branch 3.x with @experimental.
          next, do a "fast" port to trunk; this doesn't mean heavy refactoring to take advantage of things like docvalues, just getting it working correctly on trunk's APIs.
          finally, close this issue and do improvements as normal, backporting whichever ones are easy and make sense, like any other issue.

          here is my +1

          Toke Eskildsen added a comment - - edited

          Some preliminary performance testing: I hacked together a test where 1M documents were created with an average of 3.5 paths each, down to a depth of 4 (resulting in 1.4M unique paths). A search that hit every other document was issued and the top 5 facets/tags were requested. Hopefully this is somewhat similar to your test.

                                                     LUCENE-3079   LUCENE-2369
          Index build time                           52 s          23 s
          Memory required for indexing               192 MB        48 MB
          First facet request                        432 ms        12,000 ms
          Best of 5 requests                         228 ms        159 ms
          Memory usage after faceting (after gc())   21 MB         22 MB

          Upping the ante to 5M documents, 6.8 paths/doc, max depth 6 (22M unique paths) resulted in

                                                     LUCENE-3079   LUCENE-2369
          Index build time                           752 s         238 s
          Memory required for indexing               2500 MB       128 MB
          First facet request                        3370 ms       147,000 ms
          Best of 5 requests                         2400 ms       2673 ms
          Memory usage after faceting                435 MB        294 MB

          Scaling down to 100K documents, 1.6 paths/doc, max depth 4 (63K unique paths) resulted in

                                                     LUCENE-3079   LUCENE-2369
          Index build time                           5317 ms       2563 ms
          Memory required for indexing               48 MB         32 MB
          First facet request                        245 ms        1425 ms
          Best of 5 requests                         15 ms         8 ms
          Memory usage after faceting                1 MB          2 MB

          Some observations: it seems clear that some trade-offs are very different for the two solutions. LUCENE-3079 has brilliant startup time but slows indexing a bit throughout the whole indexing process. LUCENE-2369 is dog slow at startup but does not impact indexing. They seem similar with regard to search-time speed and memory usage.

          Now, LUCENE-2369 patches a semi-random Lucene 4 (trunk) checkout, so this is not a fair test. Likewise, the tests were just quick hacks; the disk cache was not flushed, the laptop was used for browsing, etc. When LUCENE-3079 is brought up to trunk, a proper comparison can be made.

          I am a bit worried about the observed memory usage for index build. It seems that LUCENE-3079 uses a lot of heap there? I created the documents one at a time just before adding them to the index, so the memory usage is for the writers and a quick profile told me that it was mainly used for int-arrays. Does that sound right?

          Shai Erera added a comment -

          You write LUCENE-3097, which is about "post group faceting", while this issue is LUCENE-3079. I assume you meant the latter, but want to confirm. You also write LUCENE-2309, which is about decoupling IW from Analyzers. Are you perhaps referring to a Solr issue, or a different Lucene issue? If so, can you please let me know which one?

          This is a great test, and it matches more or less the test we've been running. Is it in 'benchmark' form? Can you post it on this issue so I can try the same?

          What do you mean by "top 5 facets/tags"? If I were to speak of dimensions, where a dimension is like "tags", "authors", "date", then do you mean you've requested to count 5 dimensions, or that you indexed just one dimension (i.e. one "root") and requested to fetch the top-5 results for it? I assume it's the latter, but again, confirming my understanding.

          So assuming I understood correctly the terminology and test setup, you execute one query which matches 50% of the documents and ask to count the top-5 facets under a single "root"/"dimension", and record the time as 'first facet request'. And then you execute it 4-5 additional times, and record 'best of 5 requests'. Do I understand it correctly?
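          That measurement protocol (one cold request, then the best of several warm repeats) can be captured in a small harness for anyone reproducing the numbers. This is a generic sketch, not code from the patch; `facetRequest` is a hypothetical stand-in for the real search-and-count call:

```java
public class ColdWarmTimer {

    /** Runs task once cold, then 'warmRuns' more times; returns {coldNanos, bestWarmNanos}. */
    static long[] coldAndBest(Runnable task, int warmRuns) {
        long start = System.nanoTime();
        task.run();                                // the "first facet request"
        long cold = System.nanoTime() - start;

        long best = Long.MAX_VALUE;                // the "best of N requests"
        for (int i = 0; i < warmRuns; i++) {
            start = System.nanoTime();
            task.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return new long[] { cold, best };
    }

    public static void main(String[] args) {
        // Hypothetical stand-in: the real task would execute the query + facet counting.
        Runnable facetRequest = () -> { long x = 0; for (int i = 0; i < 1_000_000; i++) x += i; };
        long[] t = coldAndBest(facetRequest, 5);
        System.out.printf("first: %d ns, best of 5: %d ns%n", t[0], t[1]);
    }
}
```

          The cold number captures cache warm-up (e.g. FieldCache loading), while best-of-N approximates steady-state latency.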

          One difference between the two approaches (assuming you're referring to a faceting approach that uses the FieldCache) is that by default, the faceting approach here reads everything from disk. So it would be interesting to run w/ the facets-in-memory feature.

          I don't know how to relate to the memory usage – on the last test it consumed 50% less than the other approach, on the first it consumed nearly the same, and on the second test it consumed 150% more. This is odd. Do you trust this measurement?

          The 'first facet request' result is not surprising, because it takes time to warm up the FieldCache (assuming that's what you use).

          I am interested in the memory observed for indexing because that too seems to fluctuate. I.e., in the second test the difference is nearly x20, which is weird.

          Also, the difference in indexing time is interesting, as it is not very consistent either. And I find the x2 factor suspicious - I would like to understand it better. Since trunk reports improved indexing speed by a large factor (nearly 200%), I think it will be wise if we wait with this comparison until I bring the patch up to trunk.

          I like it that you test the default behavior. I think it's very important that we have the greatest out-of-the-box experience. Since the two approaches read from disk and memory respectively, I first would like to test the in-memory facets feature of this approach, so we at least compare the same thing. I know that trunk plays some role here (definitely at indexing time), so we can focus on search time for now.

          This is great stuff, Toke!

          Shai Erera added a comment -

          Edited the issue title (it had an extra 'i' in 'faceting'), as well as added fix versions etc. I intend to commit the patch to the 3x branch either later today or tomorrow.

          Shai Erera added a comment -

          Created component modules/facet

          Toke Eskildsen added a comment -

          Shai: I completely messed up the JIRA-numbers, clearly I need to go home and cool my brain. It is fixed now. Sorry for the inconvenience.

          Yes, I only created a single root (one dimension) and requested the top-5 facets. You understood the timing measurements correctly. I am sorry that my table was confusing with regard to memory. The first number is the Xmx required for index build (binary search until I got bored), while the second is what the JVM reported after the faceting calls finished. For LUCENE-2369 in the middle test, the faceted search required more memory (aka a higher Xmx) than the index build (which could probably have gotten by with even less).

          I do not use the field cache for LUCENE-2369. It holds a compressed list of tag ordinals for the documents in memory, with a few levels of indirection to handle duplicates. The startup time is basically due to duplicate elimination.

          Regarding the memory difference, LUCENE-2369 does not operate at index time. This means that it is plain Lucene indexing of terms like hierarchy:a/b/c/d. Actually, I am surprised that it took 128 MB for the larger test and I should probably re-run that with a lower allocation.

          My guesstimate, based purely on observation, is that LUCENE-3079 requires heap relative to the taxonomy size at indexing time. At least with the (assumedly default) settings I used. Thus the 22M unique values in test #2 are the cause of the large memory requirement. Looking at the number of unique tags vs. index memory requirements for cases #1 and #2, the factor seems nearly linear. It seems to fit your recommendation of splitting large taxonomies?
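          The "nearly linear" reading can be sanity-checked against the two larger runs (the numbers are copied from the tables above; the comparison itself is an editorial back-of-the-envelope, not part of Toke's test):

```java
public class ScalingCheck {

    static double growth(double big, double small) {
        return big / small;
    }

    public static void main(String[] args) {
        // 1M-doc test: 1.4M unique paths needed ~192 MB Xmx to index (LUCENE-3079).
        // 5M-doc test: 22M  unique paths needed ~2500 MB Xmx to index.
        double pathGrowth = growth(22_000_000, 1_400_000); // ~15.7x more unique paths
        double memGrowth  = growth(2500, 192);             // ~13.0x more memory

        // Growth factors of the same order => memory scales roughly linearly with
        // taxonomy size; the gap is plausibly fixed JVM/indexing overhead.
        System.out.printf("paths x%.1f, memory x%.1f%n", pathGrowth, memGrowth);
    }
}
```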

          I'll upload my test class for LUCENE-3079 now. I apologize for its hackish nature - this was just meant as explorative work.

          Toke Eskildsen added a comment -

          A quick hack to get an idea of performance characteristics for hierarchical faceting.

          Shai Erera added a comment -

          resulting in 1.4M unique paths

          AND

          LUCENE-3079 requires heap relative to the taxonomy size at indexing time

          Ok, now I see what's happening. The taxonomy index maintains a category cache, which maps a category to its ordinal. It's maintained at both indexing and search time. At indexing time, it's used to quickly determine whether an incoming category already exists, and if so return its ordinal. At search time it's used to quickly label ordinals, as well as to determine hierarchy information.

          By default, LuceneTaxonomyWriter uses Cl2oTaxonomyWriterCache, which maintains a mapping of all categories to their ordinals, although the mapping is kept compact. There is another cache, LruTaxonomyWriterCache, which as its javadocs state is a "good choice for huge taxonomies".

          The taxonomy you create is HUGE by all standards (and I'm not even speaking of the 22M case). The largest 'normal' taxonomy we've seen was on the order of a few 100K nodes (when people's names and social tags were maintained), while the largest 'abnormal' taxonomy we've seen contained ~5M nodes, and that is considered a very extreme case for taxonomies.

          Just to be clear, I'm not trying to make excuses. Your perf test is still great in that it tests an extreme case. But I'd expect an extreme case to require some mods to the defaults, and using LruTWC is one of them (w/ a cacheSize of, let's say, 100-200K). Another thing I've mentioned this package can do is partitions – because at runtime we allocate an array the size of the taxonomy, in your case (1.4M nodes) we'll create an array that is ~6MB in size for every query. If you use partitions, and partition the categories into 10-100K-category buckets, you'd allocate less RAM, but might incur some search performance overhead.
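          The "~6MB for every query" figure follows from allocating one int counter per taxonomy ordinal (a back-of-the-envelope check, not code from the patch):

```java
public class CountArraySize {

    /** Bytes for an int[] with one counter per taxonomy ordinal. */
    static long countArrayBytes(int taxonomySize) {
        return (long) taxonomySize * Integer.BYTES;
    }

    public static void main(String[] args) {
        long bytes = countArrayBytes(1_400_000);  // 1.4M unique paths in the 1M-doc test
        System.out.printf("%.1f MB per facet request%n", bytes / (1024.0 * 1024.0));
        // ~5.3 MB, matching the "~6MB" estimate; with partitioning the per-request
        // array shrinks to roughly partitionSize * 4 bytes instead.
    }
}
```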

          Out of curiosity, w/ 1.4M unique paths and 1M docs, how many categories are assigned to each document, and how many documents are associated w/ each category? When we ran our test, we used a Zipf distribution for categories. If in this test we end up associating only a couple of documents per category, then this is not a very realistic scenario. And while the package can handle it (by not using defaults), perhaps we should define a scenario that makes sense (a common one, that is) and run with it?

          I don't think there can be one "right" faceted search solution, but rather a collection of tools that can match different scenarios. And if it turns out that for one case one implementation is better than another, then our job will be to create a faceted search layer which allows the user to choose what's best for him, and leave the rest of his app code unmodified.

          What do you think? Perhaps we should open a separate issue, let's call it "facet benchmarking", where we define some scenarios, work on extending the benchmark package (we have done some preliminary work there) and then compare few approaches?

          Shai Erera added a comment -

          About the TaxonomyWriterCache: looking at its API now, I think an FST-based TWC might be a good fit here. FST is known for its performance and low RAM consumption, and TWC maps from a CategoryPath (or String) to an integer, which sounds like a typical usage for FST. So we can have both the keep-all-in-RAM TWC and the LRU one use FST and consume less memory.

          I'll open an issue for that.

          One small correction - TWC is used only at indexing time, mapping from category->ordinal. For labeling ordinals you use TaxonomyReader, which maintains its own int->String cache (LRU). Can FST aid in that case as well? I assume it would consume less space than an Integer->String hash map.

          -------------

          Back to performance – Toke, did you verify that you get the same top-5 categories from both implementations? Also, can you try running the test asking for top-5 categories of a node below the root? I.e., if the paths are in the form /a/b/c/d and you request to count "/a", then I'm interested in how this performs if you ask to count "/a/b".

          Shai Erera added a comment -

          Oh Toke ... I just reviewed your test and there is a problem in it:

    CountFacetRequest facetRequest = new CountFacetRequest(
        new CategoryPath(HIERARCHICAL), num);  // top-<num> children of the root...
    facetRequest.setDepth(5);                  // ...computed recursively, 5 levels deep!
          

          You create a CountFacetRequest, requesting to count HIERARCHICAL (which is the root) and fetch the top <num>, which is 5. BUT, you set the depth of the request to 5, which means it will compute the top-5 categories at each level!

          This is a nice feature of the package: it lets you get not only the top-N child nodes of the immediate "root", but also the top-N of their child nodes, applied recursively until 'depth'. Nice, but not very performance friendly.

          Can you please rerun the test then, commenting out that line? I can run it on my laptop, but I don't have the env. setup w/ the patch from LUCENE-2369, so I cannot compare.

          In general, this looks like a very useful test. I think we can commit it too, but rename it so that it doesn't run regularly w/ our tests, but rather selectively.

          Toke Eskildsen added a comment -

          I must admit that my choices of tests were aimed more at probing edge cases than at simulating real taxonomies. Nevertheless, it is good to hear that LUCENE-3079 can be tweaked to handle them. I fully support the idea of a facet benchmarking issue - perhaps with an associated wiki page?

          As for the 1M test case, the number of tags/document was random, with the average being the stated 3.5 tags/document. Seen from the other side, there was an average of 1M*3.5/1.4M ~= 2.5 documents/tag.
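          The average quoted above is easy to verify with plain arithmetic (a quick Python check; the variable names are illustrative):

```python
# Corpus parameters from the 1M test case described above.
docs = 1_000_000
tags_per_doc = 3.5       # average tags per document
unique_tags = 1_400_000  # number of unique tags in the taxonomy

# Total tag occurrences divided by unique tags gives docs per tag.
docs_per_tag = docs * tags_per_doc / unique_tags
print(docs_per_tag)  # 2.5
```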

          I did verify the facet-results and they did fit my expectations. I will try requesting from further down the tree later - hopefully tomorrow. I am a bit confused about your protest on depth=5, but I suspect that we have different ideas of what is relevant when issuing a hierarchical request. The API states that specifying a depth of 5 will count all sub-tags until depth 5. I used the number 5 to effectively count all the way to the bottom (whoops! It should be 6 for the second case. That might explain why LUCENE-3079 was faster than LUCENE-2369 in that one, as LUCENE-2369 counted to the bottom). The reason for the complete counting was that I implicitly found this to be the "correct" behavior, internally envisioning a taxonomy of species or something similar, with the wish to get the number of unique elements at the finest level.

          Thinking about this, I now have a better understanding of the duplication of data by indexing all levels of the paths. This speeds up shallow counting tremendously.
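          The duplication mentioned above - indexing every level of a category path - can be sketched as simple prefix expansion (Python; this is an illustration of the idea, not the module's actual indexing code):

```python
def expand_path(path, sep="/"):
    """Expand a category path into all of its prefixes.

    Indexing every prefix means a shallow facet request only has to
    count ordinals that are present directly on each document, instead
    of walking down the tree at query time.
    """
    parts = path.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]

print(expand_path("root/a/b"))  # ['root', 'root/a', 'root/a/b']
```

The trade-off is visible in the benchmark numbers below: the duplicated prefixes cost index size and build time, but shallow counting gets much faster.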

          All this confusion supports the need for a coordinated effort to get some test cases with clear goals and realistic data.

          Shai Erera added a comment -

          I fully support the idea of a facet benchmarking issue - perhaps with an associated wiki page?

          Yes. And on the Wiki page you should clearly describe the scenario that is tested, along with results for 'default config' + 'optimized config'. That way, a user coming to the page can pick the scenario that best matches his app and configure the facets package (whatever we end up with) accordingly.

          Seen from the other side, there were an average of 1M*3.5/1.4M ~= 2.5 documents/tag

          That's indeed an extreme case. We've seen it when an analytics module extracted facets automatically from documents (such as places, people etc.), and in that case the taxonomy was very 'flat' and wide.

          I am a bit confused about your protest on depth=5

          I did not protest. Actually, the common scenario is to count the immediate children and fetch the top-K, and I thought that that's what you do in LUCENE-2369. But counting all the way down is a valid scenario - and shows another reason why we should have a benchmark page with clear descriptions.

          Thinking about this, I now have a better understanding of the duplication of data by indexing all levels of the paths. This speeds up shallow counting tremendously.

          It actually speeds up counting overall. If you think about it, when we encounter category ordinals, we just increment the count by 1 in the respective location in the count array. No need to ask whether this is an ordinal the user asked to count at all. Later, when we compute the top-K, we know which root ordinal the user requested to count, and its children, so it's just a matter of putting everything into a heap and returning the top-K.
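          A minimal sketch of that counting strategy (Python; the names are illustrative, and the real module works on category ordinals stored per document): count everything unconditionally into a flat array, then heap-select the top-K among the requested node's children.

```python
import heapq

def count_and_top_k(doc_ordinals, num_ordinals, children_of_root, k):
    """Brute-force faceted counting as described above.

    Phase 1: bump a slot in a flat count array for every ordinal seen,
    with no per-ordinal filtering. Phase 2: among the children of the
    requested root, heap-select the top-k by count.
    """
    counts = [0] * num_ordinals
    for ords in doc_ordinals:          # one list of ordinals per matching doc
        for o in ords:
            counts[o] += 1
    return heapq.nlargest(k, children_of_root, key=lambda o: counts[o])

docs = [[1, 2], [1, 3], [2], [1]]      # ordinals per document
print(count_and_top_k(docs, 4, [1, 2, 3], 2))  # [1, 2]
```

The point of the design is that phase 1 stays a branch-free array increment; all request-specific work is deferred to the much cheaper phase 2.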

          All this confusion supports the need for a coordinated effort to get some test cases with clear goals and realistic data.

          Indeed. And we shouldn't pursue only 'realistic data', but edge cases too. As long as everything is clearly documented, it should be easy to interpret results.

          I think that setting up facet benchmarking is more important than working on improving any implementation, mostly because it will allow measuring how much the improvements really help. I'll open an issue for that.

          Toke Eskildsen added a comment - edited

          I re-ran the edge case 5M documents, 6.8 paths/doc, max depth 6 (22M unique paths) to drill down to level 6 (d6) and to level 1 (d1). A test with the path 'root/a/b' as the starting point for the facet request was added. I also checked that indexing does indeed run with 48MB for LUCENE-2369, but again this is just plain Lucene indexing.

                                         LUCENE-3079 (d6)  LUCENE-3079 (d1)  LUCENE-3079 (d1+'deep/a/b')  LUCENE-2369 (d6)
          Index build time               752 s             771 s                                          347 s
          Memory required for indexing   2500 MB           2500 MB           2500 MB                      48 MB
          First facet request            3840 ms           1929 ms           1963 ms                      147,000 ms
          Best of 5 requests             2688 ms           1172 ms           1246 ms                      2673 ms
          Memory usage after faceting    435 MB            435 MB            435 MB                       294 MB

          Going to depth 1 helped a lot. I would have expected that requesting from 'deep/a/b' was even faster, but I guess I'll have to dig into the code to understand why it was not.

          It actually speeds up counting overall. If you think about it, when we encounter category ordinals, we just increment the count by 1 in the respective location in the count array. No need to ask whether this is an ordinal the user asked to count at all. Later, when we compute the top-K, we know which root ordinal the user requested to count, and its children, so it's just a matter of putting everything into a heap and returning the top-K.

          LUCENE-2369 uses exactly the same counting strategy (brute counting of everything). Not having explicit duplicates speeds this phase up and saves memory, but the numbers for d6 and d1 very clearly show that it is faster overall to skip the drill-down in the extraction phase. At least for this test (and then we're back to creating a proper test suite).

          Just for kicks, I tried guesstimating the layout of a corpus with addresses by upping the docs and lowering the paths: 100M documents, 0.84 paths/doc, max depth 4 (2.5M unique paths)

                                         LUCENE-3079 (d4)  LUCENE-3079 (d1)  LUCENE-2369 (d6)
          Index build time               28 min                              17 min
          Memory required for indexing   ? MB              ? MB              ? MB
          First facet request            13933 ms          13367 ms          46,000 ms
          Best of 5 requests             11718 ms          11036 ms          2989 ms
          Memory usage after faceting    240 MB            240 MB            475 MB
          Shai Erera added a comment -

          Committed revision 1141060 (3x). Will start the port to trunk.

          I'm also attaching a userguide we wrote which should help newcomers get up to speed w/ the package. It is not meant to be end-to-end coverage of all the functionality and API, but rather a complementary asset to the Javadocs, example code and source code itself.

          I think it will be good if we check it in with the source, e.g. under contrib/facet/docs or something, in ODT (+PDF?) format, and that it will be included in the release binaries (i.e. along with the .jar).

          What do you think?

          Robert Muir added a comment -

          Here's a hack patch; I tried to quickly move this thing to the trunk APIs... 3 out of the 250 tests fail though. I'm 99% positive I jacked something up in the TaxonomyWriter (it is the most complicated one to convert), so maybe that one should just be started over.

          But maybe some of the patch (except TaxonomyWriter; again, I think it's broken) would be useful in getting it ported to 4.x.

          Robert Muir added a comment -

          Updated patch. One of the fails was caused by my use of seekExact: I think, at least for MultiTermsEnum, if you seekExact, even if it finds your term it's not positioned correctly, so calling next() afterwards is unsafe.

          Another of the fails: calling getPayload() before hasPayload() does not work with SepCodec.

          Thanks to Mike for helping track some of these down.

          I also added to build.xml the logic to depend on the analyzers module; now all tests pass (some of the time).

          But I have at least one random fail: NOTE: reproduce with: ant test -Dtestcase=FacetsPayloadProcessorProviderTest -Dtestmethod=testTaxonomyMergeUtils -Dtests.seed=3732021887561370529:1102439953879128238

          Robert Muir added a comment -

          The previous fail is somehow a bug in MemoryCodec (the seed randomly selected it):
          ant test -Dtestcase=FacetsPayloadProcessorProviderTest -Dtests.codec=Memory

          Michael McCandless added a comment -

          the previous fail is somehow a bug in memorycodec (the seed randomly selected it)

          I just committed a fix for this; it was because .getPayload() in MemoryCodec was (incorrectly) assuming the caller did not change the .bytes of the returned BytesRef between calls.

          Shai Erera added a comment -

          Thanks guys for doing this port so quickly. The patch looks good. I suggest that we change the 'nocommit' to TODO (Facet): and commit it (under modules/). Then we can iterate on the TODOs and resolve them one by one, in followup issues. Makes sense?

          Robert, would you like to do the honors?

          Robert Muir added a comment -

          Updated patch: all tests pass.

          I changed the nocommits to TODO (Facet):'s, and added verbiage for the reason to each one.

          We also have two TODOs for two bugs (the MTE.seekExact and SepCodec hasPayload bugs) that we should fix, but currently we have workarounds in place (when we fix these bugs we can remove the workarounds).

          I'll svn move to modules, double-check things like javadoc warnings, and commit later today.

          Robert Muir added a comment -

          Committed revision 1141246.

          I think we should close this issue soon, and open followup issues?
          Maybe just start with a separate issue for the documentation guide?

          Shai Erera added a comment - edited

          I opened LUCENE-3261 and LUCENE-3262 to track userguide + benchmark issues. Will update more as we go along.

          I think we should close this issue

          +1.

          We can now say Lucene has a faceting module! Perhaps we should advertise it on the user-list?

          Great job at porting to trunk, Robert!

          Shai Erera added a comment -

          Faceting module in 3.x and trunk, tests pass, opened follow up issues. I think we can close this.

          Thanks to everyone for helping get this in so quickly!

          Bill Bell added a comment -

          Has this been integrated into Solr? Even just adding it to Solr as an option with hooks would be good. Or as an override in the lower levels?

          Michael McCandless added a comment -

          Has this been integrated into SOLR?

          Hi Bill,

          No, not yet ... the separate taxonomy index makes it tricky. Although, the new facet method based on SortedSetDocValues is being used in Solr, and there is a patch (LUCENE-4795) to add it to the faceting module, so there's a small overlap there...


            People

            • Assignee: Shai Erera
            • Reporter: Michael McCandless
            • Votes: 2
            • Watchers: 17