Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0
    • Fix Version/s: None
    • Labels:
    • Environment:

      Fast IO when huge hierarchies are used

      Description

      Hierarchical faceting with slow startup, low memory overhead and fast response. Distinguishing features compared to SOLR-64 and SOLR-792 are:

      • Multiple paths per document
      • Query-time analysis of the facet-field; no special requirements for indexing besides retaining separator characters in the terms used for faceting
      • Optional custom sorting of tag values
      • Recursive counting of references to tags at all levels of the output

      This is a shell around LUCENE-2369, making it work with the Solr API. The underlying principle is to reference terms by their ordinals and create an index-wide document-to-tag map, augmented with a compressed representation of the hierarchical levels.
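
As a rough illustration of the principle described above, the following Python sketch assigns each distinct term an ordinal and counts tag references for a set of hit documents through a document-to-ordinals map. It is a toy model under stated assumptions, not the actual LUCENE-2369 data structures, which are compressed and considerably more elaborate. All function names and the sample data are hypothetical.

```python
def build_ordinal_map(doc_paths):
    """Give every distinct term an ordinal (its position in sorted order)
    and map each document id to the ordinals of its terms."""
    terms = sorted({path for paths in doc_paths.values() for path in paths})
    ordinal_of = {term: i for i, term in enumerate(terms)}
    doc_to_ords = {doc: [ordinal_of[p] for p in paths]
                   for doc, paths in doc_paths.items()}
    return terms, doc_to_ords


def count_tags(terms, doc_to_ords, hits):
    """Count references per ordinal for the hit documents, then resolve
    ordinals back to terms only for the non-zero entries."""
    counts = [0] * len(terms)
    for doc in hits:
        for ord_ in doc_to_ords.get(doc, ()):
            counts[ord_] += 1
    return {terms[i]: c for i, c in enumerate(counts) if c}


# Hypothetical sample index: document 1 has two paths, document 2 has one.
docs = {1: ["A/B", "A/C"], 2: ["A/B"]}
terms, doc_to_ords = build_ordinal_map(docs)
print(count_tags(terms, doc_to_ords, hits=[1, 2]))
```

Counting on integer ordinals means the term strings are only touched when the final result is rendered, which is the general idea behind referencing terms by ordinal.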

      1. SOLR-2412.patch
        1.31 MB
        Toke Eskildsen
      2. SOLR-2412.patch
        1.27 MB
        Toke Eskildsen
      3. SOLR-2412.patch
        1.15 MB
        Toke Eskildsen
      4. SOLR-2412.patch
        1.01 MB
        Toke Eskildsen
      5. SOLR-2412.patch
        993 kB
        Toke Eskildsen
      6. SOLR-2412.patch
        993 kB
        Toke Eskildsen
      7. SOLR-2412.patch
        617 kB
        Toke Eskildsen

        Activity

        Toke Eskildsen created issue -
        Toke Eskildsen added a comment -

        Alpha-level patch (aka Proof Of Concept). Works with trunk@1066767

        Test by doing

        svn co http://svn.apache.org/repos/asf/lucene/dev/trunk@1066767 solr-2412
        cd solr-2412
        patch -p0 < SOLR-2412.patch
        

        and follow the further instructions in solr/contrib/exposed/README.txt

        Toke Eskildsen made changes -
        Field Original Value New Value
        Attachment SOLR-2412.patch [ 12473113 ]
        Lance Norskog added a comment -

        This is very nice. Great work!

        Toke Eskildsen added a comment -

        The calling syntax is kept close to SOLR-64 and SOLR-792. The essential parameters are qt=exprh&efacet=true to activate faceting and efacet.hierarchical=true&efacet.field=mypath for hierarchical faceting.

        Sorting is controlled with efacet.sort=count|index|locale. If locale is chosen, the locale is selected with efacet.sort.locale=da. The result set is limited with efacet.hierarchical.levels=99 and efacet.limit=100 to control the maximum depth and the maximum number of entries at each level.

        Example:

        http://localhost:8983/solr/select/?q=*:*&rows=0&fl=id&indent=on&qt=exprh&efacet=true&efacet.field=path_ss&efacet.hierarchical=true&efacet.hierarchical.levels=99&efacet.limit=10
        
        <?xml version="1.0" encoding="UTF-8"?>
        <response>
        
        <lst name="responseHeader">
          <int name="status">0</int>
          <int name="QTime">204</int>
        </lst>
        <result name="response" numFound="1000000" start="0">
          <doc>
            <str name="id">1</str>
          </doc>
        </result>
        <lst name="efacet_counts">
          <lst name="efacet_fields">
            <lst name="path_ss">
              <str name="field">path_ss</str>
              <lst name="paths">
                <long name="recursivecount">1000000</long>
                <long name="potentialtags">1000000</long>
                <long name="totaltags">101</long>
                <long name="count">101</long>
                <int name="level">0</int>
                <lst name="sub">
                  <lst name="L0_T1">
                    <int name="count">1</int>
                    <lst name="sub">
                      <long name="recursivecount">9901</long>
                      <long name="potentialtags">9901</long>
                      <long name="totaltags">103</long>
                      <long name="count">103</long>
                      <int name="level">1</int>
                      <lst name="sub">
                        <lst name="L1_T1">
                          <int name="count">1</int>
                          <lst name="sub">
                            <long name="recursivecount">97</long>
                            <long name="potentialtags">97</long>
                            <long name="totaltags">97</long>
                            <long name="count">97</long>
                            <int name="level">2</int>
                            <lst name="sub">
                              <lst name="L2_T1">
                                <int name="count">1</int>
                              </lst>
        ...
        

        I'm currently doing some performance (memory and speed) comparisons of SOLR-64, SOLR-792 and SOLR-2412, which will be added later.
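
For readers who prefer to assemble the request programmatically, here is a minimal Python sketch that builds a URL from the parameters named in this comment. The helper name `efacet_url` and its defaults are assumptions made for illustration; only the parameter names and the base URL come from the example above.

```python
from urllib.parse import urlencode


def efacet_url(base, field, levels=99, limit=10, sort=None, locale=None):
    """Build a hierarchical-faceting request URL (hypothetical helper)."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),
        ("qt", "exprh"),                 # request handler from the patch
        ("efacet", "true"),              # activate faceting
        ("efacet.field", field),
        ("efacet.hierarchical", "true"),
        ("efacet.hierarchical.levels", str(levels)),  # maximum depth
        ("efacet.limit", str(limit)),    # maximum entries per level
    ]
    if sort:                             # count, index or locale
        params.append(("efacet.sort", sort))
    if sort == "locale" and locale:
        params.append(("efacet.sort.locale", locale))
    return base + "?" + urlencode(params)


url = efacet_url("http://localhost:8983/solr/select/", "path_ss",
                 sort="locale", locale="da")
print(url)
```

`urlencode` percent-encodes the query value `*:*`, which Solr decodes again on receipt.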

        Toke Eskildsen added a comment -

        LUCENE-2369 is updated to trunk@1145556 (2011-07-13). It contains the Solr patch too. Apply the patch using the instructions at LUCENE-2369, then follow the instructions in solr/contrib/exposed/README.txt

        Toke Eskildsen added a comment -

        Updated patch to work with Solr 4 Beta. Apply it to a checkout from
        https://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_0_0_BETA/ and follow the README.txt in solr/contrib/exposed/

        The patch does not yet work with later Solr 4 versions, due to API changes.

        Toke Eskildsen made changes -
        Attachment SOLR-2412.patch [ 12548704 ]
        Toke Eskildsen made changes -
        Affects Version/s 4.0 [ 12322551 ]
        Affects Version/s 4.0-ALPHA [ 12314992 ]
        Toke Eskildsen added a comment -

        Updated patch to work with Solr 4. Apply it to a checkout from
        https://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_0_0/ and follow the README.txt in solr/contrib/exposed/

        Toke Eskildsen made changes -
        Attachment SOLR-2412.patch [ 12555771 ]
        Toke Eskildsen added a comment -

        Forgot to add ant build files to last patch. Patch updated & tested.

        Toke Eskildsen made changes -
        Attachment SOLR-2412.patch [ 12556088 ]
        Mark Miller added a comment -

        Great stuff! I wish I weren't so backlogged and that this weren't outside my area of expertise; I'd love to help get it in.

        Toke Eskildsen added a comment -

        Improved base startup speed and added experimental optional high-speed high-mem structure builder. Patch only tested with Solr 4.0.

        Toke Eskildsen made changes -
        Attachment SOLR-2412.patch [ 12576555 ]
        Toke Eskildsen added a comment -

        Updated patch to Solr 4.5.1.

        Toke Eskildsen made changes -
        Attachment SOLR-2412.patch [ 12610805 ]
        J.L. Hill added a comment -

        ant run-example fails for me using solr-4.5.1-src.tgz patched with 29/Oct/13 SOLR-2412.patch
        It fails with:
        /usr/local/src/solr/solr-4.5.1/solr/build.xml:373: The following error occurred while executing this line:
        /usr/local/src/solr/solr-4.5.1/solr/common-build.xml:425: The following error occurred while executing this line:
        Target "jar-exposed" does not exist in the project "solr-exposed". It is used from target "module-jars-to-solr".

        The error is perhaps mine, but the test instructions seemed rather simple. The patch applied with no warnings.

        If I have made an error in posting here, my apologies; this is my first post.

        Toke Eskildsen added a comment -

        Patch updated to Solr 4.6.1 and verified (patching, executing 'ant run-example', running the sample script, indexing the output and inspecting the result in a browser) on a clean SVN checkout.

        The old patch did not have properly updated build scripts. My apologies to J.L. Hill and others who might have tried applying it.

        Toke Eskildsen made changes -
        Attachment SOLR-2412.patch [ 12638053 ]
        J.L. Hill added a comment -

        Thank you - that worked.
        I appreciate the effort. Now I just have to try and understand/test it.

        Toke Eskildsen added a comment -

        If the README in solr/contrib/exposed/ does not help, I will be happy to answer any questions and try to explain it better.

        J.L. Hill made changes -
        Comment [ Any suggestions on parsing the output?
        The hierarchical faceting seems to be working as described, but in my test 3-level hierarchy, converting the query xml to a php array, it comes out being 13-levels deep. Parsing is complicated by the text strings of facet field being keys in the array. I have built a basic recursive function to parse the output, but after three days of trying, I am thinking there must be a better way to my desired result, which would be like a common html <ul> tree display. For a failed example of what I am trying:

        Any suggestions appreciated.
        ]
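
One possible way to flatten the deeply nested output is sketched below in Python. It assumes the response alternates between a "level" list (statistics plus a `sub` list of tags) and per-tag lists (a `count` plus an optional `sub` holding the next level), as the truncated sample response earlier in this thread suggests. The full response may differ, and a tag literally named `sub` would confuse this sketch. The function names and the `SAMPLE` document are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, shortened response fragment modelled on the sample above.
SAMPLE = """
<lst name="paths">
  <long name="count">1</long>
  <int name="level">0</int>
  <lst name="sub">
    <lst name="L0_T1">
      <int name="count">1</int>
      <lst name="sub">
        <long name="count">1</long>
        <int name="level">1</int>
        <lst name="sub">
          <lst name="L1_T1">
            <int name="count">1</int>
          </lst>
        </lst>
      </lst>
    </lst>
  </lst>
</lst>
"""


def parse_level(level_el):
    """Turn one level element into a list of {name, count, children} dicts."""
    tag_list = level_el.find("lst[@name='sub']")
    if tag_list is None:
        return []
    tags = []
    for tag_el in tag_list.findall("lst"):
        count_el = tag_el.find("*[@name='count']")
        child_level = tag_el.find("lst[@name='sub']")
        tags.append({
            "name": tag_el.get("name"),
            "count": int(count_el.text) if count_el is not None else 0,
            "children": parse_level(child_level) if child_level is not None else [],
        })
    return tags


def to_ul(tags):
    """Render the simplified tree as nested HTML <ul> lists."""
    if not tags:
        return ""
    items = "".join(
        f"<li>{escape(t['name'])} ({t['count']}){to_ul(t['children'])}</li>"
        for t in tags
    )
    return f"<ul>{items}</ul>"


tree = parse_level(ET.fromstring(SAMPLE))
print(to_ul(tree))
```

With a full Solr response, the starting element could be located with `root.find(".//lst[@name='paths']")` before calling `parse_level`. Collapsing each level/tag pair into one node is what brings the depth back down to the depth of the hierarchy itself.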
        J.L. Hill added a comment - - edited

        After spending the past few days on this, I am a bit stuck on how to limit the facets returned to a sublevel of the facets. For example, from the example above, only returning the facet L1_T1 and those below it. With normal faceting, I think it would be done via facet.prefix=L0_T1/L1_T1.
        I tried facet.prefix and efacet.prefix.
        Additionally, am I correct that the number of documents matching a facet field is supposed to be in the "count" field/key (in standard faceting, it is with the facet field name)? The count seems not to match, but I am still testing.
        Thanks in advance.

        Toke Eskildsen added a comment -

        SOLR-2412 does not have feature parity with standard Solr faceting. prefix is one of the things missing, so you will have to emulate it with a standard filter for path_ss:L0_T1/L1_T1. This will of course affect your overall search result, which might be undesirable.

        The count is not for the number of documents, but for the number of tags at the given level (thinking about it, this seems to be a problem). If a document has "foo/bar" and "foo/baz", and the search only hits that document, the count for level 0 will be 2. If there is only a single path per document, the count should match the number of documents.
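
The difference between the two kinds of count can be made concrete with a small Python sketch. It is one reading of the explanation above, not code from the patch: `refs` counts every path that passes through a tag, while `docs` counts each document at most once per tag. The function name and the "/" separator default are assumptions.

```python
from collections import Counter


def reference_counts(hits, level=0, sep="/"):
    """hits: one list of paths per matching document.
    Returns (tag references, distinct documents) per tag at the given level."""
    refs, docs = Counter(), Counter()
    for paths in hits:
        tags = [parts[level] for parts in (p.split(sep) for p in paths)
                if len(parts) > level]
        refs.update(tags)        # one increment per path
        docs.update(set(tags))   # one increment per document
    return refs, docs


# A single document carrying both "foo/bar" and "foo/baz".
refs, docs = reference_counts([["foo/bar", "foo/baz"]])
print(refs["foo"], docs["foo"])
```

For the single document above, the reference count for `foo` is 2 while the document count is 1. With one path per document, the two numbers coincide.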

        J.L. Hill added a comment -

        Thank you for your reply. I have been testing by just filtering out the
        unwanted facets through php before outputting the html; not elegant but
        functional.
        The count should not be a real issue; I just wanted to verify. I have the
        same issue with my current hierarchy system using mysql in production.
        I have found no other issues with SOLR-2412 after a week of testing. I will
        do more testing when I have time, and then probably put it in production.
        Thanks again.
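
The client-side filtering mentioned here was done in PHP. A comparable approach in Python might look like the sketch below, which walks an already simplified facet tree down to the requested path and returns only that node and its descendants. The nested-dict shape (`name`, `count`, `children`) and the function name are assumptions made for illustration.

```python
def select_subtree(tags, path, sep="/"):
    """Return the node addressed by path (e.g. "L0_T1/L1_T1"), or None
    if any path element is missing from the tree."""
    node, candidates = None, tags
    for name in path.split(sep):
        node = next((t for t in candidates if t["name"] == name), None)
        if node is None:
            return None
        candidates = node["children"]
    return node


# Hypothetical simplified facet tree.
tree = [
    {"name": "L0_T1", "count": 3, "children": [
        {"name": "L1_T1", "count": 2, "children": [
            {"name": "L2_T1", "count": 1, "children": []},
        ]},
        {"name": "L1_T2", "count": 1, "children": []},
    ]},
]

subtree = select_subtree(tree, "L0_T1/L1_T1")
print(subtree["name"], [c["name"] for c in subtree["children"]])
```

Unlike a filter query, this leaves the overall search result untouched, at the cost of transferring the full facet tree.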

        SMS Chauhan added a comment -

        This is pretty useful. Do we have a time frame in which this would eventually be available in a stable release?

        Toke Eskildsen added a comment -

        Frankly, I am not sure it ever will. SOLR-2412 is huge and it is a completely separate facet implementation, of which Solr already has too many. We are not currently using it at my organization as we don't have the need for hierarchical faceting and since SOLR-5894 gives us a similar speed-boost when using multiple facets.

        I hope to add the hierarchical capabilities as an overlay to the existing Solr facet code at some point, but I really cannot say when or if that will work out.

        Sorry about that and apologies for taking so long to come to that realization.


          People

          • Assignee:
            Unassigned
            Reporter:
            Toke Eskildsen
          • Votes:
            4
            Watchers:
            16
