Lucene - Core / LUCENE-1812

Static index pruning by in-document term frequency (Carmel pruning)

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      This module provides tools to produce a subset of input indexes by removing the postings of terms whose in-document frequency is below a specified threshold. The net effect of this processing is a much smaller index that, for common types of queries, returns nearly identical top-N results compared with the original index, but with increased performance.

      Optionally, stored values and term vectors can also be removed. This functionality is largely independent, so it can be used without term pruning (when the term frequency threshold is set to 1).

      As the threshold value increases, the total size of the index decreases, search performance increases, and recall decreases (i.e. search quality deteriorates). NOTE: phrase recall in particular deteriorates significantly at higher threshold values.
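
      As a rough illustration of the threshold rule described above, the following minimal sketch prunes an in-memory postings list for a single term. It is not the code from the attached patch: the class name TfPruningSketch, the method prune and the map-based postings representation are assumptions made only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; this is not code from the attached patch. */
final class TfPruningSketch {

    /**
     * Returns the postings (docId -> in-document term frequency) that survive
     * pruning: entries whose frequency is below the threshold are dropped.
     */
    static Map<Integer,Integer> prune(Map<Integer,Integer> postings, int threshold) {
        Map<Integer,Integer> kept = new LinkedHashMap<Integer,Integer>();
        for (Map.Entry<Integer,Integer> posting : postings.entrySet()) {
            if (posting.getValue() >= threshold) {
                kept.put(posting.getKey(), posting.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical postings of one term: docId -> term frequency.
        Map<Integer,Integer> postings = new LinkedHashMap<Integer,Integer>();
        postings.put(0, 1);
        postings.put(1, 4);
        postings.put(2, 2);
        postings.put(3, 1);
        for (int threshold = 1; threshold <= 3; threshold++) {
            System.out.println("threshold " + threshold + " keeps "
                + prune(postings, threshold).size() + " of " + postings.size() + " postings");
        }
    }
}
```

      With a threshold of 1 nothing is removed, which matches the remark that term pruning is effectively disabled at that value; higher thresholds keep progressively fewer postings.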

      The primary purpose of this class is to produce small first-tier indexes that fit completely in RAM, and to store these indexes using IndexWriter.addIndexes(IndexReader[]). The performance of this class will usually not be sufficient to use the resulting index view for on-the-fly pruning and searching.

      NOTE: If the input index is optimized (i.e. doesn't contain deletions) then the index produced via IndexWriter.addIndexes(IndexReader[]) will preserve internal document IDs, so that they stay in sync with the original index. This means that all other auxiliary information not necessary for first-tier processing, such as some stored fields, can also be removed and quickly retrieved on demand from the original index using the same internal document ID.

      Threshold values can be specified globally (for terms in all fields) using the defaultThreshold parameter, and can be overridden using per-field or per-term values supplied in a thresholds map. Keys in this map are either field names, or terms in field:text format. The precedence of these values is the following: first a per-term threshold is used if present, then a per-field threshold if present, and finally the default threshold.
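
      The precedence rule above can be sketched as a simple map lookup. This is only an illustration under stated assumptions: the class ThresholdLookup and the method thresholdFor are hypothetical names, and integer thresholds are assumed; the patch itself may organize this differently.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; names and types here are assumptions. */
final class ThresholdLookup {
    private final int defaultThreshold;
    private final Map<String,Integer> thresholds;

    ThresholdLookup(int defaultThreshold, Map<String,Integer> thresholds) {
        this.defaultThreshold = defaultThreshold;
        this.thresholds = thresholds;
    }

    /** Per-term value first, then per-field value, then the default. */
    int thresholdFor(String field, String text) {
        Integer perTerm = thresholds.get(field + ":" + text); // key in field:text format
        if (perTerm != null) {
            return perTerm;
        }
        Integer perField = thresholds.get(field); // key is the bare field name
        if (perField != null) {
            return perField;
        }
        return defaultThreshold;
    }

    public static void main(String[] args) {
        Map<String,Integer> map = new HashMap<String,Integer>();
        map.put("body", 3);        // per-field threshold
        map.put("body:lucene", 1); // per-term threshold
        ThresholdLookup lookup = new ThresholdLookup(2, map);
        System.out.println(lookup.thresholdFor("body", "lucene")); // per-term entry applies
        System.out.println(lookup.thresholdFor("body", "index"));  // per-field entry applies
        System.out.println(lookup.thresholdFor("title", "index")); // default applies
    }
}
```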

      A command-line tool (PruningTool) is provided for convenience. At the moment it does not support all the functionality available through the API.

      1. pruning.patch
        92 kB
        Doron Cohen
      2. pruning.patch
        89 kB
        Doron Cohen
      3. pruning.patch
        80 kB
        Doron Cohen
      4. pruning.patch
        59 kB
        Andrzej Bialecki
      5. pruning.patch
        54 kB
        Andrzej Bialecki
      6. pruning.patch
        30 kB
        Andrzej Bialecki

          Activity

          Robert Muir added a comment -

          Marking this resolved: I created LUCENE-3917 for the 4.x port for JIRA organization purposes.

          Robert Muir added a comment -

          This is really still open I think for the 4.x port.

          To eliminate confusion: I'll mark this resolved and create a 4.0 issue to port pruning to trunk APIs.

          Doron Cohen added a comment -

          While merging to trunk I noticed that the IDEA settings for modules/queries and modules/queryparser refer to lucene/contrib instead of modules. Seems trivial to fix, but I have no IDEA installed at the moment so there is no way to verify. Created LUCENE-3737 to handle that later.

          Doron Cohen added a comment -

          "Excellent, thanks for seeing this through!"

          Yeah, after only more than a year's delay.

          BTW in trunk it will be under modules.

          Andrzej Bialecki added a comment -

          Excellent, thanks for seeing this through!

          Doron Cohen added a comment -

          That dead code was removed and some javadocs added.
          Still room for more javadocs - e.g. the static tool - and better test coverage.
          Committed to 3x: r1237937.

          Doron Cohen added a comment -

          Updated patch: package.html and all pruning classes moved to another package, except for PruningReader. Now ant javadocs-all passes as well. There are 3 TODOs:

          1. implement CarmelTermPruningDeltaTopPolicy
          2. dead code question in CarmelUniformTermPruningPolicy
          3. missing details in package.html

          The first one can wait but the other two I would like to handle before committing.

          Doron Cohen added a comment -

          I ran 'javadocs' under 3x/lucene/contrib/pruning and 'javadocs-all' under 3x/lucene.

          The latter failed due to multiple package.html under o.a.l.index - in core and under contrib/pruning.

          Entirely renaming the package to o.a.l.pruning.index won't work because PruningReader accesses package protected SegmentTermVector.

          I can move the other classes to that new package and keep only PruningReader in that "index friend" package. (Unless there are javadoc/ant tricks that will avoid this error and still generate valid javadocs in both cases).

          Doron Cohen added a comment -

          "I didn't test them, but I will once they have been committed."

          Great, thanks!

          Steve Rowe added a comment -

          Hi Doron,

          "I modified for Idea and maven by following templates for other contrib components but have no way to test this and would appreciate a review of this."

          I looked at these configurations and they should be functional. I didn't test them, but I will once they have been committed.

          Doron Cohen added a comment -

          I now see that all other contrib components have svn:ignore for *.iml and pom.xml - I'll add that for pruning as well (though it is not in the attached patch).

          Doron Cohen added a comment -

          Getting to this, at last.

          I did not handle the TODOs above; I'd rather commit so that they can be handled separately later ("progress not perfection" as Mike says).

          Changes in this patch:

          • PruningReader now also overrides getSequentialSubReaders(), otherwise no pruning takes place on sub-readers (and tests fail).
          • StorePruningPolicy fixed to use FieldInfos API.

          I modified for Idea and maven by following templates for other contrib components but have no way to test this and would appreciate a review of this.

          Doron Cohen added a comment -

          Updated patch for current 3x.

          luo added a comment -

          Where can I download the pruning code? I can't find it in
          https://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x

          thanks

          Robert Muir added a comment -

          bulk move 3.2 -> 3.3

          Andrzej Bialecki added a comment -

          Doron, feel free to work on this - I won't be able to do any work on this in January.

          Doron Cohen added a comment -

          I plan to work on this for 3.x and trunk. I did not notice that it was marked for 2.9, but I can look at that as well.
          Andrzej, are you working on this or can I take it?
          As for writing in 2.9 format, I think it is better to backport to 2.9 rather than have the 3.x (or trunk) version write in 2.9 format.
          Doron

          Kaktu Chakarabati added a comment -

          Hey,
          While trying to use this (wonderful!) component I noticed a few things that might require some work:

          1. The issue says this affects Lucene 2.9 as well, however the code seems to be hard-coded for 3.0 (it uses the LUCENE_30 constant, as well as some new APIs such as IndexWriterConfig).
          I created a patch that makes it work with 2.9.3 (so I can use it with a Solr 1.4.1 deployment), and I can post it if that seems useful, but I suspect we might want to come up with a more generic solution as well as a clear definition of supported versions. Personally I think it would be very useful to have a backport for 2.9.x so that users of the current stable Solr release (1.4.x) can use it.

          2. The code does not compile with trunk (Lucene/Solr 4.0). Is this a known issue? Something we wish to solve?

          3. When using it with the 3.0 branch, it does indeed work. However, when it reads an older version of the index and emits a newer one (e.g. reads 2.9.x, writes 3.x), it renders the pruned index unusable by some platforms (e.g. Solr 1.4.x as mentioned above). Is this something that can be fixed, i.e. by forcing the output index to be the same version as the input one? I was going to do some work on this myself, but the issue seems a little more delicate and requires a deeper understanding of Lucene innards than I have.

          -Chak

          Andrzej Bialecki added a comment -

          "Renamed some of TermPruningPolicy methods for better readability - well, to me"

          I agree, this looks clearer now.

          "Bug in CarmelTermPruningPolicy in initTermPositions() - it sorts by docid before selecting the top subset, but in fact this seems dead code? Added a "TODO deadcode" there, maybe I am missing something."

          Well spotted, indeed this section was not needed.

          "Simplified the if statements in PruningReader.PruningTermEnum.next() - hopefully not missing something there..."

          Looks good to me. Thanks!

          Okay, if there are no further comments I'd like to commit this soon.

          Doron Cohen added a comment -

          Great!

          Andrzej Bialecki added a comment -

          ASF legal thinks this is sufficient, so fortunately a software grant is not needed and from a legal point of view we can commit it. Yay!

          Andrzej Bialecki added a comment -

          Doron, thank you very much for pushing forward this issue! I think your patch looks good, I'm still reviewing it in the light of 3.1 APIs. It's great that you added a new policy and test cases - this looks solid now.

          In the meantime however I still doubt if the JIRA checkbox is a sufficient counterweight to a possibility of a patent infringement suit against users of Lucene... I think in cases like this, where there is a known existing patent that this implementation uses, the ASF requires an explicit software grant to be made (http://www.apache.org/licenses/software-grant.txt) which would protect Lucene users from infringing on IBM's IP. I'll forward this to legal@apache.org to see what they say about it - if you can obtain such a grant without too much trouble then I'm sure we could then close this issue.

          Doron Cohen added a comment -

          The pruning framework is pretty cool - it is quite easy to add a new pruning policy!

          Initially I planned to focus on CarmelPruningPolicy and add the more sophisticated algorithm (topK) described in the paper, but eventually found myself making more changes. Andrzej, I hope you like the changes, such as the methods I renamed; otherwise please feel free to rename them back.

          Patch Details

          • Documentation changes - mainly moved things to where I thought they belong, e.g. moved the general discussion that applies to any term pruning implementation from CarmelPruning to TFPruning.
          • Renamed some of the TermPruningPolicy methods for better readability - well, to me :) - hope you agree with the new names, otherwise please feel free to change them back.
          • Renamed CarmelTermPruningPolicy to CarmelTermPruningUniformPolicy - quite a long name... but descriptive, as this is an enhanced form of the "uniform" case from the paper. Modified the documentation accordingly.
          • Added CarmelTermPruningTopKPolicy - this is the more sophisticated/strong form of pruning described in the paper. (Test case added.)
          • Fixed some compiler warnings (1.5, Lucene.Version..)
          • Bug (?) in CarmelTermPruningPolicy in initTermPositions() - it sorts by docid before selecting the top subset, but in fact this seems dead code? Added a "TODO deadcode" there, maybe I am missing something.
          • Enabled topK pruning through the PruningTool program (untested)
          • Simplified the if statements in PruningReader.PruningTermEnum.next() - hopefully not missing something there...

          There's more to do though not sure when I'll have the cycles..:

          • Quality/performance test for the topK pruning algorithm - using LAtimes or some other judged collection.
            Or perhaps Robert can try it on that Persian test collection.
          • Add also the "Delta Pruning" policy as described in the paper
          • Junit for CarmelTermPruningUniformPolicy

          Doron

          Doron Cohen added a comment -

          Hi Andrzej, I would have asked the same question. Fortunately this is cleared by the patch I am attaching next (with that little check box in the Attach-Files dialog).

          Andrzej Bialecki added a comment -

          That's great news, thanks! However, now you got me thinking ... considering there is a legal aspect to the matter, do we (the Apache Lucene project) need something more substantial from IBM (e.g. a statement from your IP dept.) than just your "go ahead" in a JIRA comment?

          Doron Cohen added a comment -

          Hi Andrzej, chances seem pretty good. We were thinking about further developing the index pruning implementation but didn't get to it; we hope to later this year. If you'd rather not wait for that, please go ahead with the current implementation. Thanks, Doron.

          Andrzej Bialecki added a comment -

          Doron, were you able to check on the patent situation? If there's a chance of solving this in a positive way, how long do you think this could take?

          Andrzej Bialecki added a comment -

          Thank you - yes, I think we will need to wait with this. I wasn't aware of the patent when I implemented this patch, and now after reading it I have the impression that it covers the exact algorithm described in the original paper, so perhaps we should be in the clear if we focus on other methods or modified versions of it, but of course IANAL.

          Doron Cohen added a comment -

          Hi Andrzej, Robert, please note that IBM holds a patent on Lossy index compression.
          I am checking with the IP department at IBM about committing an implementation of the patent in Lucene, and will update here as soon as I know where it stands - could you hold committing this until then?

          Andrzej Bialecki added a comment -

          I'm fine with reorganizing it - I originally put this into contrib/pruning to avoid polluting contrib/misc. If we end up putting this stuff in contrib/index together with other tools then perhaps we should create sub-packages for related functionality, otherwise it would look messy.

          Robert Muir added a comment -

          Hi Andrzej, thanks for updating the patch.

          I am curious about package organization here, do you anticipate adding some additional pruning functionality in the future that would be different than an index modification tool?

          I only ask, because looking at reorganizing our contrib area (LUCENE-2323), I've often thought that perhaps we need a "contrib/index" for all the index-related tools, instead of having various ones in "miscellaneous", and I wonder what your opinions are on that.

          In any event we could always reorganize this after this issue is resolved if that's the best thing to do, and it could temporarily be contrib/pruning; it's just svn moves.

          Andrzej Bialecki added a comment -

          Updated patch relative to branch_3x.

          Andrzej Bialecki added a comment -

          I'll prepare a new patch - the reason for these deficiencies is that I worked against trunk just before the generics patches were applied

          Uwe Schindler added a comment -

          Code seems to be Java 1.5, which is good, but I am wondering about some @SuppressWarnings, e.g. in getFieldNames(). The original overridden method returns Collection<String>; if you change that to return the correct type it doesn't need SuppressWarnings. There are more places. Also, if you use Collections.<Type>emptyMap() and so on, it is also type safe.

          Also, we use no space after the comma in generic type parameters.

          But I like the patch, nice work!
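
          To illustrate the generics remark in the comment above, here is a small sketch of how an explicit type argument on Collections.emptyMap() avoids an unchecked warning. The class TypedEmptyMapSketch and its methods are hypothetical names for this example only and do not come from the patch.

```java
import java.util.Collections;
import java.util.Map;

/** Illustrative sketch only; class and method names are assumptions. */
final class TypedEmptyMapSketch {

    /** Sums all threshold values; stands in for any method taking a typed map. */
    static int sum(Map<String,Integer> thresholds) {
        int total = 0;
        for (Integer value : thresholds.values()) {
            total += value;
        }
        return total;
    }

    /** Typed empty map: no raw types, so no @SuppressWarnings is needed. */
    static Map<String,Integer> noThresholds() {
        return Collections.<String,Integer>emptyMap();
    }

    public static void main(String[] args) {
        // The explicit type argument lets the call type-check as a method
        // argument even on compilers without target-type inference.
        // Passing the raw Collections.EMPTY_MAP here would instead
        // produce an unchecked warning.
        System.out.println(sum(Collections.<String,Integer>emptyMap()));
        System.out.println(noThresholds().isEmpty());
    }
}
```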

          Robert Muir added a comment -

          "Default threshold of what?"

          What was confusing me is that the console output always says "deleted: 0" for -impl carmel
          For -impl tf, the console output is correct.

          But looking at the resulting index (which I should have done earlier, sorry), I can see that -impl carmel does work.

          Andrzej Bialecki added a comment -

          Default threshold of what? When using the Carmel method, the threshold value should be between 0.0 and 1.0, where 1.0 means no pruning, i.e. 100% of docs are retained. I'm sorry for the confusion - the documentation should be clearer on this point.

          Robert Muir added a comment -

          Andrzej, are you still working on the Carmel policy?
          I see -conf isn't yet implemented, and I can't seem to get it to prune anything with just a default threshold... guessing it's still a work in progress?

          Andrzej Bialecki added a comment -

          Nice job, Robert - thanks! BTW, your results show an effect that was reported in the papers on this subject, namely that some metrics may actually improve, like MRR and P@10 above.

          Robert Muir added a comment -

          Andrzej, I tested your patch. I found two places where @Override was on an interface method; that is the only problem so far.

          Here are some results on the Hamshahri Persian test collection (I used the TF method with -t 2):

          Measure      Unpruned   Pruned
          index size   98627KB    42339KB
          map          0.4809     0.4241
          recip_rank   0.8368     0.8393
          P5           0.6277     0.6369
          P10          0.5677     0.5785
          P15          0.5436     0.5231
          P20          0.5185     0.4969
          P30          0.4703     0.4385
          P100         0.2782     0.2440

          The queries in this corpus are somewhat general, but this seems to be a nice way to reduce the index to less than half its size, still with reasonable quality.
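For readers unfamiliar with the measures in the table, here is a rough sketch of how precision at k (the P5, P10, ... rows) and reciprocal rank (the recip_rank row) are defined for a single query. This is not the evaluation tool used to produce the numbers above; the class name, method names, and document ids are all hypothetical. The reported figures would be averages of such per-query values over the whole query set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only.
class RankMetrics {
    /** Fraction of the top-k ranked results that are relevant (P@k). */
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        int hits = 0;
        int limit = Math.min(k, ranked.size());
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(ranked.get(i))) hits++;
        }
        // Dividing by k (not by limit) penalizes result lists shorter than k.
        return (double) hits / k;
    }

    /** 1/rank of the first relevant result, or 0 if none was retrieved. */
    static double reciprocalRank(List<String> ranked, Set<String> relevant) {
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) return 1.0 / (i + 1);
        }
        return 0.0;
    }

    public static void main(String[] args) {
        // Made-up ranking and relevance judgments for one query.
        List<String> ranked = Arrays.asList("d3", "d1", "d7", "d2", "d9");
        Set<String> relevant = new HashSet<String>(Arrays.asList("d1", "d2", "d4"));
        System.out.println(precisionAtK(ranked, relevant, 5));
        System.out.println(reciprocalRank(ranked, relevant));
    }
}
```

Because both measures look only at the top of the ranking, they can stay flat or even rise slightly after pruning, while measures that depend on deeper recall (map, P100) drop.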

          Andrzej Bialecki added a comment -

          There have been problems with PDF uploads since the recent Wiki upgrade ... I'll keep trying until it gets through in one piece. Sorry ...

          Steve Rowe added a comment - - edited

          Andrzej, when I try to look at the PDF you posted on the StaticIndexPruning wiki page, Adobe Acrobat gives me the following error:

          Cannot extract the embedded font 'CAAAA+ArialMT'. Some characters may not display or print correctly.

          and the text is illegible - everything except the page titles looks like a series of dots.

          Andrzej Bialecki added a comment -

          Updated patch against trunk/. This patch is a major refactoring that opens the way for other implementations of stored fields and postings pruning. Two policies are included in this patch: the original Carmel method and a simple TF-based threshold method.

          Andrzej Bialecki added a comment -

          Patch relative to the current trunk.


            People

            • Assignee:
              Doron Cohen
              Reporter:
              Andrzej Bialecki
            • Votes:
              4
              Watchers:
              8
