Details

    • Lucene Fields:
      New

      Description

      With LUCENE-3174 done, we can finally work on implementing the standard ranking models. Currently DFR, BM25 and LM are on the menu.

      Done:

      • EasyStats: contains all statistics that might be relevant for a ranking algorithm
      • EasySimilarity: the ancestor of all the other similarities. Hides the DocScorers and as much implementation detail as possible
      • BM25: the current "mock" implementation might be OK
      • LM
      • DFR
      • The so-called Information-Based Models
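Of the models listed, BM25 is the most widely known; as a reference point, the textbook scoring function (Robertson et al.) looks roughly like this. This is an illustrative sketch with the usual free parameters k1 and b, not necessarily the exact code in the patches below:

```java
public class Bm25Sketch {
  // Textbook BM25: idf weight times a saturating, length-normalized tf.
  static double bm25(double tf, double docLen, double avgDocLen,
                     int docFreq, int numDocs, double k1, double b) {
    // idf: rare terms score higher
    double idf = Math.log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    // length normalization: b interpolates between none (0) and full (1)
    double norm = k1 * (1 - b + b * docLen / avgDocLen);
    // tf saturation: score grows sublinearly with term frequency
    return idf * tf * (k1 + 1) / (tf + norm);
  }
}
```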
      Attachments

      1. LUCENE-3220.patch
        8 kB
        David Mark Nemeskey
      2. LUCENE-3220.patch
        13 kB
        David Mark Nemeskey
      3. LUCENE-3220.patch
        12 kB
        David Mark Nemeskey
      4. LUCENE-3220.patch
        12 kB
        David Mark Nemeskey
      5. LUCENE-3220.patch
        9 kB
        David Mark Nemeskey
      6. LUCENE-3220.patch
        8 kB
        David Mark Nemeskey
      7. LUCENE-3220.patch
        6 kB
        David Mark Nemeskey
      8. LUCENE-3220.patch
        6 kB
        David Mark Nemeskey
      9. LUCENE-3220.patch
        36 kB
        David Mark Nemeskey
      10. LUCENE-3220.patch
        52 kB
        David Mark Nemeskey
      11. LUCENE-3220.patch
        52 kB
        David Mark Nemeskey
      12. LUCENE-3220.patch
        50 kB
        David Mark Nemeskey
      13. LUCENE-3220.patch
        46 kB
        David Mark Nemeskey
      14. LUCENE-3220.patch
        42 kB
        David Mark Nemeskey
      15. LUCENE-3220.patch
        42 kB
        David Mark Nemeskey
      16. LUCENE-3220.patch
        39 kB
        David Mark Nemeskey
      17. LUCENE-3220.patch
        31 kB
        David Mark Nemeskey
      18. LUCENE-3220.patch
        27 kB
        David Mark Nemeskey
      19. LUCENE-3220.patch
        27 kB
        David Mark Nemeskey
      20. LUCENE-3220.patch
        39 kB
        David Mark Nemeskey
      21. LUCENE-3220.patch
        4 kB
        David Mark Nemeskey
      22. LUCENE-3220.patch
        4 kB
        David Mark Nemeskey
      23. LUCENE-3220.patch
        4 kB
        David Mark Nemeskey
      24. LUCENE-3220.patch
        4 kB
        David Mark Nemeskey

          Activity

          Fis Ka added a comment -

          Hi All,

          pardon my ignorance, I'm new to this. I need BM25 for my current project (bachelor thesis); I'm using Lucene 3.0.2.
          Can you tell me what I need to do to add BM25 to my project? Do I get a jar, or do I need to compile everything on my own?
          Furthermore, do I need to re-index my sources in order to have BM25 working?

          best,

          fiska

          Hide
          Robert Muir added a comment -

          Thanks David! Awesome work

          Hide
          Robert Muir added a comment -

          +1, I do think we should consider naming and stuff (I sorta like SimilarityBase but we can discuss it)... but we should just open separate issues for that after we have worked out all the technical details first; it's easy to refactor naming.

          And we also want, at the same time, to move it into src/java; we can open a separate issue for all of this, "integrate new similarities" or something. Let's close this one!

          Hide
          David Mark Nemeskey added a comment -

          Robert: Since we use LUCENE-3357 for testing & bug fixing, I propose we close this issue. If we decide to implement other methods as well, we can do it under a new issue. Or do you have something else in mind (such as to rename EasySimilarity to SimilarityBase)?

          Hide
          Robert Muir added a comment -

          Thanks David: I committed this.

          Hide
          David Mark Nemeskey added a comment -

          Got rid of all but one nocommit.

          Hide
          David Mark Nemeskey added a comment -

          Added discountOverlaps to EasySimilarity.

          Hide
          David Mark Nemeskey added a comment -

          Added a short explanation on the parameter for the Jelinek-Mercer method.

          Hide
          David Mark Nemeskey added a comment -

          Done. Actually, I wanted to implement the norm table in the way you said, but somehow forgot about it.

          Two questions remain on my side:

          • the one about discountOverlaps (see above)
          • what kind of index-time boosts do people usually use? Too big a boost might cause problems if we just divide the length by it. Maybe we should take the logarithm or something like that?
          Hide
          Robert Muir added a comment -

          Thanks, I committed your latest patch, some ideas just perusing:

          • we can move the calculations currently in decodeNormValue into the static table, this way we aren't doing these per-document multiplications and divisions... so decodeNormValue just returns the document length.
          • should easysim change its score method from score(Stats stats, float freq, byte norm) to score(Stats stats, float freq, int documentLength) ? then it could encapsulate this encoding/decoding.
          • I think we should try to factor in the index-time boost in computeNorm here if we can, e.g. just divide the document length by it? So documents with a higher index-time boost have a shorter length.
          Hide
          David Mark Nemeskey added a comment -

          Removed reflection from IBSimilarity.

          Hide
          David Mark Nemeskey added a comment - - edited

          Deleted the accidentally forgotten abstract modifier from the Distribution classes.

          Hide
          David Mark Nemeskey added a comment -

          EasySimilarity now computes norms in the same way as DefaultSimilarity.

          Actually, not exactly the same way, as I have not yet added the discountOverlaps property. I think it would be a good idea for EasySimilarity as well (it is for phrases, right?) — what do you reckon?

          I also wrote a quick test to see which norm (length directly or 1/sqrt) is closer to the original value and it seems that the direct one is usually much closer (RMSE is 0.09689688608375747 vs 0.23787634482532286). Of course, I know it is much more important that the new Similarities can use existing indices.

          Hide
          Robert Muir added a comment -

          Hi David, I was thinking that for the norm, we could store it like DefaultSimilarity. This would make it especially convenient, as you could easily use these similarities with the same exact index as one using Lucene's default scoring. Also I think (not sure!) that by using 1/sqrt we will get better quantization from SmallFloat?

            public byte computeNorm(FieldInvertState state) {
              final int numTerms;
              if (discountOverlaps)
                numTerms = state.getLength() - state.getNumOverlap();
              else
                numTerms = state.getLength();
              return encodeNormValue(state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms))));
            }
          

          for computations, you have to 'undo' the sqrt() to get the quantized length, but that's OK since it's only done up-front a single time and tableized, so it won't slow anything down.
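The decode side of this idea can be sketched as follows. The byte315ToFloat here is a simplified re-implementation of Lucene's SmallFloat decoder (3 mantissa bits, zero-exponent point at 15); treat the whole class as illustrative rather than the committed code, and note it assumes an index-time boost of 1:

```java
public class NormTableSketch {
  // Simplified copy of Lucene's SmallFloat.byte315ToFloat, for illustration.
  static float byte315ToFloat(byte b) {
    if (b == 0) return 0.0f;
    int bits = (b & 0xff) << (24 - 3);
    bits += (63 - 15) << 24;
    return Float.intBitsToFloat(bits);
  }

  // Build the 256-entry table once, up front; each norm byte then decodes to a
  // quantized document length with a single array lookup, undoing the 1/sqrt
  // applied in computeNorm (no per-document multiplications or divisions).
  private static final float[] LENGTH_TABLE = new float[256];
  static {
    for (int i = 1; i < 256; i++) {
      float f = byte315ToFloat((byte) i); // decodes to boost / sqrt(length)
      LENGTH_TABLE[i] = 1.0f / (f * f);   // length, assuming boost == 1
    }
  }

  static float decodeNormValue(byte norm) {
    return LENGTH_TABLE[norm & 0xFF];
  }
}
```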

          Hide
          David Mark Nemeskey added a comment -

          Added norm decoding table to EasySimilarity, and removed sumTotalFreq. Sorry I could only upload this patch now but I didn't have time to work on Lucene the last week.

          As I see, all the problems you mentioned have been corrected, so maybe we can go on with the review?

          Hide
          Robert Muir added a comment -

          Not sure which, because for me numberOfFieldTokens seems a more descriptive name than sumTotalTermFreq

          I think I agree with you: in the context of stats for scoring this might be the way to go, as its easier to understand.

          Hide
          David Mark Nemeskey added a comment -

          I think I realized what I wanted with numberOfFieldTokens. I was afraid that sumTotalTermFreq is affected by norms / index-time boost / etc., and I wanted to make numberOfFieldTokens unaffected by those (though I don't know how); only I forgot to do so.

          But if sumTotalTermFreq is really just the number of tokens in the field, I will delete one of them. Not sure which, because for me numberOfFieldTokens seems a more descriptive name than sumTotalTermFreq, but the latter is used everywhere in Lucene. May I ask your opinion on this question?

          Hide
          Robert Muir added a comment -

          Thanks David: i committed this.

          Hide
          David Mark Nemeskey added a comment -

          Fixed two of the issues you mentioned:

          • Apache license header added to all files in the similarities package;
          • cleaned up the constructor of DFRSimilarity and added a few new ones.

          I have not yet moved the NoNormalization and NoAfterEffect classes to their own files, because I feel a bit uncomfortable about the naming, since it's different from that of the other classes, e.g. NormalizationH2 vs NoNormalization.

          Hide
          Robert Muir added a comment -

          Hi David, this is looking really good! The patch is quite large, so what I did was:

          1. re-sync flexscoring branch to trunk
          2. commit your patch as is (i did a tiny tweak for LUCENE-3299)

          I saw a couple things we should address (full review will really mean i have to take quite a bit of time for each model!)
          But we can take care of some of this easy stuff first!

          • numberOfFieldTokens seems to be the same as sumOfTotalTF? we should only have one name for this stat i think
          • i like the idea of NoAfterEffect/NoNormalization in DFR, maybe we should make these ordinary classes, and in DFR we just don't allow null for any of the components? just thought it might look cleaner.
          • some of the files in .similarities need apache license header.
          • because we don't need the norm for averaging, maybe we should use lucene's encoding? we can pre-build the decode table like TF-IDF similarity, except our decode table is basically 1 / decode(float)^2 to give us the quantized doc length. from a practical perspective, this would mean someone could use this stuff with existing lucene indexes (once they upgrade their segments to 4.0's format), and easily switch between things without reindexing.

          if you want, you can do these things on this issue or open separate issues, whichever is easiest. But I think looking at smaller patches at this point will make iteration easier!

          Hide
          David Mark Nemeskey added a comment -

          Made the score() and explain() methods in Similarity components final.

          Hide
          David Mark Nemeskey added a comment -

          Explanation added to LM models; query boost added.

          Hide
          David Mark Nemeskey added a comment -

          Added LMSimilarity so that the two LM methods have a common parent. It also defines the CollectionModel interface which computes p(w|C) in a pluggable way (and only once per query, though I am not sure this improves performance as I need a cast in score()).

          Hide
          David Mark Nemeskey added a comment -
          • Fixed #1
          • Added a totalBoost to EasySimilarity, and a getter method – no one uses it yet
          • Added basic implementations for the Jelinek-Mercer and the Dirichlet LM methods.

          As for the last one: the implementation is very basic now, I want to factor a few things out (e.g. p(w|C) to LMStats, possibly in a pluggable way so people can implement it however they want). It also doesn't seem right to have the same LM method implemented twice (both as MockLMSimilarity and here), so I'll take a look to see if I can merge those two. Finally, I am wondering whether I should implement the absolute discounting method, which, according to the paper, seems inferior to the Jelinek-Mercer and Dirichlet methods. Right now I am more on the "no" side.
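The two smoothing methods being added here are standard in the language-modeling literature; as a rough sketch of the math (not the patch's actual code), assuming p(w|C) is the collection model, i.e. totalTermFreq / numberOfFieldTokens:

```java
public class LmSketch {
  // Jelinek-Mercer: linear interpolation between the document model and the
  // collection model, with mixing weight lambda in [0, 1].
  static double jelinekMercer(double tf, double docLen, double pwc, double lambda) {
    return (1 - lambda) * (tf / docLen) + lambda * pwc;
  }

  // Dirichlet: Bayesian smoothing, adding mu pseudo-counts distributed
  // according to the collection model.
  static double dirichlet(double tf, double docLen, double pwc, double mu) {
    return (tf + mu * pwc) / (docLen + mu);
  }
}
```

Both return the smoothed p(w|d); a similarity would typically score with its logarithm, summed over query terms.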

          Hide
          Robert Muir added a comment -

          Hi David: I had some ideas on stats to simplify some of these sims:

          1. I think we can use an easier way to compute average document length: sumTotalTermFreq() / maxDoc(). This way the average is 'exact' and not skewed by index-time-boosts, smallfloat quantization, or anything like that.
          2. To support pivoted unique normalization like lnu.ltc, I think we can solve this in a similar way: add sumDocFreq(), which is just a single long, and divide this by maxDoc. this gives us avg # of unique terms. I think terrier might have a similar stat (#postings or #pointers or something)?

          So I think this could make for nice simplifications, especially for switching norms completely over to docvalues: we should be able to do #1 immediately, changing the way we compute avgdoclen for e.g. BM25 and DFR.

          then in a separate issue i could revert this norm summation stuff to make the docvalues integration simpler, and open a new issue for sumDocFreq()
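The two averages proposed above are simple ratios over per-field stats. A hypothetical holder class (the field names mirror the Lucene stats being discussed; the class itself is illustrative):

```java
public class FieldStats {
  final long sumTotalTermFreq; // total number of tokens in the field
  final long sumDocFreq;       // total number of (term, doc) postings
  final int maxDoc;            // document count, ignoring deletions

  FieldStats(long sumTotalTermFreq, long sumDocFreq, int maxDoc) {
    this.sumTotalTermFreq = sumTotalTermFreq;
    this.sumDocFreq = sumDocFreq;
    this.maxDoc = maxDoc;
  }

  // #1 above: exact average document length, unskewed by boosts/quantization.
  double avgDocLength() { return (double) sumTotalTermFreq / maxDoc; }

  // #2 above: average number of unique terms per document, for lnu.ltc-style
  // pivoted unique normalization.
  double avgUniqueTerms() { return (double) sumDocFreq / maxDoc; }
}
```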

          Hide
          David Mark Nemeskey added a comment -
          • log2() moved from DFRSimilarity to EasySimilarity,
          • changed DFRSimilarity so that its constructor does not use reflection.
          Hide
          David Mark Nemeskey added a comment -

          Fixed a few things in MockBM25Similarity.

          Hide
          David Mark Nemeskey added a comment -

          Information-based model framework due to Clinchant and Gaussier added.

          Hide
          David Mark Nemeskey added a comment -

          Explanation-handling added to EasySimilarity and DFRSimilarity.

          Hide
          David Mark Nemeskey added a comment -

          Made the signature of EasySimilarity.score() a bit saner.

          Hide
          David Mark Nemeskey added a comment -

          Implementation of the DFR framework added. Lots of nocommits, though. Things to think about:

          • lots of (float) conversions. Maybe the inner API (BasicModel, etc.) could use doubles? In my experience, double is faster anyway, at least on 64-bit architectures
          • I am not overly happy about the naming scheme, i.e. BasicModelBE, etc. Maybe we should do it the same way as in Terrier, with a basicmodel package and class names like BE?
          • A regular SimilarityProvider implementation won't play well with DFRSimilarity, in case the user wants to use several different setups. Actually, this is a problem for all similarities that have parameters (e.g. BM25 with b and k).

          Also, I think we need that NormConverter we talked earlier on irc, so that the Similarities can run on any index.

          Hide
          Robert Muir added a comment -

          Just took a look, a few things that might help:

          • yes, maxDoc does not reflect deletions, but neither do things like totalTermFreq or docFreq... so it's best not to worry about deletions in the scoring, and to be consistent and use the stats (e.g. maxDoc, not numDocs) that do not take deletions into account.
          • for computeStats(TermContext... termContexts), it's weird to sum the DF across the different terms in this case. But I don't honestly have any suggestions here... maybe in this case we should make an EasyPhraseStats that computes the EasyStats for each term, so it's not hiding anything or limiting anyone? You could then do an instanceof check and forward to a different method like scorePhrase() in case it's an EasyPhraseStats. In general I'm not sure how other ranking systems tend to handle this case; the phrase estimation for IDF in Lucene's formula is done by summing the IDFs
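The summed-IDF approach mentioned at the end can be sketched as follows, using the classic Lucene idf term, 1 + log(numDocs / (docFreq + 1)); the method name and signature are hypothetical:

```java
public class PhraseIdfSketch {
  // Phrase "idf" as the sum of per-term idfs, as in Lucene's TF-IDF formula.
  static double phraseIdf(int maxDoc, int... docFreqs) {
    double idf = 0;
    for (int df : docFreqs) {
      // classic Lucene idf for a single term
      idf += Math.log(maxDoc / (double) (df + 1)) + 1.0;
    }
    return idf;
  }
}
```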
          Hide
          David Mark Nemeskey added a comment -

          EasySimilarity added. Lots of questions and nocommit in the code.

          Hide
          David Mark Nemeskey added a comment -

          Done.

          Hide
          Robert Muir added a comment -

          one last thing, can we do 'numberOfFieldTokens' instead of noFieldTokens?

          then I think we can commit this as a step; it should make experimentation a lot easier, especially for anyone new to Lucene.

          Hide
          David Mark Nemeskey added a comment -

          Oh, sorry, how lame of me. Actually, I am working on a different machine than the one I usually use, so that's why I made those mistakes. Anyhow, I have fixed them.

          Hide
          Robert Muir added a comment -

          oh, a few more nitpicky comments:

          • can you update the patch to use two-spaces instead of tabs? if you use eclipse, you can download this and configure this as your default codestyle: http://people.apache.org/~rmuir/Eclipse-Lucene-Codestyle.xml
          • can you also remove the @author? For legal reasons (i think actually for your protection!) we omit these from new files.
          • it might be a good idea to use the tag @lucene.experimental also for new classes: this is a template that 'ant-javadocs' replaces with "WARNING: This API is experimental and might change in incompatible ways in the next release." to tell users that its very new and not to expect precise backwards compatibility.
          Hide
          Robert Muir added a comment -

          I'll put a nocommit there for the time being, and if no sims use it, I'll just remove it from the Stats. Terrier has it, though, so I guess there should be at least one method that depends on it.

          I've never seen one that did... I don't imagine us ever implementing this efficiently given that we support incremental indexing.

          Hide
          David Mark Nemeskey added a comment -
          • I was wondering about that too – actually docNo is a mistake, it should have been noDocs or noOfDocs anyway, but I guess I'll just go with numberOfDocuments.
          • I'll put a nocommit there for the time being, and if no sims use it, I'll just remove it from the Stats. Terrier has it, though, so I guess there should be at least one method that depends on it.
          Hide
          Robert Muir added a comment -

          a few comments (it generally looks close to me):

          • maybe we should use 'numberOfDocuments' instead of 'docNo', and the same with 'numberOfFieldTokens'? This might make the naming clearer
          • I'm worried about 'uniqueTermCount': do you know which implementations require this? This number is not accurate if the index has more than one segment.
          Hide
          David Mark Nemeskey added a comment -

          EasyStats object added.


            People

            • Assignee:
              David Mark Nemeskey
            • Reporter:
              David Mark Nemeskey
            • Votes:
              0
            • Watchers:
              4


                 Time Tracking

                 Original Estimate: 336h
                 Remaining Estimate: 336h
                 Time Spent: Not Specified
