Details

    • Type: Task Task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I thought about this after looking @ LUCENE-4353:

      AtomicReader has some sugar APIs that are over top of the flex apis (Fields, Terms, ...). But these might be a little trappy/confusing compared to 3.x.

      1. I dont think we need AtomicReader.termDocsEnum(Bits, ...) and .termPositionsEnum(Bits, ...). I also don't think we need variants that take flags here. We should simplify these to be less trappy. I think we only need (String, BytesRef) here.
      2. This means you need to use the flex apis for more expert usage: but we make this a bit too hard since we only let you get a Terms (which you must null check, then call .iterator() on, then seekExact, ...). I think it could help if we balanced this out by adding some sugar like AtomicReader.termsEnum? 3.x had a method that let you get a 'positioned termsenum'.
      1. LUCENE-4355.patch
        39 kB
        Robert Muir
      2. LUCENE-4355.patch
        20 kB
        Robert Muir

        Activity

        Hide
        Uwe Schindler added a comment -

        I would start with removing all these APIs except fields() from AtomicReader, fix all tests and then readd "useful" ones (useful for enduser).

        Show
        Uwe Schindler added a comment - I would start with removing all these APIs except fields() from AtomicReader, fix all tests and then readd "useful" ones (useful for enduser).
        Hide
        Robert Muir added a comment -

        I agree, this was mainly to start discussion about what sugar apis we should have.

        Currently its very inconsistent.

        IndexReader:

        • docFreq(Term) -> forwards to docFreq(String, BytesRef)
        • docFreq(String, BytesRef) -> (abstract: this can be seen as a sugar API)

        AtomicReader:

        • totalTermFreq(String, BytesRef) -> strange to be treated differently than docFreq, sugar to seekExact+totalTermFreq
        • terms(String) -> note that in 3.x terms() and terms(Term) are different and go to TermsEnum (unpositioned and positioned)
        • termDocsEnum(Bits, String, BytesRef) -> the Bits should be implicit in the reader. if you want your own bits use flex apis?
        • termDocsEnum(Bits, String, BytesRef, int) -> flags seems too expert
        • termPositionsEnum(Bits, String, BytesRef)
        • termPositionsEnum(Bits, String, BytesRef, int)

        So we should also discuss whether its useful to use Term at the indexReader level. If we are going to have sugar
        for docFreq(Term) then we should do this elsewhere too? Term is somewhat nice because it means users don't have
        to deal with BytesRef etc.

        I wonder if totalTermFreq sugar is necessary here too, if we instead make it easy for you to get a positioned termsenum for
        a specific term (you could just call it yourself then).

        We should also think about the names termDocsEnum/termPositionsEnum. in 3.x these were termDocs() and termPositions(),
        and could take Term.

        The only thing I feel pretty strongly about is that I think passing a custom Bits is too expert for these sugar APIs,
        as its something implicit from the Reader itself.

        Show
        Robert Muir added a comment - I agree, this was mainly to start discussion about what sugar apis we should have. Currently its very inconsistent. IndexReader: docFreq(Term) -> forwards to docFreq(String, BytesRef) docFreq(String, BytesRef) -> (abstract: this can be seen as a sugar API) AtomicReader: totalTermFreq(String, BytesRef) -> strange to be treated differently than docFreq, sugar to seekExact+totalTermFreq terms(String) -> note that in 3.x terms() and terms(Term) are different and go to TermsEnum (unpositioned and positioned) termDocsEnum(Bits, String, BytesRef) -> the Bits should be implicit in the reader. if you want your own bits use flex apis? termDocsEnum(Bits, String, BytesRef, int) -> flags seems too expert termPositionsEnum(Bits, String, BytesRef) termPositionsEnum(Bits, String, BytesRef, int) So we should also discuss whether its useful to use Term at the indexReader level. If we are going to have sugar for docFreq(Term) then we should do this elsewhere too? Term is somewhat nice because it means users don't have to deal with BytesRef etc. I wonder if totalTermFreq sugar is necessary here too, if we instead make it easy for you to get a positioned termsenum for a specific term (you could just call it yourself then). We should also think about the names termDocsEnum/termPositionsEnum. in 3.x these were termDocs() and termPositions(), and could take Term. The only thing I feel pretty strongly about is that I think passing a custom Bits is too expert for these sugar APIs, as its something implicit from the Reader itself.
        Hide
        Michael McCandless added a comment -

        Maybe we should remove all the sugar methods...? It's quite expert to pull a D/D&PEnum? But maybe stats are more commonly used?

        Show
        Michael McCandless added a comment - Maybe we should remove all the sugar methods...? It's quite expert to pull a D/D&PEnum? But maybe stats are more commonly used?
        Hide
        Robert Muir added a comment -

        I don't think I agree Mike. I think we should degrade into expert territory rather than it being a sharp cliff.
        I think we should also make migration from previous versions of lucene easier too.

        I think these apis on IR are a good way to do that. I'm tempted to suggest:

        termDocs(Term) & termPositions(Term) as the sugar postings APIs as those pretty much match the 3.x functionality.

        I'm not sure these sugar APIs should take BytesRef, thats another head explosion for someone above Term which
        is simpler and takes Strings.

        If someone is going to be calling these on lots of things anyway they can just use fields()/terms()/etc.

        We also have to realize its a lot of work to compute something like docFreq without any sugar at all,
        just look at the code to docFreq:

        final Fields fields = fields();
        if (fields == null) {
          return 0;
        }
        final Terms terms = fields.terms(field);
        if (terms == null) {
          return 0;
        }
        final TermsEnum termsEnum = terms.iterator(null);
        if (termsEnum.seekExact(term, true)) {
          return termsEnum.docFreq();
        } else {
          return 0;
        }
        

        Thats too much boilerplate and special-cases. the terms(String) sugar helps a lot here, reducing it to:

        final Terms terms = ir.terms(field);
        if (terms == null) {
          return 0;
        }
        final TermsEnum termsEnum = terms.iterator(null);
        if (termsEnum.seekExact(term, true)) {
          return termsEnum.docFreq();
        } else {
          return 0;
        }
        

        But thats still too much. Making a positioned termsenum more accessible could help with a lot
        of expert use-cases like getting enums with different Bits or flags or getting term-level stats:

        final TermsEnum te = ir.termsEnum(new Term("field", "value"));
        if (te == null) {
          return 0;
        } else {
          return te.docFreq();
        }
        

        The oddity might be that compared to 3.x, its a seekExact vs. a seekCeil. But i think thats ok,
        after all we already "backwards-broke" since terms() does something totally different than 3.x (and I think
        we should keep that, making it easy to access field-level metadata!)

        And I still think we should keep docFreq/totalTermFreq sugar!

        Show
        Robert Muir added a comment - I don't think I agree Mike. I think we should degrade into expert territory rather than it being a sharp cliff. I think we should also make migration from previous versions of lucene easier too. I think these apis on IR are a good way to do that. I'm tempted to suggest: termDocs(Term) & termPositions(Term) as the sugar postings APIs as those pretty much match the 3.x functionality. I'm not sure these sugar APIs should take BytesRef, thats another head explosion for someone above Term which is simpler and takes Strings. If someone is going to be calling these on lots of things anyway they can just use fields()/terms()/etc. We also have to realize its a lot of work to compute something like docFreq without any sugar at all, just look at the code to docFreq: final Fields fields = fields(); if (fields == null ) { return 0; } final Terms terms = fields.terms(field); if (terms == null ) { return 0; } final TermsEnum termsEnum = terms.iterator( null ); if (termsEnum.seekExact(term, true )) { return termsEnum.docFreq(); } else { return 0; } Thats too much boilerplate and special-cases. the terms(String) sugar helps a lot here, reducing it to: final Terms terms = ir.terms(field); if (terms == null ) { return 0; } final TermsEnum termsEnum = terms.iterator( null ); if (termsEnum.seekExact(term, true )) { return termsEnum.docFreq(); } else { return 0; } But thats still too much. Making a positioned termsenum more accessible could help with a lot of expert use-cases like getting enums with different Bits or flags or getting term-level stats: final TermsEnum te = ir.termsEnum( new Term( "field" , "value" )); if (te == null ) { return 0; } else { return te.docFreq(); } The oddity might be that compared to 3.x, its a seekExact vs. a seekCeil. But i think thats ok, after all we already "backwards-broke" since terms() does something totally different than 3.x (and I think we should keep that, making it easy to access field-level metadata!) And I still think we should keep docFreq/totalTermFreq sugar!
        Hide
        Michael McCandless added a comment -

        I'm OK with keeping the sugar too. I agree the boilerplate code is sizable.

        I think only taking Term, not taking Bits, keeps the docs/positions enum simple.

        Should we sugar for all stats? (eg IR.getSumTotalTermFreq(String field)).

        Show
        Michael McCandless added a comment - I'm OK with keeping the sugar too. I agree the boilerplate code is sizable. I think only taking Term, not taking Bits, keeps the docs/positions enum simple. Should we sugar for all stats? (eg IR.getSumTotalTermFreq(String field)).
        Hide
        Robert Muir added a comment -

        here's an incomplete patch just looking at IR.docFreq: found a few wierd things looking around.

        I think fixing all these sugar APIs is going to be pretty major and a lot of work.

        In general I think we can accomplish this issue in a backwards-compatible way later: sure it will mean we have deprecations but I think this is better than rushing stuff in.

        Show
        Robert Muir added a comment - here's an incomplete patch just looking at IR.docFreq: found a few wierd things looking around. I think fixing all these sugar APIs is going to be pretty major and a lot of work. In general I think we can accomplish this issue in a backwards-compatible way later: sure it will mean we have deprecations but I think this is better than rushing stuff in.
        Hide
        Michael McCandless added a comment -

        +1, patch looks good for docFreq.

        Show
        Michael McCandless added a comment - +1, patch looks good for docFreq.
        Hide
        Robert Muir added a comment -

        updated patch: sugar fixed for docsEnum/d&pEnum as proposed.

        wasn't as bad as I thought

        Show
        Robert Muir added a comment - updated patch: sugar fixed for docsEnum/d&pEnum as proposed. wasn't as bad as I thought
        Hide
        Michael McCandless added a comment -

        +1, looks great!

        Show
        Michael McCandless added a comment - +1, looks great!
        Hide
        Robert Muir added a comment -

        Thanks Mike: Ill give some time in case anyone else wants to review, but i'd like to commit this in a day or two.

        Show
        Robert Muir added a comment - Thanks Mike: Ill give some time in case anyone else wants to review, but i'd like to commit this in a day or two.
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Robert Muir
        http://svn.apache.org/viewvc?view=revision&revision=1384290

        LUCENE-4355: upgrade MIGRATE.txt (also fix a bug in field+term enumeration)

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1384290 LUCENE-4355 : upgrade MIGRATE.txt (also fix a bug in field+term enumeration)
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Robert Muir
        http://svn.apache.org/viewvc?view=revision&revision=1384286

        LUCENE-4355: improve AtomicReader sugar apis

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1384286 LUCENE-4355 : improve AtomicReader sugar apis
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.

          People

          • Assignee:
            Robert Muir
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development