Lucene - Core
  1. Lucene - Core
  2. LUCENE-4670

Add TermVectorsWriter.finish{Doc,Field,Term} to make development of new formats easier

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This is especially useful to LUCENE-4599 where actions have to be taken after a doc/field/term has been added.

      1. LUCENE-4670.patch
        19 kB
        Adrien Grand
      2. LUCENE-4670.patch
        17 kB
        Adrien Grand
      3. LUCENE-4670.patch
        11 kB
        Robert Muir
      4. LUCENE-4670.patch
        8 kB
        Adrien Grand

        Issue Links

          Activity

          Hide
          Adrien Grand added a comment -

          Patch, with new assertions in AssertingCodec to make sure that there is a finish call for every start call. Tests passed with -Dtest.codec=Asserting.

          Show
          Adrien Grand added a comment - Patch, with new assertions in AssertingCodec to make sure that there is a finish call for every start call. Tests passed with -Dtest.codec=Asserting.
          Hide
          Robert Muir added a comment -

          Wouldnt this be redundant?

          • startDocument(int numVectorFields <-- this tells you how many times startField will be called
          • startField(FieldInfo info, int numTerms <-- this tells you how many times startTerm will be called
          • startTerm(BytesRef term, int freq <-- this tells you how many times addPosition will be called
          Show
          Robert Muir added a comment - Wouldnt this be redundant? startDocument(int numVectorFields <-- this tells you how many times startField will be called startField(FieldInfo info, int numTerms <-- this tells you how many times startTerm will be called startTerm(BytesRef term, int freq <-- this tells you how many times addPosition will be called
          Hide
          Adrien Grand added a comment -

          Wouldnt this be redundant?

          Yes it is, but I think it can make the format easier to implement and later to understand? PostingsConsumer already does it (startDoc(docID, freq) / finishDoc).

          Show
          Adrien Grand added a comment - Wouldnt this be redundant? Yes it is, but I think it can make the format easier to implement and later to understand? PostingsConsumer already does it (startDoc(docID, freq) / finishDoc).
          Hide
          Adrien Grand added a comment -

          For example, if you want to flush data after every field has been added, today you need to do it both in the finish and startField methods, and in both cases you need to check whether startField had already been called earlier on. By having a finishField method, the modification is in one place and doesn't need an extra condition.

          Show
          Adrien Grand added a comment - For example, if you want to flush data after every field has been added, today you need to do it both in the finish and startField methods, and in both cases you need to check whether startField had already been called earlier on. By having a finishField method, the modification is in one place and doesn't need an extra condition.
          Hide
          Robert Muir added a comment -

          Terms/PostingsConsumer doesnt do it in general: e.g. startField doesn't tell you the number of terms, and startTerms
          doesnt tell you the number of documents. so they must have finish() since they are filtering deleted docs on the fly.

          For the per-document apis (Stored Fields, Term Vectors), we instead give you this number totally up-front (as it makes it easier to e.g. write numTerms into your file).

          I'm not necessarily opposed to the redundant calls, but it should then also be done with the stored fields api. And i'd like to see if it really simplifies some of our existing impls (SimpleText, Lucene40) as well.

          Finally, adding checks to AssertingCodec as a test is a good idea, however it still leaves our default merge implementation untested because the wrapped codec implements bulk merge.

          Show
          Robert Muir added a comment - Terms/PostingsConsumer doesnt do it in general: e.g. startField doesn't tell you the number of terms, and startTerms doesnt tell you the number of documents. so they must have finish() since they are filtering deleted docs on the fly. For the per-document apis (Stored Fields, Term Vectors), we instead give you this number totally up-front (as it makes it easier to e.g. write numTerms into your file). I'm not necessarily opposed to the redundant calls, but it should then also be done with the stored fields api. And i'd like to see if it really simplifies some of our existing impls (SimpleText, Lucene40) as well. Finally, adding checks to AssertingCodec as a test is a good idea, however it still leaves our default merge implementation untested because the wrapped codec implements bulk merge.
          Hide
          Robert Muir added a comment -

          For example, if you want to flush data after every field has been added, today you need to do it both in the finish and startField methods, and in both cases you need to check whether startField had already been called earlier on. By having a finishField method, the modification is in one place and doesn't need an extra condition.

          Yeah I think this currently makes Lucene40's impl confusing too: check out its startField. If we can simplify that one too, i'm completely sold.

          I still feel like we should do this to the stored fields api too though for consistency.

          Show
          Robert Muir added a comment - For example, if you want to flush data after every field has been added, today you need to do it both in the finish and startField methods, and in both cases you need to check whether startField had already been called earlier on. By having a finishField method, the modification is in one place and doesn't need an extra condition. Yeah I think this currently makes Lucene40's impl confusing too: check out its startField. If we can simplify that one too, i'm completely sold. I still feel like we should do this to the stored fields api too though for consistency.
          Hide
          Robert Muir added a comment -

          and by "if we can simplify" for the 4.0 codec, I dont necessarily mean we change the code. Its good enough for to stare at it and be able to tell if it would be simpler

          Show
          Robert Muir added a comment - and by "if we can simplify" for the 4.0 codec, I dont necessarily mean we change the code. Its good enough for to stare at it and be able to tell if it would be simpler
          Hide
          Adrien Grand added a comment -

          I still feel like we should do this to the stored fields api too though for consistency.

          Agreed.

          Its good enough for to stare at it and be able to tell if it would be simpler

          I think it would? Things like

              if (fieldCount == numVectorFields) {
                // last field of the document
                // this is crazy because the file format is crazy!
                for (int i = 1; i < fieldCount; i++) {
                  tvd.writeVLong(fps[i] - fps[i-1]);
                }
              }
          

          in startField could become

              public void finishDocument() throws IOException {
                // last field of the document
                // this is crazy because the file format is crazy!
                for (int i = 1; i < fieldCount; i++) {
                  tvd.writeVLong(fps[i] - fps[i-1]);
                }
              }
          

          It would help simplify Lucene41StoredFieldsFormat too.

          it still leaves our default merge implementation untested because the wrapped codec implements bulk merge.

          Do you have an idea how to test it?

          Show
          Adrien Grand added a comment - I still feel like we should do this to the stored fields api too though for consistency. Agreed. Its good enough for to stare at it and be able to tell if it would be simpler I think it would? Things like if (fieldCount == numVectorFields) { // last field of the document // this is crazy because the file format is crazy! for ( int i = 1; i < fieldCount; i++) { tvd.writeVLong(fps[i] - fps[i-1]); } } in startField could become public void finishDocument() throws IOException { // last field of the document // this is crazy because the file format is crazy! for ( int i = 1; i < fieldCount; i++) { tvd.writeVLong(fps[i] - fps[i-1]); } } It would help simplify Lucene41StoredFieldsFormat too. it still leaves our default merge implementation untested because the wrapped codec implements bulk merge. Do you have an idea how to test it?
          Hide
          Robert Muir added a comment -

          i think this looks better?

          I can go either way on whether or not the finish() methods should be abstract though (versus having a no-op impl).

          Show
          Robert Muir added a comment - i think this looks better? I can go either way on whether or not the finish() methods should be abstract though (versus having a no-op impl).
          Hide
          Robert Muir added a comment -

          Do you have an idea how to test it?

          I think a very simple solution would be to have two asserting codecs instead of one
          1. Asserting(Lucene41) like we have today
          2. Asserting(SimpleText).

          Its already a filtercodec, and this way we test the unoptimized default merge impls everywhere for sure,
          since simpletext never has such optimizations.

          Show
          Robert Muir added a comment - Do you have an idea how to test it? I think a very simple solution would be to have two asserting codecs instead of one 1. Asserting(Lucene41) like we have today 2. Asserting(SimpleText). Its already a filtercodec, and this way we test the unoptimized default merge impls everywhere for sure, since simpletext never has such optimizations.
          Hide
          Adrien Grand added a comment -

          i think this looks better?

          Yes it does!

          I can go either way on whether or not the finish() methods should be abstract though (versus having a no-op impl).

          A no-op impl would make the change backward compatible?

          Show
          Adrien Grand added a comment - i think this looks better? Yes it does! I can go either way on whether or not the finish() methods should be abstract though (versus having a no-op impl). A no-op impl would make the change backward compatible?
          Hide
          Robert Muir added a comment -

          I think for this case we should not consider API backwards compatibility (The codec api is experimental).

          Its far better to have the best APIs possible. It just seems confusing for finish(FieldInfos, int)
          to be abstract and the others not.

          On the other hand adding more abstract methods makes the API more overwhelming...

          Show
          Robert Muir added a comment - I think for this case we should not consider API backwards compatibility (The codec api is experimental). Its far better to have the best APIs possible. It just seems confusing for finish(FieldInfos, int) to be abstract and the others not. On the other hand adding more abstract methods makes the API more overwhelming...
          Hide
          Robert Muir added a comment -

          I think a very simple solution would be to have two asserting codecs instead of one
          1. Asserting(Lucene41) like we have today
          2. Asserting(SimpleText)

          We don't technically need this actually. Our current bulk merge impls use instanceof,
          so they should never happen when we use the Asserting codec right?

          If this is really true then non-bulk merges are already tested a lot better than I thought

          Show
          Robert Muir added a comment - I think a very simple solution would be to have two asserting codecs instead of one 1. Asserting(Lucene41) like we have today 2. Asserting(SimpleText) We don't technically need this actually. Our current bulk merge impls use instanceof, so they should never happen when we use the Asserting codec right? If this is really true then non-bulk merges are already tested a lot better than I thought
          Hide
          Adrien Grand added a comment -

          I think for this case we should not consider API backwards compatibility (The codec api is experimental).

          On the other hand adding more abstract methods makes the API more overwhelming...

          I'm fine with both options.

          We don't technically need this actually. Our current bulk merge impls use instanceof, so they should never happen when we use the Asserting codec right?

          Right.

          Show
          Adrien Grand added a comment - I think for this case we should not consider API backwards compatibility (The codec api is experimental). On the other hand adding more abstract methods makes the API more overwhelming... I'm fine with both options. We don't technically need this actually. Our current bulk merge impls use instanceof, so they should never happen when we use the Asserting codec right? Right.
          Hide
          Robert Muir added a comment -

          OK... maybe for now just go forward with the no-op impls? Its redundant anyway so if you dont "notice" that these
          extra hooks are available it won't hurt you, unless you are doing some kind of wrapping or so on (we should look for that just in case).

          I added the assertingstoredfields so we should be able to also move the 'assert fieldCount == 0' to its finishDocument()

          Show
          Robert Muir added a comment - OK... maybe for now just go forward with the no-op impls? Its redundant anyway so if you dont "notice" that these extra hooks are available it won't hurt you, unless you are doing some kind of wrapping or so on (we should look for that just in case). I added the assertingstoredfields so we should be able to also move the 'assert fieldCount == 0' to its finishDocument()
          Hide
          Adrien Grand added a comment -

          New patch with stored fields.

          Show
          Adrien Grand added a comment - New patch with stored fields.
          Hide
          Robert Muir added a comment -

          looks good. we could also add the state machine in the future to the AssertingStoredFields like the term vectors has.

          this way we ensure we don't forget to call finishDocument somewhere: currently we only check that we actually call
          writeField numStoredFields times and so on.

          Show
          Robert Muir added a comment - looks good. we could also add the state machine in the future to the AssertingStoredFields like the term vectors has. this way we ensure we don't forget to call finishDocument somewhere: currently we only check that we actually call writeField numStoredFields times and so on.
          Hide
          Adrien Grand added a comment -

          New patch with state machine for stored fields.

          Show
          Adrien Grand added a comment - New patch with state machine for stored fields.
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Adrien Grand
          http://svn.apache.org/viewvc?view=revision&revision=1431283

          LUCENE-4670: Add finish* callbacks to StoredFieldsWriter and TermVectorsWriter.

          Show
          Commit Tag Bot added a comment - [trunk commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1431283 LUCENE-4670 : Add finish* callbacks to StoredFieldsWriter and TermVectorsWriter.
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Adrien Grand
          http://svn.apache.org/viewvc?view=revision&revision=1431294

          LUCENE-4670: Add finish* callbacks to StoredFieldsWriter and TermVectorsWriter.

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1431294 LUCENE-4670 : Add finish* callbacks to StoredFieldsWriter and TermVectorsWriter.

            People

            • Assignee:
              Adrien Grand
              Reporter:
              Adrien Grand
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development