Lucene - Core
  1. Lucene - Core
  2. LUCENE-4733

Make CompressingTermVectorsFormat the new default term vectors format?

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Trivial Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      In LUCENE-4599, I wrote an alternate term vectors format which has a more compact format, and I think it could replace the current Lucene40TermVectorsFormat for the next (4.2) release?

      1. LUCENE-4733-javadocs.patch
        21 kB
        Adrien Grand
      2. LUCENE-4733-tests.patch
        56 kB
        Adrien Grand

        Issue Links

          Activity

          Hide
          Adrien Grand added a comment -

          It would be great if someone could set up a Jenkins job to run tests with -Dtests.codec=Compressing to catch bugs of this new format. Otherwise, the following points need to be addressed (I'll take care of these):

          • write file format docs,
          • wire this format in Lucene42Codec (I'll wait for the lucene4547 to be merged back into trunk for this one I think),
          • write tests for corner cases (small/large number of fields/terms/positions/offsets/payloads, etc.),
          • find good default params (chunk size / compression alg).

          Feel free to let me know if you think it is a bad idea.

          Show
          Adrien Grand added a comment - It would be great if someone could set up a Jenkins job to run tests with -Dtests.codec=Compressing to catch bugs of this new format. Otherwise, the following points need to be addressed (I'll take care of these): write file format docs, wire this format in Lucene42Codec (I'll wait for the lucene4547 to be merged back into trunk for this one I think), write tests for corner cases (small/large number of fields/terms/positions/offsets/payloads, etc.), find good default params (chunk size / compression alg). Feel free to let me know if you think it is a bad idea.
          Hide
          Robert Muir added a comment -

          I think its a good idea! About the corner cases:

          If we are going to add new tests anyway, maybe we could fashion it as a base class like BaseDocValuesFormat/BasePostingsFormat?
          Each TVFormat we have (4.0, simpletext, 4.2, etc) would extend this base class and just return getCodec().

          Any tests in TestTermVectors/Reader/Writer today that are testing codecs (and not really indexwriter tests) could be folded in.

          Of course if any tests are *compressing specific they can just stay in its folder too.

          Show
          Robert Muir added a comment - I think its a good idea! About the corner cases: If we are going to add new tests anyway, maybe we could fashion it as a base class like BaseDocValuesFormat/BasePostingsFormat? Each TVFormat we have (4.0, simpletext, 4.2, etc) would extend this base class and just return getCodec(). Any tests in TestTermVectors/Reader/Writer today that are testing codecs (and not really indexwriter tests) could be folded in. Of course if any tests are *compressing specific they can just stay in its folder too.
          Hide
          Adrien Grand added a comment -

          Step 1: Patch that adds a new BaseTermVectorsFormatTestCase (extended for SimpleText, Lucene40 and Compressing). I'll commit soon if nobody objects.

          Show
          Adrien Grand added a comment - Step 1: Patch that adds a new BaseTermVectorsFormatTestCase (extended for SimpleText, Lucene40 and Compressing). I'll commit soon if nobody objects.
          Hide
          Robert Muir added a comment -

          +1: though I think we need a little javadoc sentence on that since its in test-framework or the build will complain.

          we can just describe its ideal purpose is to be the "one test" for your format like the postings/docvalues equivalents.

          Show
          Robert Muir added a comment - +1: though I think we need a little javadoc sentence on that since its in test-framework or the build will complain. we can just describe its ideal purpose is to be the "one test" for your format like the postings/docvalues equivalents.
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Adrien Grand
          http://svn.apache.org/viewvc?view=revision&revision=1441367

          LUCENE-4733: Refactor term vectors formats tests around a BaseTermVectorsFormatTestCase.

          Show
          Commit Tag Bot added a comment - [trunk commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1441367 LUCENE-4733 : Refactor term vectors formats tests around a BaseTermVectorsFormatTestCase.
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Adrien Grand
          http://svn.apache.org/viewvc?view=revision&revision=1441379

          LUCENE-4733: Refactor term vectors formats tests around a BaseTermVectorsFormatTestCase (merged from r1441367).

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1441379 LUCENE-4733 : Refactor term vectors formats tests around a BaseTermVectorsFormatTestCase (merged from r1441367).
          Hide
          Adrien Grand added a comment -

          New patch that adds Lucene42TermVectorsFormat with file format javadocs.

          I have set chunkSize=4KB as it seemed to be a good compromise between compression and speed in LUCENE-4599 (this chunk size only accounts for terms and payloads). Moreover, it uses LZ4 compression (which is very light) so that compression/decompression is not the bottleneck even for small indexes which fit into memory.

          Show
          Adrien Grand added a comment - New patch that adds Lucene42TermVectorsFormat with file format javadocs. I have set chunkSize=4KB as it seemed to be a good compromise between compression and speed in LUCENE-4599 (this chunk size only accounts for terms and payloads). Moreover, it uses LZ4 compression (which is very light) so that compression/decompression is not the bottleneck even for small indexes which fit into memory.
          Hide
          Adrien Grand added a comment -

          I forgot to mention that the patch adds the format, but no codec. Should we wait for lucene4547 to be merged back into trunk and then just change the term vectors format or should we add the new Lucene42Codec in trunk without taking care of the lucene4547 branch? I guess it depends on how soon this branch is going to land on trunk?

          Show
          Adrien Grand added a comment - I forgot to mention that the patch adds the format, but no codec. Should we wait for lucene4547 to be merged back into trunk and then just change the term vectors format or should we add the new Lucene42Codec in trunk without taking care of the lucene4547 branch? I guess it depends on how soon this branch is going to land on trunk?
          Hide
          Robert Muir added a comment -

          unrelated branches shouldn't delay trunk development.

          let me try to merge the relevant commits for 4.2 codec (and 4.1 impersonator and so on) to trunk...

          Show
          Robert Muir added a comment - unrelated branches shouldn't delay trunk development. let me try to merge the relevant commits for 4.2 codec (and 4.1 impersonator and so on) to trunk...
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Robert Muir
          http://svn.apache.org/viewvc?view=revision&revision=1441571

          LUCENE-4733: merge Lucene42 codec from lucene-4547 branch

          Show
          Commit Tag Bot added a comment - [trunk commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1441571 LUCENE-4733 : merge Lucene42 codec from lucene-4547 branch
          Hide
          Robert Muir added a comment -

          ok its done: have fun!

          Show
          Robert Muir added a comment - ok its done: have fun!
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Robert Muir
          http://svn.apache.org/viewvc?view=revision&revision=1441578

          LUCENE-4733: merge Lucene42 codec from lucene-4547 branch

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1441578 LUCENE-4733 : merge Lucene42 codec from lucene-4547 branch
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Adrien Grand
          http://svn.apache.org/viewvc?view=revision&revision=1441760

          LUCENE-4733: Make CompressingTermVectorsFormat the new default term vectors format (merged from r1441732).

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1441760 LUCENE-4733 : Make CompressingTermVectorsFormat the new default term vectors format (merged from r1441732).
          Hide
          Adrien Grand added a comment -

          ok its done: have fun!

          I will! Thanks, Robert!

          Show
          Adrien Grand added a comment - ok its done: have fun! I will! Thanks, Robert!
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Adrien Grand
          http://svn.apache.org/viewvc?view=revision&revision=1441761

          LUCENE-4733: More term vectors formats tests (TVReader.clone and TVWriter.merge).

          Show
          Commit Tag Bot added a comment - [trunk commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1441761 LUCENE-4733 : More term vectors formats tests (TVReader.clone and TVWriter.merge).
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Adrien Grand
          http://svn.apache.org/viewvc?view=revision&revision=1441763

          LUCENE-4733: More term vectors formats tests (TVReader.clone and TVWriter.merge).

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1441763 LUCENE-4733 : More term vectors formats tests (TVReader.clone and TVWriter.merge).
          Hide
          Uwe Schindler added a comment -

          Closed after release.

          Show
          Uwe Schindler added a comment - Closed after release.

            People

            • Assignee:
              Adrien Grand
              Reporter:
              Adrien Grand
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development