Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-4051

Fix File Headers for Lucene40 StoredFields & TermVectors

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/codecs
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Currently we still write the "old" file header format in Lucene40StoredFieldFormat & Lucene40TermVectorsFormat. We should cut over to use CodecUtil and reset the versioning before we release Lucene 4.0

      1. LUCENE-4051.patch
        42 kB
        Simon Willnauer
      2. LUCENE-4051.patch
        41 kB
        Simon Willnauer
      3. LUCENE-4051.patch
        37 kB
        Simon Willnauer
      4. LUCENE-4051.patch
        20 kB
        Simon Willnauer
      5. LUCENE-4051.patch
        21 kB
        Simon Willnauer

        Activity

        Hide
        simonw Simon Willnauer added a comment -

        here is a first patch

        Show
        simonw Simon Willnauer added a comment - here is a first patch
        Hide
        mikemccand Michael McCandless added a comment -

        +1

        Show
        mikemccand Michael McCandless added a comment - +1
        Hide
        rcmuir Robert Muir added a comment -

        By the way: we can use CodecUtil.headerLength to have a static final
        for the length of the header (rather than getFilePointer).

        I think in general we should do this since the codec name (and thus
        header length) is still fixed.

        Show
        rcmuir Robert Muir added a comment - By the way: we can use CodecUtil.headerLength to have a static final for the length of the header (rather than getFilePointer). I think in general we should do this since the codec name (and thus header length) is still fixed.
        Hide
        simonw Simon Willnauer added a comment -

        By the way: we can use CodecUtil.headerLength to have a static final

        for the length of the header (rather than getFilePointer).

        new patch using a static HEADER_LENGTH for this. thanks robert for the pointer.

        Show
        simonw Simon Willnauer added a comment - By the way: we can use CodecUtil.headerLength to have a static final for the length of the header (rather than getFilePointer). new patch using a static HEADER_LENGTH for this. thanks robert for the pointer.
        Hide
        rcmuir Robert Muir added a comment -

        One more comment: i think actually we should use different codecnames for each file.

        (Like for .fdt versus .fdx and so on for the 3 term vectors files). This way we are
        validating that we actually read the stream we think we are reading.

        Show
        rcmuir Robert Muir added a comment - One more comment: i think actually we should use different codecnames for each file. (Like for .fdt versus .fdx and so on for the 3 term vectors files). This way we are validating that we actually read the stream we think we are reading.
        Hide
        simonw Simon Willnauer added a comment -

        here is a new patch adding per file headers (while reusing the version). I also cut over the docvalues headers to be per file.

        I think this is a sep. issues but if we can batch up File Format changes I think we should do it. I will look for other codecs to get this consistent where possible.

        Show
        simonw Simon Willnauer added a comment - here is a new patch adding per file headers (while reusing the version). I also cut over the docvalues headers to be per file. I think this is a sep. issues but if we can batch up File Format changes I think we should do it. I will look for other codecs to get this consistent where possible.
        Hide
        rcmuir Robert Muir added a comment -

        Patch looks good!

        Can you update the fileformat javadocs (in Lucene40*Format.java) to document the codec header?
        In the other places we just treat it like a primitive type and link it to CodecUtil#writeHeader where its
        format is described in detail.

        Show
        rcmuir Robert Muir added a comment - Patch looks good! Can you update the fileformat javadocs (in Lucene40*Format.java) to document the codec header? In the other places we just treat it like a primitive type and link it to CodecUtil#writeHeader where its format is described in detail.
        Hide
        simonw Simon Willnauer added a comment -

        Can you update the fileformat javadocs (in Lucene40*Format.java) to document the codec header?

        absolutely. this is still WIP since I need to check other files if we can / should make it writing a per-file header.

        Show
        simonw Simon Willnauer added a comment - Can you update the fileformat javadocs (in Lucene40*Format.java) to document the codec header? absolutely. this is still WIP since I need to check other files if we can / should make it writing a per-file header.
        Hide
        simonw Simon Willnauer added a comment -

        update codec format documenation. I think its ready...

        Show
        simonw Simon Willnauer added a comment - update codec format documenation. I think its ready...
        Hide
        rcmuir Robert Muir added a comment -

        javadocs look good: trivial nitpick i noticed some relics in TermVectorsFormat such as:

        * <p>Document (.tvd) --&gt; TVDVersion&lt;NumFields, FieldNums,
        

        I think TVDVersion should just say Header (i noticed this for TVF etc too)

        Show
        rcmuir Robert Muir added a comment - javadocs look good: trivial nitpick i noticed some relics in TermVectorsFormat such as: * <p>Document (.tvd) --&gt; TVDVersion&lt;NumFields, FieldNums, I think TVDVersion should just say Header (i noticed this for TVF etc too)
        Hide
        simonw Simon Willnauer added a comment -

        avadocs look good: trivial nitpick i noticed some relics in TermVectorsFormat such as:

        good that you reviewed it! I missed those! here is an updated patch. I plan to commit this soon...

        Show
        simonw Simon Willnauer added a comment - avadocs look good: trivial nitpick i noticed some relics in TermVectorsFormat such as: good that you reviewed it! I missed those! here is an updated patch. I plan to commit this soon...
        Hide
        rcmuir Robert Muir added a comment -

        +1 to commit, thanks for working this! Lets get this in asap. as soon as file formats
        changes are solidified we can start working towards a release candidate for 4.0 alpha

        Show
        rcmuir Robert Muir added a comment - +1 to commit, thanks for working this! Lets get this in asap. as soon as file formats changes are solidified we can start working towards a release candidate for 4.0 alpha
        Hide
        simonw Simon Willnauer added a comment -

        committed to trunk in rev. 1341768. I send out a headsup mail to the dev list since this breaks the index file format.

        thanks for reviewing.... lets get 4.0-alpha out!

        Show
        simonw Simon Willnauer added a comment - committed to trunk in rev. 1341768. I send out a headsup mail to the dev list since this breaks the index file format. thanks for reviewing.... lets get 4.0-alpha out!
        Hide
        thetaphi Uwe Schindler added a comment -

        Thank you very much! Now file formats are finally consistent. Maybe our index files' consistent magic numbers now also get added to the "file" unix command

        Show
        thetaphi Uwe Schindler added a comment - Thank you very much! Now file formats are finally consistent. Maybe our index files' consistent magic numbers now also get added to the "file" unix command

          People

          • Assignee:
            simonw Simon Willnauer
            Reporter:
            simonw Simon Willnauer
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development