Lucene - Core
  1. Lucene - Core
  2. LUCENE-4051

Fix File Headers for Lucene40 StoredFields & TermVectors

    Details

    • Type: Task Task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/codecs
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Currently we still write the "old" file header format in Lucene40StoredFieldFormat & Lucene40TermVectorsFormat. We should cut over to use CodecUtil and reset the versioning before we release Lucene 4.0

      1. LUCENE-4051.patch
        42 kB
        Simon Willnauer
      2. LUCENE-4051.patch
        41 kB
        Simon Willnauer
      3. LUCENE-4051.patch
        37 kB
        Simon Willnauer
      4. LUCENE-4051.patch
        20 kB
        Simon Willnauer
      5. LUCENE-4051.patch
        21 kB
        Simon Willnauer

        Activity

        Hide
        Uwe Schindler added a comment -

        Thank you very much! Now file formats are finally consistent. Maybe our index files' consistent magic numbers now also get added to the "file" unix command

        Show
        Uwe Schindler added a comment - Thank you very much! Now file formats are finally consistent. Maybe our index files' consistent magic numbers now also get added to the "file" unix command
        Hide
        Simon Willnauer added a comment -

        committed to trunk in rev. 1341768. I send out a headsup mail to the dev list since this breaks the index file format.

        thanks for reviewing.... lets get 4.0-alpha out!

        Show
        Simon Willnauer added a comment - committed to trunk in rev. 1341768. I send out a headsup mail to the dev list since this breaks the index file format. thanks for reviewing.... lets get 4.0-alpha out!
        Hide
        Robert Muir added a comment -

        +1 to commit, thanks for working this! Lets get this in asap. as soon as file formats
        changes are solidified we can start working towards a release candidate for 4.0 alpha

        Show
        Robert Muir added a comment - +1 to commit, thanks for working this! Lets get this in asap. as soon as file formats changes are solidified we can start working towards a release candidate for 4.0 alpha
        Hide
        Simon Willnauer added a comment -

        avadocs look good: trivial nitpick i noticed some relics in TermVectorsFormat such as:

        good that you reviewed it! I missed those! here is an updated patch. I plan to commit this soon...

        Show
        Simon Willnauer added a comment - avadocs look good: trivial nitpick i noticed some relics in TermVectorsFormat such as: good that you reviewed it! I missed those! here is an updated patch. I plan to commit this soon...
        Hide
        Robert Muir added a comment -

        javadocs look good: trivial nitpick i noticed some relics in TermVectorsFormat such as:

        * <p>Document (.tvd) --&gt; TVDVersion&lt;NumFields, FieldNums,
        

        I think TVDVersion should just say Header (i noticed this for TVF etc too)

        Show
        Robert Muir added a comment - javadocs look good: trivial nitpick i noticed some relics in TermVectorsFormat such as: * <p>Document (.tvd) --&gt; TVDVersion&lt;NumFields, FieldNums, I think TVDVersion should just say Header (i noticed this for TVF etc too)
        Hide
        Simon Willnauer added a comment -

        update codec format documenation. I think its ready...

        Show
        Simon Willnauer added a comment - update codec format documenation. I think its ready...
        Hide
        Simon Willnauer added a comment -

        Can you update the fileformat javadocs (in Lucene40*Format.java) to document the codec header?

        absolutely. this is still WIP since I need to check other files if we can / should make it writing a per-file header.

        Show
        Simon Willnauer added a comment - Can you update the fileformat javadocs (in Lucene40*Format.java) to document the codec header? absolutely. this is still WIP since I need to check other files if we can / should make it writing a per-file header.
        Hide
        Robert Muir added a comment -

        Patch looks good!

        Can you update the fileformat javadocs (in Lucene40*Format.java) to document the codec header?
        In the other places we just treat it like a primitive type and link it to CodecUtil#writeHeader where its
        format is described in detail.

        Show
        Robert Muir added a comment - Patch looks good! Can you update the fileformat javadocs (in Lucene40*Format.java) to document the codec header? In the other places we just treat it like a primitive type and link it to CodecUtil#writeHeader where its format is described in detail.
        Hide
        Simon Willnauer added a comment -

        here is a new patch adding per file headers (while reusing the version). I also cut over the docvalues headers to be per file.

        I think this is a sep. issues but if we can batch up File Format changes I think we should do it. I will look for other codecs to get this consistent where possible.

        Show
        Simon Willnauer added a comment - here is a new patch adding per file headers (while reusing the version). I also cut over the docvalues headers to be per file. I think this is a sep. issues but if we can batch up File Format changes I think we should do it. I will look for other codecs to get this consistent where possible.
        Hide
        Robert Muir added a comment -

        One more comment: i think actually we should use different codecnames for each file.

        (Like for .fdt versus .fdx and so on for the 3 term vectors files). This way we are
        validating that we actually read the stream we think we are reading.

        Show
        Robert Muir added a comment - One more comment: i think actually we should use different codecnames for each file. (Like for .fdt versus .fdx and so on for the 3 term vectors files). This way we are validating that we actually read the stream we think we are reading.
        Hide
        Simon Willnauer added a comment -

        By the way: we can use CodecUtil.headerLength to have a static final

        for the length of the header (rather than getFilePointer).

        new patch using a static HEADER_LENGTH for this. thanks robert for the pointer.

        Show
        Simon Willnauer added a comment - By the way: we can use CodecUtil.headerLength to have a static final for the length of the header (rather than getFilePointer). new patch using a static HEADER_LENGTH for this. thanks robert for the pointer.
        Hide
        Robert Muir added a comment -

        By the way: we can use CodecUtil.headerLength to have a static final
        for the length of the header (rather than getFilePointer).

        I think in general we should do this since the codec name (and thus
        header length) is still fixed.

        Show
        Robert Muir added a comment - By the way: we can use CodecUtil.headerLength to have a static final for the length of the header (rather than getFilePointer). I think in general we should do this since the codec name (and thus header length) is still fixed.
        Hide
        Michael McCandless added a comment -

        +1

        Show
        Michael McCandless added a comment - +1
        Hide
        Simon Willnauer added a comment -

        here is a first patch

        Show
        Simon Willnauer added a comment - here is a first patch

          People

          • Assignee:
            Simon Willnauer
            Reporter:
            Simon Willnauer
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development