Lucene - Core
  1. Lucene - Core
  2. LUCENE-4050

Make segments_NN file codec-independent

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/codecs
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I propose to change the format of SegmentInfos file (segments_NN) to use plain text instead of the current binary format.

      SegmentInfos file represents a commit point, and it also declares what codecs were used for writing each of the segments that the commit point consists of. However, this is a chicken and egg situation - in theory the format of this file is customizable via Codec.getSegmentInfosFormat, but in practice we have to first discover what is the codec implementation that wrote this file - so the SegmentCoreReaders assumes a certain fixed binary layout of a preamble of this file that contains the codec name... and then the file is read again, only this time using the right Codec.

      This is ugly. Instead I propose to use a simple plain text format, either line oriented properties or JSON, in such a way that newer versions could easily extend it, and which wouldn't require any special Codec to read and parse. Consequently we could remove SegmentInfosFormat altogether, and instead add SegmentInfoFormat (notice the singular) to Codec to read single per-segment SegmentInfo-s in a codec-specific way. E.g. for Lucene40 codec we could either add another file or we could extend the .fnm file (FieldInfos) to contain also this information.

      Then the plain text SegmentInfos would contain just the following information:

      • list of global files for this commit point (if any)
      • list of segments for this commit point, and their corresponding codec class names
      • user data map

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          I agree this is a total mess. We should really revisit how we handle:

          1. commit file (in my opinion this should just be a list of segments! only!)
            currently segmentinfos stores a ton of stuff more than this, it stores
            per-segment metadata within this file when it really should not.
          2. per-segment metadata. In this case we have a lot of confusion with
            segmentinfo and fieldinfo. It would be great for the codec to have more
            flexibility here, via abstract classes/interfaces+attributes or something
            that ensures its lossless yet still a codec can add what it needs. Really
            for the most part segmentinfo is basically useless since many values actually
            return "well if you want to know this, then go look at the fieldinfos".
          3. actual commit strategy. We do a lot of funky stuff like writing fake bogus
            data, seeking backwards, etc. Why not just a normal atomic rename like
            any other computer program on the planet????
          Show
          Robert Muir added a comment - I agree this is a total mess. We should really revisit how we handle: commit file (in my opinion this should just be a list of segments! only!) currently segmentinfos stores a ton of stuff more than this, it stores per-segment metadata within this file when it really should not. per-segment metadata. In this case we have a lot of confusion with segmentinfo and fieldinfo. It would be great for the codec to have more flexibility here, via abstract classes/interfaces+attributes or something that ensures its lossless yet still a codec can add what it needs. Really for the most part segmentinfo is basically useless since many values actually return "well if you want to know this, then go look at the fieldinfos". actual commit strategy. We do a lot of funky stuff like writing fake bogus data, seeking backwards, etc. Why not just a normal atomic rename like any other computer program on the planet????
          Hide
          Michael McCandless added a comment -

          +1 to fully separate (separate files maybe?) the codec-neutral "list
          of committed segments" from "the codec-specific details/metadata for
          each segment".

          Then, a codec can easily store its own stuff in the segment metadata.

          And I agree the FieldInfo/SegmentInfo duality is confusing...

          Plain text encoding of these files would be really nice but isn't as
          important, I think... and will be a fair amount of work (I suspect we
          need a JSON or YAML or something that represents lists, maps,
          different native types, etc.). I think this is separate / can come
          later.

          We do a lot of funky stuff like writing fake bogus
          data, seeking backwards, etc. Why not just a normal atomic rename like
          any other computer program on the planet????

          In fact Lucene used to use rename to commit the segments file but this
          proved problematic on Windows (sometimes the rename would hit "access
          denied" error).

          Show
          Michael McCandless added a comment - +1 to fully separate (separate files maybe?) the codec-neutral "list of committed segments" from "the codec-specific details/metadata for each segment". Then, a codec can easily store its own stuff in the segment metadata. And I agree the FieldInfo/SegmentInfo duality is confusing... Plain text encoding of these files would be really nice but isn't as important, I think... and will be a fair amount of work (I suspect we need a JSON or YAML or something that represents lists, maps, different native types, etc.). I think this is separate / can come later. We do a lot of funky stuff like writing fake bogus data, seeking backwards, etc. Why not just a normal atomic rename like any other computer program on the planet???? In fact Lucene used to use rename to commit the segments file but this proved problematic on Windows (sometimes the rename would hit "access denied" error).
          Hide
          Robert Muir added a comment -

          In fact Lucene used to use rename to commit the segments file but this
          proved problematic on Windows (sometimes the rename would hit "access
          denied" error).

          Well, problematic at least once right? I dont think it justifies doing
          things a strange way.

          Surely this is just some problem only on windows 3.1 and java 1.2 or
          something and now fixed, since this is how every other linux/cygwin program
          (e.g. vi) works.

          Show
          Robert Muir added a comment - In fact Lucene used to use rename to commit the segments file but this proved problematic on Windows (sometimes the rename would hit "access denied" error). Well, problematic at least once right? I dont think it justifies doing things a strange way. Surely this is just some problem only on windows 3.1 and java 1.2 or something and now fixed, since this is how every other linux/cygwin program (e.g. vi) works.
          Hide
          Andrzej Bialecki added a comment -

          Plain text encoding of these files would be really nice but isn't as important, I think...

          Yeah, it could be sufficient if we would agree on necessarily separate the "plain list of segments:codec" from the segmentInfo/fieldInfo parts and push those parts down to the codec-specific formats.

          Then we could just use a version number as the first element of this file to allow for extensions in the future, like e.g. switching to JSON or to some other format du jour.

          Surely this is just some problem only on windows 3.1 and java 1.2 or something and now fixed, since this is how every other linux/cygwin program (e.g. vi) works.

          I'm not so sure. I know for a fact that Windows doesn't allow renames or deletes of open files, no matter if it's open by you or by some other process (e.g. user examining the file in Notepad.exe), and IIRC the issue was that JVM doesn't release OS file handles quickly enough.

          Show
          Andrzej Bialecki added a comment - Plain text encoding of these files would be really nice but isn't as important, I think... Yeah, it could be sufficient if we would agree on necessarily separate the "plain list of segments:codec" from the segmentInfo/fieldInfo parts and push those parts down to the codec-specific formats. Then we could just use a version number as the first element of this file to allow for extensions in the future, like e.g. switching to JSON or to some other format du jour. Surely this is just some problem only on windows 3.1 and java 1.2 or something and now fixed, since this is how every other linux/cygwin program (e.g. vi) works. I'm not so sure. I know for a fact that Windows doesn't allow renames or deletes of open files, no matter if it's open by you or by some other process (e.g. user examining the file in Notepad.exe), and IIRC the issue was that JVM doesn't release OS file handles quickly enough.
          Hide
          Andrzej Bialecki added a comment -

          Discussing this further with Robert, it looks like this is a (smaller) part of a larger issue, in that SegmentInfo+FieldInfo should be made extensible and the process of reading/writing this information should be completely codec-specific. Let's make a separate issue for that part.

          And the smaller issue discussed here is to record only the information about a commit point in a completely codec-independent, versioned format, whatever that format is. Let's call it CommitInfo or whatever other name fits. This part would be written to a file that is separate from the codec-dependent parts.

          Regarding two-phase commit and checksums - one reason we have SegmentInfosWriter/Reader was the AppendingCodec, because we couldn't make it work for append-only filesystems. However, we could change the two-phase commit implementation to the following:

          • write the data to the CommitInfo file
          • write a marker indicating "end of data, checksum follows"
          • finally, write the checksum

          Then the reading code knows that:

          • if there's a marker missing then the file is invalid
          • if the marker is present then the checksum must be present too
          • and the checksum must be correct.

          This implementation doesn't require seek back / overwrite so it's supported on any filesystem.

          Show
          Andrzej Bialecki added a comment - Discussing this further with Robert, it looks like this is a (smaller) part of a larger issue, in that SegmentInfo+FieldInfo should be made extensible and the process of reading/writing this information should be completely codec-specific . Let's make a separate issue for that part. And the smaller issue discussed here is to record only the information about a commit point in a completely codec-independent, versioned format , whatever that format is. Let's call it CommitInfo or whatever other name fits. This part would be written to a file that is separate from the codec-dependent parts. Regarding two-phase commit and checksums - one reason we have SegmentInfosWriter/Reader was the AppendingCodec, because we couldn't make it work for append-only filesystems. However, we could change the two-phase commit implementation to the following: write the data to the CommitInfo file write a marker indicating "end of data, checksum follows" finally, write the checksum Then the reading code knows that: if there's a marker missing then the file is invalid if the marker is present then the checksum must be present too and the checksum must be correct. This implementation doesn't require seek back / overwrite so it's supported on any filesystem.
          Hide
          Andrzej Bialecki added a comment -

          Changing the title to better reflect the scope of this issue.

          Show
          Andrzej Bialecki added a comment - Changing the title to better reflect the scope of this issue.
          Hide
          Andrzej Bialecki added a comment -

          It's actually a bug - it's not possible to cleanly extend index format via Codec-s without addressing this issue.

          Show
          Andrzej Bialecki added a comment - It's actually a bug - it's not possible to cleanly extend index format via Codec-s without addressing this issue.
          Hide
          Michael McCandless added a comment -

          However, we could change the two-phase commit implementation to the following:

          I think that's a good solution? It seems important to keep the non-codec-controlled write/read as simple as possible...

          The only small thing we lose is if a disk full is going to strike... today we write the 0s ahead (in prepareCommit) so that we'll hit disk full during prepareCommit and not commit... but I think the chance of those 4 bytes hitting the disk full is very low so the simpler code is better...

          Show
          Michael McCandless added a comment - However, we could change the two-phase commit implementation to the following: I think that's a good solution? It seems important to keep the non-codec-controlled write/read as simple as possible... The only small thing we lose is if a disk full is going to strike... today we write the 0s ahead (in prepareCommit) so that we'll hit disk full during prepareCommit and not commit... but I think the chance of those 4 bytes hitting the disk full is very low so the simpler code is better...
          Hide
          Andrzej Bialecki added a comment -

          The only small thing we lose is if a disk full is going to strike...

          I thought about this too - if it's really a big concern we could use the following trick: > 99% filesystems keep data in blocks that are multiples of 512 bytes. We could add filler bytes at the end of the file so that it comes out to a round multiple of 512 B, and only then append the marker and the checksum. This way we will know that writing a marker required allocation of a new block, and if it succeeded then writing a checksum should also succeed.

          Show
          Andrzej Bialecki added a comment - The only small thing we lose is if a disk full is going to strike... I thought about this too - if it's really a big concern we could use the following trick: > 99% filesystems keep data in blocks that are multiples of 512 bytes. We could add filler bytes at the end of the file so that it comes out to a round multiple of 512 B, and only then append the marker and the checksum. This way we will know that writing a marker required allocation of a new block, and if it succeeded then writing a checksum should also succeed.
          Hide
          Marvin Humphrey added a comment -

          Ever considered using hard links instead of renaming?

          Show
          Marvin Humphrey added a comment - Ever considered using hard links instead of renaming?
          Hide
          Michael McCandless added a comment -

          Ever considered using hard links instead of renaming?

          That's a neat option ... but I think it's only in Java 7 that we can create hard links (java.nio.file.Files.createLink)? And even then it's an optional operation...

          Show
          Michael McCandless added a comment - Ever considered using hard links instead of renaming? That's a neat option ... but I think it's only in Java 7 that we can create hard links (java.nio.file.Files.createLink)? And even then it's an optional operation...
          Hide
          Hoss Man added a comment -

          bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment

          Show
          Hoss Man added a comment - bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment
          Hide
          Robert Muir added a comment -

          This was never resolved (segments_N file is codec-independent now)

          Show
          Robert Muir added a comment - This was never resolved (segments_N file is codec-independent now)

            People

            • Assignee:
              Robert Muir
              Reporter:
              Andrzej Bialecki
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development