• Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/codecs
    • Labels:
    • Lucene Fields:


      I propose changing the format of the SegmentInfos file (segments_NN) to plain text instead of the current binary format.

      The SegmentInfos file represents a commit point, and it also declares which codec was used to write each of the segments that make up the commit point. This creates a chicken-and-egg situation: in theory the format of this file is customizable via Codec.getSegmentInfosFormat, but in practice we first have to discover which codec implementation wrote the file. So SegmentCoreReaders assumes a fixed binary layout for a preamble of the file that contains the codec name, and then the file is read again, this time using the right Codec.

      This is ugly. Instead, I propose a simple plain text format, either line-oriented properties or JSON, that newer versions could easily extend and that would not require any special Codec to read and parse. Consequently, we could remove SegmentInfosFormat altogether and instead add SegmentInfoFormat (note the singular) to Codec, to read individual per-segment SegmentInfo-s in a codec-specific way. E.g., for the Lucene40 codec we could either add another file or extend the .fnm file (FieldInfos) to also contain this information.

      Then the plain text SegmentInfos would contain just the following information:

      • list of global files for this commit point (if any)
      • list of segments for this commit point, and their corresponding codec class names
      • user data map
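
To make the proposal concrete, here is a minimal sketch of how the three items above could be read from a line-oriented properties file without involving any Codec. Everything in it is an assumption made for illustration: the class name PlainTextCommitSketch, the key names (globalFiles, segment.count, segment.N.name, segment.N.codec, userData.*) and the layout are hypothetical, and this is not the format that was eventually committed to Lucene.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical sketch only: not Lucene code, and the key names are invented.
final class PlainTextCommitSketch {
    // Files shared by the whole commit point (may be empty).
    final List<String> globalFiles = new ArrayList<>();
    // Segment name -> codec class name, in the order the segments are listed.
    final Map<String, String> segmentToCodec = new LinkedHashMap<>();
    // Free-form user data attached to the commit.
    final Map<String, String> userData = new TreeMap<>();

    static PlainTextCommitSketch parse(String text) throws IOException {
        // java.util.Properties already understands "key=value" lines,
        // so no codec-specific reader is needed for this step.
        Properties props = new Properties();
        props.load(new StringReader(text));
        PlainTextCommitSketch commit = new PlainTextCommitSketch();

        // Comma-separated list; an empty value means "no global files".
        String files = props.getProperty("globalFiles", "").trim();
        if (!files.isEmpty()) {
            for (String f : files.split(",")) {
                commit.globalFiles.add(f.trim());
            }
        }

        // Each segment contributes a name and the codec class that wrote it.
        int count = Integer.parseInt(props.getProperty("segment.count", "0").trim());
        for (int i = 0; i < count; i++) {
            String name = props.getProperty("segment." + i + ".name");
            String codec = props.getProperty("segment." + i + ".codec");
            if (name == null || codec == null) {
                throw new IOException("incomplete entry for segment " + i);
            }
            commit.segmentToCodec.put(name, codec);
        }

        // Any key under the "userData." prefix goes into the user data map.
        String prefix = "userData.";
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(prefix)) {
                commit.userData.put(key.substring(prefix.length()), props.getProperty(key));
            }
        }
        return commit;
    }
}
```

With a layout like this, a reader would first learn each segment's codec class name from the plain text file and only then hand the per-segment details to that codec, which avoids reading the file twice. Unknown keys are simply ignored by the parser, which is one way newer versions could extend the file.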



          Uwe Schindler made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Gavin made changes -
          Link This issue is depended upon by LUCENE-4055 [ LUCENE-4055 ]
          Gavin made changes -
          Link This issue blocks LUCENE-4055 [ LUCENE-4055 ]
          Robert Muir made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 4.0-ALPHA [ 12314025 ]
          Fix Version/s 4.0 [ 12322456 ]
          Resolution Fixed [ 1 ]
          Hoss Man made changes -
          Fix Version/s 4.0 [ 12322456 ]
          Fix Version/s 4.0-ALPHA [ 12314025 ]
          Andrzej Bialecki made changes -
          Assignee Robert Muir [ rcmuir ]
          Andrzej Bialecki made changes -
          Issue Type Improvement [ 4 ] Bug [ 1 ]
          Andrzej Bialecki made changes -
          Link This issue blocks LUCENE-4055 [ LUCENE-4055 ]
          Andrzej Bialecki made changes -
          Field Original Value New Value
          Summary "Change SegmentInfos format to plain text" "Make segments_NN file codec-independent"
          Andrzej Bialecki created issue -


            • Assignee:
              Robert Muir
            • Reporter:
              Andrzej Bialecki
            • Votes:
              0
            • Watchers:
              5


              • Created: