Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index, core/store
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently the codec name is stored for every segment and PerFieldCodecWrapper is used to enable codecs per fields which has recently brought up some issues (LUCENE-2740 and LUCENE-2741). When a codec name is stored lucene does not respect the actual codec used to encode a fields postings but rather the "top-level" Codec in such a case the name of the top-level codec is "PerField" instead of "Pulsing" or "Standard" etc. The way this composite pattern works make the indexing part of codecs simpler but also limits its capabilities. By recoding the top-level codec in the segments file we rely on the user to "configure" the PerFieldCodecWrapper correctly to open a SegmentReader. If a fields codec has changed in the meanwhile we won't be able to open the segment.

      The issues LUCENE-2741 and LUCENE-2740 are actually closely related to the way PFCW is implemented right now. PFCW blindly creates codecs per field on request and at the same time doesn't have any control over the file naming nor if a two codec instances are created for two distinct fields even if the codec instance is the same. If so FieldsConsumer will throw an exception since the files it relies on are already created.

      Having PerFieldCodecWrapper AND a CodecProvider overcomplicates things IMO. In order to use per field codec a user should on the one hand register its custom codecs AND needs to build a PFCW which needs to be maintained in the "user-land" an must not change incompatible once a new IW of IR is created. What I would expect from Lucene is to enable me to register a codec in a provider and then tell the Field which codec it should use for indexing. For reading lucene should determ the codec automatically once a segment is opened. if the codec is not available in the provider that is a different story. Once we instantiate the composite codec in SegmentsReader we only have the codecs which are really used in this segment for free which in turn solves LUCENE-2740.

      Yet, instead of relying on the user to configure PFCW I suggest to move composite codec functionality inside the core an record the distinct codecs per segment in the segments info. We only really need the distinct codecs used in that segment since the codec instance should be reused to prevent additional files to be created. Lets say we have the follwing codec mapping :

      field_a:Pulsing
      field_b:Standard
      field_c:Pulsing
      

      then we create the following mapping:

      SegmentInfo:
      [Pulsing, Standard]
      
      PerField:
      [field_a:0, field_b:1, field_c:0]
      

      that way we can easily determ which codec is used for which field an build the composite - codec internally on opening SegmentsReader. This ordering has another advantage, if like in that case pulsing and standard use really the same type of files we need a way to distinguish the used files per codec within a segment. We can in turn pass the codec's ord (implicit in the SegmentInfo) to the FieldConsumer on creation to create files with segmentname_ord.ext (or something similar). This solvel LUCENE-2741).

      1. LUCENE-2742.patch
        66 kB
        Simon Willnauer
      2. LUCENE-2742.patch
        74 kB
        Simon Willnauer
      3. LUCENE-2742.patch
        75 kB
        Simon Willnauer
      4. LUCENE-2742.patch
        76 kB
        Simon Willnauer
      5. LUCENE-2742.patch
        78 kB
        Simon Willnauer

        Issue Links

          Activity

          Hide
          Simon Willnauer added a comment -

          committed in revision 1033577

          I will send a heads up to the list due to the file format change

          Show
          Simon Willnauer added a comment - committed in revision 1033577 I will send a heads up to the list due to the file format change
          Hide
          Simon Willnauer added a comment -

          +1 to commit - great work Simon!

          I should note that this patch changes the index file format and it is not compatible with indexes build with previous TRUNK version. Indexes build with previous lucene versions will still work like a charm

          I will go ahead and commit

          Show
          Simon Willnauer added a comment - +1 to commit - great work Simon! I should note that this patch changes the index file format and it is not compatible with indexes build with previous TRUNK version. Indexes build with previous lucene versions will still work like a charm I will go ahead and commit
          Hide
          Michael McCandless added a comment -

          +1 to commit – great work Simon!

          Show
          Michael McCandless added a comment - +1 to commit – great work Simon!
          Hide
          Simon Willnauer added a comment -

          I added javadoc to SegmentCodecs and PerFieldCodecWrapper even they are @lucene.internal. I think it helps other devs to get into the internal and hopefully transports knowledge. I think we should make that mandatory for new classes and changes to at least document the big picture

          Show
          Simon Willnauer added a comment - I added javadoc to SegmentCodecs and PerFieldCodecWrapper even they are @lucene.internal. I think it helps other devs to get into the internal and hopefully transports knowledge. I think we should make that mandatory for new classes and changes to at least document the big picture
          Hide
          Simon Willnauer added a comment -

          the testcases are included in this patch which fixes the problem

          Show
          Simon Willnauer added a comment - the testcases are included in this patch which fixes the problem
          Hide
          Simon Willnauer added a comment -

          added JavaDoc to CodecProvider. I think we are ready to commit.

          Show
          Simon Willnauer added a comment - added JavaDoc to CodecProvider. I think we are ready to commit.
          Hide
          Simon Willnauer added a comment -

          Another iteration.

          I fixed some code style violations (thanks mike , renamed CodecInfo to SegmentCodecs and made PerFieldCodecWrapper and SegmentCodecs package private. I also removed the suger from Fieldable and AbstractField. This patch also makes CodecProvider a real class (not abstract) and adds write once behavior to CodecProvider#setFieldCodec()

          I think this is very close now.

          Show
          Simon Willnauer added a comment - Another iteration. I fixed some code style violations (thanks mike , renamed CodecInfo to SegmentCodecs and made PerFieldCodecWrapper and SegmentCodecs package private. I also removed the suger from Fieldable and AbstractField. This patch also makes CodecProvider a real class (not abstract) and adds write once behavior to CodecProvider#setFieldCodec() I think this is very close now.
          Hide
          Simon Willnauer added a comment -

          Updated patch which follows a slightly different approach. I build the codec ID record during segment flush since I learned the hard way that FieldInfo is maintained per IW session. I also added testcases including the ones from LUCENE-2740 an all test pass for me.

          Review and comments highly appreciated (marvin, mike?

          PS: I will leave the files support to LUCENE-2741 but this already provide the infrastructure...

          Show
          Simon Willnauer added a comment - Updated patch which follows a slightly different approach. I build the codec ID record during segment flush since I learned the hard way that FieldInfo is maintained per IW session. I also added testcases including the ones from LUCENE-2740 an all test pass for me. Review and comments highly appreciated (marvin, mike? PS: I will leave the files support to LUCENE-2741 but this already provide the infrastructure...
          Hide
          Simon Willnauer added a comment -

          Here is a first patch - all tests pass. I changed the CodecProvider interface slightly to be able to hold perField codecs as well as a default perField codec. For simplicity users can not register their codec directly though the Fieldable interface. Internally I added a CodecInfo which handles all the ordering and registration per segment / field. For consistency I bound CodecInfo to FieldInfos since we are now operating per field. A codec can only be assigned once, at the first time we see the codec during FieldInfos creation.

          there is a nocommit on Fieldable since it doesn't have javadoc but lets iterate first to see if we wanna go that path - it seems close.

          Show
          Simon Willnauer added a comment - Here is a first patch - all tests pass. I changed the CodecProvider interface slightly to be able to hold perField codecs as well as a default perField codec. For simplicity users can not register their codec directly though the Fieldable interface. Internally I added a CodecInfo which handles all the ordering and registration per segment / field. For consistency I bound CodecInfo to FieldInfos since we are now operating per field. A codec can only be assigned once, at the first time we see the codec during FieldInfos creation. there is a nocommit on Fieldable since it doesn't have javadoc but lets iterate first to see if we wanna go that path - it seems close.

            People

            • Assignee:
              Simon Willnauer
              Reporter:
              Simon Willnauer
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development