Lucene - Core
  1. Lucene - Core
  2. LUCENE-4055

Refactor SegmentInfo / FieldInfo to make them extensible

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/codecs
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      After LUCENE-4050 is done the resulting SegmentInfo / FieldInfo classes should be made abstract so that they can be extended by Codec-s.

      1. LUCENE-4055.patch
        767 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          rough plan:

          • create branch for this and LUCENE-4050, the work is related really.
          • separate per commit information (SegmentList? Commit?) from per-segment metadata (SegmentInfo? SegmentMetadata?). the latter is basically si and fi. these names can change. maybe fi is still preserved mostly asis.
          • clean up the way per-segment metadata is used in such a way that it is abstract and minimal to what IndexWriter/MP needs, not full of codec specific stuff.
          • ensure codecs can privately write any metadat they need (eg hasProx) and that its accessible via all codec apis.
          • try to ensure codecs write things they need like hasProx to compute files() without iterating over fieldinfos. this would be a bonus, its currently a concern of mine.
          • add a basic introspection api (like map<string,string>) for tools like luke to be able to display codec private values.
          Show
          Robert Muir added a comment - rough plan: create branch for this and LUCENE-4050 , the work is related really. separate per commit information (SegmentList? Commit?) from per-segment metadata (SegmentInfo? SegmentMetadata?). the latter is basically si and fi. these names can change. maybe fi is still preserved mostly asis. clean up the way per-segment metadata is used in such a way that it is abstract and minimal to what IndexWriter/MP needs, not full of codec specific stuff. ensure codecs can privately write any metadat they need (eg hasProx) and that its accessible via all codec apis. try to ensure codecs write things they need like hasProx to compute files() without iterating over fieldinfos. this would be a bonus, its currently a concern of mine. add a basic introspection api (like map<string,string>) for tools like luke to be able to display codec private values.
          Hide
          Andrzej Bialecki added a comment -

          +1. Per commit information could be named CommitInfo (which it really is). I like SegmentMetadata - if we left SegmentInfo as a name it would be confusing with its current functionality.

          Introspection could return Map<String,Object> to avoid converting e.g. numeric values back and forth.

          Show
          Andrzej Bialecki added a comment - +1. Per commit information could be named CommitInfo (which it really is). I like SegmentMetadata - if we left SegmentInfo as a name it would be confusing with its current functionality. Introspection could return Map<String,Object> to avoid converting e.g. numeric values back and forth.
          Hide
          Robert Muir added a comment -
          Show
          Robert Muir added a comment - Branch location for this issue and LUCENE-4050 : https://svn.apache.org/repos/asf/lucene/dev/branches/lucene4055
          Hide
          Robert Muir added a comment -

          Just some updates from the work in the branch (scary changes but proceeding nicely since Mike jumped in and did a lot of it).
          Here's a list of the current progress:

          • on disk, the segments_N is reduced to the stuff that actually is per-commit: a list of segments and deleted gens/counts, etc.
          • per-segment metadata (doc count, diagnostics, etc) that is write-once is encoded by the codec, e.g. for 4.0's codec this is in the .si file.
          • removed backwards-seeking on segments_N. so appendingcodec still works but doesn't need any special hacks.
          • flush/merge order is changed so that fieldinfos are written last so codecs have a chance to add metadata to it.
          • fieldinfo has a "codec metadata" api that codec components can use, and that metadata will be available on reading the segment. this metadata
            is for the codec to use to extend fieldinfo, its not carried along during merge or anything.
          • PerFieldPostingsFormat is changed to use the fieldinfo metadata api rather than a separate .per file (e.g. it records that the "id" field uses Pulsing).
          • all the hairiness involving files() is really nice now, instead we simply track which files were created, and add them to the .si file. Previously
            there was a lot of logic to compute this in a symmetric way at both read and write time, and if you had a bug, your punishment was FNFE.

          not yet done:

          • add metadata api to segmentinfo too, so that codec components can record per-segment information that they care about.
          • see if we can implement 3.x's shared doc stores support with segmentinfo metadata api. This is tricky to do and for addIndexes/indexSplitter etc which
            do sneaky things to still work.
          • see if we can implement 3.x normGen (separate norms) with segmentinfo metadata. while in 3.x lucene this was actually per-commit, since 3.x support
            is read-only we can effectively treat it as per-segment this way.
          • rename stuff so that we have a clearer naming for some of these classes.

          I'm also probably missing a few other things. In general I'm pretty happy with the "metadata" key-value attributes api versus subclassing.

          I tried to make subclassing work, but subclassing turned really ugly fast and made various codec components too tightly-coupled, e.g.
          if someone wants to combine a CompressedStoredFields with a PerFieldPostingsFormat and SpecialTermVectors, what would the impls be .

          So the overly simple Map<String,String> avoids these issues, and hey its just metadata after all so I don't think anything more complex is really needed.

          Show
          Robert Muir added a comment - Just some updates from the work in the branch (scary changes but proceeding nicely since Mike jumped in and did a lot of it). Here's a list of the current progress: on disk, the segments_N is reduced to the stuff that actually is per-commit: a list of segments and deleted gens/counts, etc. per-segment metadata (doc count, diagnostics, etc) that is write-once is encoded by the codec, e.g. for 4.0's codec this is in the .si file. removed backwards-seeking on segments_N. so appendingcodec still works but doesn't need any special hacks. flush/merge order is changed so that fieldinfos are written last so codecs have a chance to add metadata to it. fieldinfo has a "codec metadata" api that codec components can use, and that metadata will be available on reading the segment. this metadata is for the codec to use to extend fieldinfo, its not carried along during merge or anything. PerFieldPostingsFormat is changed to use the fieldinfo metadata api rather than a separate .per file (e.g. it records that the "id" field uses Pulsing). all the hairiness involving files() is really nice now, instead we simply track which files were created, and add them to the .si file. Previously there was a lot of logic to compute this in a symmetric way at both read and write time, and if you had a bug, your punishment was FNFE. not yet done: add metadata api to segmentinfo too, so that codec components can record per-segment information that they care about. see if we can implement 3.x's shared doc stores support with segmentinfo metadata api. This is tricky to do and for addIndexes/indexSplitter etc which do sneaky things to still work. see if we can implement 3.x normGen (separate norms) with segmentinfo metadata. while in 3.x lucene this was actually per-commit, since 3.x support is read-only we can effectively treat it as per-segment this way. rename stuff so that we have a clearer naming for some of these classes. I'm also probably missing a few other things. In general I'm pretty happy with the "metadata" key-value attributes api versus subclassing. I tried to make subclassing work, but subclassing turned really ugly fast and made various codec components too tightly-coupled, e.g. if someone wants to combine a CompressedStoredFields with a PerFieldPostingsFormat and SpecialTermVectors, what would the impls be . So the overly simple Map<String,String> avoids these issues, and hey its just metadata after all so I don't think anything more complex is really needed.
          Hide
          Robert Muir added a comment -

          The last nocommits are cleared up: I think the branch is ready to be merged.

          Attached is a patch showing the differences between trunk and branch.

          Show
          Robert Muir added a comment - The last nocommits are cleared up: I think the branch is ready to be merged. Attached is a patch showing the differences between trunk and branch.
          Hide
          Andrzej Bialecki added a comment -

          +1, this looks very good.

          One comment re. SegmentInfoPerCommit. This class is not extensible and contains a fixed set of attributes. In LUCENE-3837 this or similar place would be the ideal mechanism to carry info about stacked segments, since this information is specific to a commit point. Unfortunately, there are no Map<String,String> attributes on this level, so I guess for now this type of aux data will have to be put in SegmentInfos.userData even though it's not index global but segment-specific.

          Show
          Andrzej Bialecki added a comment - +1, this looks very good. One comment re. SegmentInfoPerCommit. This class is not extensible and contains a fixed set of attributes. In LUCENE-3837 this or similar place would be the ideal mechanism to carry info about stacked segments, since this information is specific to a commit point. Unfortunately, there are no Map<String,String> attributes on this level, so I guess for now this type of aux data will have to be put in SegmentInfos.userData even though it's not index global but segment-specific.
          Hide
          Robert Muir added a comment -

          Right, but I think this is correct: the codec should be responsible for encode/decode of inverted index segments only (the whole problem here originally was trying to have it also look after commits).

          So it really shouldn't be customizing things about the commit, as that creates a confusing impedance mismatch.

          I think things like stacked segments in LUCENE-3837 that need to do things other than implement encoding/decoding of segment should be above the codec level: since its a separate concern, if someone wants to have updatable fields thats unrelated to the integer compression algorithm used or what not.

          Show
          Robert Muir added a comment - Right, but I think this is correct: the codec should be responsible for encode/decode of inverted index segments only (the whole problem here originally was trying to have it also look after commits). So it really shouldn't be customizing things about the commit, as that creates a confusing impedance mismatch. I think things like stacked segments in LUCENE-3837 that need to do things other than implement encoding/decoding of segment should be above the codec level: since its a separate concern, if someone wants to have updatable fields thats unrelated to the integer compression algorithm used or what not.
          Hide
          Andrzej Bialecki added a comment - - edited

          stacked segments in LUCENE-3837 that need to do things other than implement encoding/decoding of segment should be above the codec level ..

          Certainly, that's why it would make sense to put this extended info in SegmentInfoPerCommit and not in any file handled by Codec. My comment was about the lack of easy extensibility of the codec-independent per-segment data (SegmentInfoPerCommit - info about stacked data is per-segment and per-commit), so LUCENE-3837 will need to use for now the codec-independent index-global data (SegmentInfos). It's not ideal but not a deal breaker either, especially since we now have version info in both of these places.

          Show
          Andrzej Bialecki added a comment - - edited stacked segments in LUCENE-3837 that need to do things other than implement encoding/decoding of segment should be above the codec level .. Certainly, that's why it would make sense to put this extended info in SegmentInfoPerCommit and not in any file handled by Codec. My comment was about the lack of easy extensibility of the codec-independent per-segment data (SegmentInfoPerCommit - info about stacked data is per-segment and per-commit), so LUCENE-3837 will need to use for now the codec-independent index-global data (SegmentInfos). It's not ideal but not a deal breaker either, especially since we now have version info in both of these places.
          Hide
          Robert Muir added a comment -

          My comment was about the lack of easy extensibility of the codec-independent per-segment data (SegmentInfoPerCommit - info about stacked data is per-segment and per-commit), so LUCENE-3837 will need to use for now the codec-independent index-global data (SegmentInfos).

          Well SegmentInfos is just the list of SegmentInfoPerCommit, so I think in the case of LUCENE-3837 we would just place this data alongside the only other per-segment-per-commit data: deletes (e.g. we would add something like updatesGen and updatesCount or whatever).

          Sure, we have to bump the file header and what not, but the file is so simple now in the sense its just a list of segment names, the codec to decode them, with their deletes, that it wouldn't be a big deal.

          And I think per-segment-per-commit data is pretty rare: we only have deletes, and in the future updates, but I can't imagine lots of other stuff belonging in this category (versions the per-segment metadata in SI which has been pretty volatile in the past).

          Show
          Robert Muir added a comment - My comment was about the lack of easy extensibility of the codec-independent per-segment data (SegmentInfoPerCommit - info about stacked data is per-segment and per-commit), so LUCENE-3837 will need to use for now the codec-independent index-global data (SegmentInfos). Well SegmentInfos is just the list of SegmentInfoPerCommit, so I think in the case of LUCENE-3837 we would just place this data alongside the only other per-segment-per-commit data: deletes (e.g. we would add something like updatesGen and updatesCount or whatever). Sure, we have to bump the file header and what not, but the file is so simple now in the sense its just a list of segment names, the codec to decode them, with their deletes, that it wouldn't be a big deal. And I think per-segment-per-commit data is pretty rare: we only have deletes, and in the future updates, but I can't imagine lots of other stuff belonging in this category (versions the per-segment metadata in SI which has been pretty volatile in the past).
          Hide
          Renaud Delbru added a comment - - edited

          Does this patch allows TermsHashConsumerPerField to be dependent to the Codec/Field ? I have a use case where I need my own TermsHashConsumerPerField for certain fields.

          At the moment, the only extension of TermsHashConsumer is FreqProxTermsWriter which is creating a FreqProxTermsWriterPerField for every field (see FreqProxTermsWriter#addField()).

          A possible solution of my problem would be to have a PostingsTermsWriter (extension of TermsHashConsumer) that will create a specific PostingsTermsWriterPerField (extension of TermsHashConsumerPerField) depending on the codec information found in the FieldInfo object.

          Show
          Renaud Delbru added a comment - - edited Does this patch allows TermsHashConsumerPerField to be dependent to the Codec/Field ? I have a use case where I need my own TermsHashConsumerPerField for certain fields. At the moment, the only extension of TermsHashConsumer is FreqProxTermsWriter which is creating a FreqProxTermsWriterPerField for every field (see FreqProxTermsWriter#addField()). A possible solution of my problem would be to have a PostingsTermsWriter (extension of TermsHashConsumer) that will create a specific PostingsTermsWriterPerField (extension of TermsHashConsumerPerField) depending on the codec information found in the FieldInfo object.
          Hide
          Robert Muir added a comment -

          Renaud, what you are talking about is not really related to this issue: this issue is about allowing codecs to add metadata to SegmentInfo and FieldInfo.

          what you are talking about is the indexing chain, if you want to customize that, just set a custom one on IndexWriterConfig.

          Show
          Robert Muir added a comment - Renaud, what you are talking about is not really related to this issue: this issue is about allowing codecs to add metadata to SegmentInfo and FieldInfo. what you are talking about is the indexing chain, if you want to customize that, just set a custom one on IndexWriterConfig.
          Hide
          Renaud Delbru added a comment -

          Hi Robert,

          sorry if it seemed a bit out of context, but I am trying to understand how to properly do it.

          Indeed, I can create my own indexing chain which includes my TermsHashConsumer customisation. However, I would still need the codec metadata for every field. But from what you told me, it seems that this codec specific metadata could be now added to FeildInfo. Is that correct ?

          Show
          Renaud Delbru added a comment - Hi Robert, sorry if it seemed a bit out of context, but I am trying to understand how to properly do it. Indeed, I can create my own indexing chain which includes my TermsHashConsumer customisation. However, I would still need the codec metadata for every field. But from what you told me, it seems that this codec specific metadata could be now added to FeildInfo. Is that correct ?
          Hide
          Robert Muir added a comment -

          However, I would still need the codec metadata for every field.

          What codec metadata? this change only allows codecs to add additional things to fieldinfo/segmentinfo so they can later be read when the segment is opened. E.g. a CompressingStoredFieldsWriter could put in the segmentinfo an additional key-value like "CompressingStoredFieldsWriter.algorithm=deflate"

          These are private to that component basically: i don't understand why your indexing chain would care about this? its at a level above all that.

          Show
          Robert Muir added a comment - However, I would still need the codec metadata for every field. What codec metadata? this change only allows codecs to add additional things to fieldinfo/segmentinfo so they can later be read when the segment is opened. E.g. a CompressingStoredFieldsWriter could put in the segmentinfo an additional key-value like "CompressingStoredFieldsWriter.algorithm=deflate" These are private to that component basically: i don't understand why your indexing chain would care about this? its at a level above all that.
          Hide
          Renaud Delbru added a comment -

          What codec metadata?

          Metadata that indicates which codec is used for a particular field.

          Let say I want to have a specific TermsHashConsumerPerField depending on the codec used by a field. For example, for field A and field B which use the Lucen40 codec, we need to use the FreqProxTermsWriterPerField. And for field C that uses my own specific codec, I need to use the MyOwnTermsWriterPerField.

          My current understanding tells me that to do this, the only way is to customise the IndexingChain with a new TermsHashConsumer that overrides the method TermsHashConsumer#addField(TermsHashPerField termsHashPerField, FieldInfo fieldInfo). This method addField will be able to instantiate the correct TermsHashConsumerPerField if and only if there is codec metadata in the FieldInfo parameter. That's why I am interested of using a customised FieldInfo to store codec-related metadata about a field.

          Or is there a better way to get codec-related information about a field in the IndexingChain ?

          Show
          Renaud Delbru added a comment - What codec metadata? Metadata that indicates which codec is used for a particular field. Let say I want to have a specific TermsHashConsumerPerField depending on the codec used by a field. For example, for field A and field B which use the Lucen40 codec, we need to use the FreqProxTermsWriterPerField. And for field C that uses my own specific codec, I need to use the MyOwnTermsWriterPerField. My current understanding tells me that to do this, the only way is to customise the IndexingChain with a new TermsHashConsumer that overrides the method TermsHashConsumer#addField(TermsHashPerField termsHashPerField, FieldInfo fieldInfo). This method addField will be able to instantiate the correct TermsHashConsumerPerField if and only if there is codec metadata in the FieldInfo parameter. That's why I am interested of using a customised FieldInfo to store codec-related metadata about a field. Or is there a better way to get codec-related information about a field in the IndexingChain ?
          Hide
          Robert Muir added a comment -

          Codecs are not per-field, they encode the entire inverted index segment.

          Show
          Robert Muir added a comment - Codecs are not per-field, they encode the entire inverted index segment.
          Hide
          Renaud Delbru added a comment -

          Sorry, I meant PostingsFormat instead of Codec.

          Show
          Renaud Delbru added a comment - Sorry, I meant PostingsFormat instead of Codec.
          Hide
          Robert Muir added a comment -

          Well you can do postingsFormat instanceof PerFieldPostingsFormat + postingsFormat.getPostingsFormatForField if you really want.

          But keep in mind PerFieldPostingsFormat is not really special and just one we provide for convenience, obviously one could write their own PostingsFormat
          that implements the same thing in a different way.

          Show
          Robert Muir added a comment - Well you can do postingsFormat instanceof PerFieldPostingsFormat + postingsFormat.getPostingsFormatForField if you really want. But keep in mind PerFieldPostingsFormat is not really special and just one we provide for convenience, obviously one could write their own PostingsFormat that implements the same thing in a different way.
          Hide
          Grant Ingersoll added a comment -

          Michael McCandless what's the replacement strategy for IndexFileNameFilter? I'm updating some old code from 3.x to 4.3 (MAHOUT-944) and not sure on what the equivalent approach is, or whether I even need it. (Note, I'm still trying to figure out whether the patch for MAHOUT-944 is even the best way to do what it is trying to do, but I want to at least get it compiling first)

          Show
          Grant Ingersoll added a comment - Michael McCandless what's the replacement strategy for IndexFileNameFilter? I'm updating some old code from 3.x to 4.3 ( MAHOUT-944 ) and not sure on what the equivalent approach is, or whether I even need it. (Note, I'm still trying to figure out whether the patch for MAHOUT-944 is even the best way to do what it is trying to do, but I want to at least get it compiling first)
          Hide
          Michael McCandless added a comment -

          Hi Grant Ingersoll, I think you can use the IndexFileNames.CODEC_FILE_PATTERN in your filter? You may need to add in segments_N and segments.gen as well ...

          Show
          Michael McCandless added a comment - Hi Grant Ingersoll , I think you can use the IndexFileNames.CODEC_FILE_PATTERN in your filter? You may need to add in segments_N and segments.gen as well ...
          Hide
          Grant Ingersoll added a comment -

          Hmm, Mike, CODEC_FILE_PATTERN is package access only. Easy enough to replicate/fix, any reason not too?

          Show
          Grant Ingersoll added a comment - Hmm, Mike, CODEC_FILE_PATTERN is package access only. Easy enough to replicate/fix, any reason not too?
          Hide
          Michael McCandless added a comment -

          Hmm looks like it's package private in 4.3 but is (will be) public in 4.x/trunk. Just replicate for now

          Show
          Michael McCandless added a comment - Hmm looks like it's package private in 4.3 but is (will be) public in 4.x/trunk. Just replicate for now

            People

            • Assignee:
              Robert Muir
              Reporter:
              Andrzej Bialecki
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development