Lucene - Core
  1. Lucene - Core
  2. LUCENE-4399

Rename AppendingCodec to Appending40Codec

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1, Trunk
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      In order AppendingCodec to follow Lucene codecs version, I think its name should include a version number (so that, for example, if we get to releave Lucene 4.3 with a new Lucene43Codec, there will also be a new Appending43Codec).

      1. LUCENE-4399.patch
        21 kB
        Adrien Grand
      2. LUCENE-4399.patch
        5 kB
        Robert Muir
      3. LUCENE-4399.patch
        6 kB
        Robert Muir
      4. LUCENE-4399.patch
        47 kB
        Robert Muir
      5. LUCENE-4399.patch
        51 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        I'm not sure of the need for this: we don't need to provide back compat for anything in the codecs/ module,
        just the default codec?

        Show
        Robert Muir added a comment - I'm not sure of the need for this: we don't need to provide back compat for anything in the codecs/ module, just the default codec?
        Hide
        Robert Muir added a comment -

        I wasn't saying i was totally against the idea btw, i was just trying to invite some discussion..., just to explain
        my rationale:

        Personally I dont think its that great we have version numbers in our default codec, I was the one that proposed this
        because its simple on our end as Lucene developers to implement backwards compatibility this way: but it encourages
        some code duplication and stuff like that: which is in my mind, a lesser evil than conditionals inside one "mega-impl"
        like before, that must handle different binary formats.

        And i think this is pretty contained to expert users: most people will just create an indexwriter and be oblivious to this.

        As far as alternative things in codecs/, I feel it would be best to not add complication for back-compat, instead to
        try to keep these implementations simple. If we start adding versioning and back compat to them, then we are basically
        expanding our backwards compatibility commitments to N versions * M formats, which is something that requires a very
        big discussion.

        Show
        Robert Muir added a comment - I wasn't saying i was totally against the idea btw, i was just trying to invite some discussion..., just to explain my rationale: Personally I dont think its that great we have version numbers in our default codec, I was the one that proposed this because its simple on our end as Lucene developers to implement backwards compatibility this way: but it encourages some code duplication and stuff like that: which is in my mind, a lesser evil than conditionals inside one "mega-impl" like before, that must handle different binary formats. And i think this is pretty contained to expert users: most people will just create an indexwriter and be oblivious to this. As far as alternative things in codecs/, I feel it would be best to not add complication for back-compat, instead to try to keep these implementations simple. If we start adding versioning and back compat to them, then we are basically expanding our backwards compatibility commitments to N versions * M formats, which is something that requires a very big discussion.
        Hide
        Adrien Grand added a comment -

        which is in my mind, a lesser evil than conditionals inside one "mega-impl" like before

        Agreed.

        As far as alternative things in codecs/, I feel it would be best to not add complication for back-compat, instead to try to keep these implementations simple.

        I think we should do our best so that backwards-compatiblity issues in lucene-codecs don't happen because of changes in the default Lucene codec or postings format.

        For example, if we release a new Lucene43PostingsFormat, we will either need to create a new Pulsing43PostingsFormat (to maintain backwards compatibility) or just replace the current Pulsing40PostingsFormat with Pulsing43PostingsFormat (if we don't care).

        Maybe we could remove Pulsing40PostingsFormat and modify PulsingPostingsFormat so that it:

        • wraps the current default postings format by default,
        • writes the wrapped format name at indexing/merge time,
        • and then uses it back at read time (similarly to what BloomFilter does).

        The only modification required if we release a new Lucene43PostingsFormat will be to change the default wrapped postings format.

        I think we should be able to do this with AppendingCodec and DirectPostingsFormat too.

        Show
        Adrien Grand added a comment - which is in my mind, a lesser evil than conditionals inside one "mega-impl" like before Agreed. As far as alternative things in codecs/, I feel it would be best to not add complication for back-compat, instead to try to keep these implementations simple. I think we should do our best so that backwards-compatiblity issues in lucene-codecs don't happen because of changes in the default Lucene codec or postings format. For example, if we release a new Lucene43PostingsFormat, we will either need to create a new Pulsing43PostingsFormat (to maintain backwards compatibility) or just replace the current Pulsing40PostingsFormat with Pulsing43PostingsFormat (if we don't care). Maybe we could remove Pulsing40PostingsFormat and modify PulsingPostingsFormat so that it: wraps the current default postings format by default, writes the wrapped format name at indexing/merge time, and then uses it back at read time (similarly to what BloomFilter does). The only modification required if we release a new Lucene43PostingsFormat will be to change the default wrapped postings format. I think we should be able to do this with AppendingCodec and DirectPostingsFormat too.
        Hide
        Robert Muir added a comment -

        It would be great to simplify Pulsing the way that you suggest!

        But I have a tough time with it, it doesnt actually wrap any arbitrary postings,
        instead only ones that extend "PostingsBaseFormat" (and this extension is a little confusing,
        any ideas towards refactoring it to be cleaner I think would be helpful).

        If you ask me right now, I don't remember the reason why Pulsing needs such a low-level
        interaction (versus just working on the higher-level e.g. termsenum apis).

        Show
        Robert Muir added a comment - It would be great to simplify Pulsing the way that you suggest! But I have a tough time with it, it doesnt actually wrap any arbitrary postings, instead only ones that extend "PostingsBaseFormat" (and this extension is a little confusing, any ideas towards refactoring it to be cleaner I think would be helpful). If you ask me right now, I don't remember the reason why Pulsing needs such a low-level interaction (versus just working on the higher-level e.g. termsenum apis).
        Hide
        Michael McCandless added a comment -

        +1 to fixing any wrapped PostingsFormats to save the name of what they wrapped into the index.

        Pulsing currently wraps a PostingsBaseFormat because it needs to "intervene" on a term by term basis on the communication b/w the terms dict and the postings format it wraps. It would be really nice if we could wrap a PostingsFormat instead of a PostingsBaseFormat instead ... I'm just not sure how.

        Separately I really don't like the name PostingsBaseFormat But I can't think of something better ... it's basically the PostingsFormat minus the terms dict.

        Show
        Michael McCandless added a comment - +1 to fixing any wrapped PostingsFormats to save the name of what they wrapped into the index. Pulsing currently wraps a PostingsBaseFormat because it needs to "intervene" on a term by term basis on the communication b/w the terms dict and the postings format it wraps. It would be really nice if we could wrap a PostingsFormat instead of a PostingsBaseFormat instead ... I'm just not sure how. Separately I really don't like the name PostingsBaseFormat But I can't think of something better ... it's basically the PostingsFormat minus the terms dict.
        Hide
        Adrien Grand added a comment -

        It would be really nice if we could wrap a PostingsFormat instead of a PostingsBaseFormat instead ... I'm just not sure how.

        That would be great. This way we wouldn't need to create a SPI loader for PostingsBaseFormat in order to let Pulsing store/restore the wrapped PostingsBaseFormat.

        But I can't think of something better ... it's basically the PostingsFormat minus the terms dict.

        So maybe it is PostingsFormat that should be renamed (PostingsAndTermsFormat?)

        Show
        Adrien Grand added a comment - It would be really nice if we could wrap a PostingsFormat instead of a PostingsBaseFormat instead ... I'm just not sure how. That would be great. This way we wouldn't need to create a SPI loader for PostingsBaseFormat in order to let Pulsing store/restore the wrapped PostingsBaseFormat. But I can't think of something better ... it's basically the PostingsFormat minus the terms dict. So maybe it is PostingsFormat that should be renamed (PostingsAndTermsFormat?)
        Hide
        Adrien Grand added a comment -

        This patch attemps to fix the problem for the Appending codec and the Direct postings format (BloomFilter already serialized the name of its wrapped postings format).

        I am not fully satisfied with the way I handle AppendingCodec, suggestions are welcome.

        I had a look at Pulsing but this looks complicated (to me at least)... For Pulsing, should we rather:

        • use a SPI loader for PostingsBaseFormat,
        • or leave it this way for 4.0?
        Show
        Adrien Grand added a comment - This patch attemps to fix the problem for the Appending codec and the Direct postings format (BloomFilter already serialized the name of its wrapped postings format). I am not fully satisfied with the way I handle AppendingCodec, suggestions are welcome. I had a look at Pulsing but this looks complicated (to me at least)... For Pulsing, should we rather: use a SPI loader for PostingsBaseFormat, or leave it this way for 4.0?
        Hide
        Robert Muir added a comment -

        I dont want to use an SPI loader for postingsbaseformat.

        i dont even like postingsbaseformat

        SPI is really heavy and we should avoid it, I think we just need to look at how we can refactor
        this thing so it works the way we want.

        Show
        Robert Muir added a comment - I dont want to use an SPI loader for postingsbaseformat. i dont even like postingsbaseformat SPI is really heavy and we should avoid it, I think we just need to look at how we can refactor this thing so it works the way we want.
        Hide
        Robert Muir added a comment -

        Hmm I'm confused by the changes to AppendingCodec: its not really a codec-wrapper
        its just coded that way.

        Actually the whole thing is very dependent on the implementation details of the current
        term dictionary right? thats the only place that currently seeks, where it changes
        the behavior.

        so we have a number of alternatives:

        • AppendingCodec extends Lucene40Codec and returns AppendingPF for every field.
        • remove AppendingCodec entirely (just move to test-framework), because someone could do the above themselves.
        Show
        Robert Muir added a comment - Hmm I'm confused by the changes to AppendingCodec: its not really a codec-wrapper its just coded that way. Actually the whole thing is very dependent on the implementation details of the current term dictionary right? thats the only place that currently seeks, where it changes the behavior. so we have a number of alternatives: AppendingCodec extends Lucene40Codec and returns AppendingPF for every field. remove AppendingCodec entirely (just move to test-framework), because someone could do the above themselves.
        Hide
        Robert Muir added a comment -

        I also don't like the .dlg file etc. I think if you want to record the inner codec name
        you should just read/write a SegmentInfo attribute.

        Show
        Robert Muir added a comment - I also don't like the .dlg file etc. I think if you want to record the inner codec name you should just read/write a SegmentInfo attribute.
        Hide
        Robert Muir added a comment -

        Thinking about this more, in my opinion it could be overkill for e.g. AppendingPF to do this.

        We should forget about AppendingPF and AppendingCodec and just look at what it really is:
        a wrapper around BlockTree.

        In my opinion the easy win here is to ensure that if we change BlockTree, Appending correctly
        gets IndexTooOld/IndexTooNew exceptions.

        Show
        Robert Muir added a comment - Thinking about this more, in my opinion it could be overkill for e.g. AppendingPF to do this. We should forget about AppendingPF and AppendingCodec and just look at what it really is: a wrapper around BlockTree. In my opinion the easy win here is to ensure that if we change BlockTree, Appending correctly gets IndexTooOld/IndexTooNew exceptions.
        Hide
        Adrien Grand added a comment -

        AppendingCodec extends Lucene40Codec and returns AppendingPF for every field.

        But I think we would like it to extend Lucene43Codec when we release it and this will break back compat for Appending?

        remove AppendingCodec entirely (just move to test-framework), because someone could do the above themselves.

        I like this idea better. People who want an appending codec just need to extend the last Lucene4xCodec.getPostingsFormatForField and there will be no back compat issue. Maybe we should just advertise in Lucene4xCodec javadoc that this is how to obtain an appending codec.

        +1 for your patch

        Show
        Adrien Grand added a comment - AppendingCodec extends Lucene40Codec and returns AppendingPF for every field. But I think we would like it to extend Lucene43Codec when we release it and this will break back compat for Appending? remove AppendingCodec entirely (just move to test-framework), because someone could do the above themselves. I like this idea better. People who want an appending codec just need to extend the last Lucene4xCodec.getPostingsFormatForField and there will be no back compat issue. Maybe we should just advertise in Lucene4xCodec javadoc that this is how to obtain an appending codec. +1 for your patch
        Hide
        Robert Muir added a comment -

        I like this idea better. People who want an appending codec just need to extend the last Lucene4xCodec.getPostingsFormatForField and there will be no back compat issue. Maybe we should just advertise in Lucene4xCodec javadoc that this is how to obtain an appending codec.

        Right my only hesitation before here was 'exposing our guts' a bit too much. For example, this scheme breaks if in lucene 4.2
        you write a fancy new default stored fields format that needs to seek-on-write.

        But I think we shouldnt worry about this: we can just address it as it comes (and if a Codec makes sense then, lets deal with it).

        Its like the analysis package, if we implement something as a TokenFilter thats ok. We don't necessarily need to hide ourselves
        by providing an Analyzer too. If we want to make it a Tokenizer later because that makes more sense, then we just do that

        Otherwise we will have a lot of delegates and complex code for only theoretical future situations and I think we should avoid this.

        Show
        Robert Muir added a comment - I like this idea better. People who want an appending codec just need to extend the last Lucene4xCodec.getPostingsFormatForField and there will be no back compat issue. Maybe we should just advertise in Lucene4xCodec javadoc that this is how to obtain an appending codec. Right my only hesitation before here was 'exposing our guts' a bit too much. For example, this scheme breaks if in lucene 4.2 you write a fancy new default stored fields format that needs to seek-on-write. But I think we shouldnt worry about this: we can just address it as it comes (and if a Codec makes sense then, lets deal with it). Its like the analysis package, if we implement something as a TokenFilter thats ok. We don't necessarily need to hide ourselves by providing an Analyzer too. If we want to make it a Tokenizer later because that makes more sense, then we just do that Otherwise we will have a lot of delegates and complex code for only theoretical future situations and I think we should avoid this.
        Hide
        Michael McCandless added a comment -

        +1 for the patch (correctly detecting format changes & throwing TooNew/Old) and for moving AppendingCodec to test-framework.

        Show
        Michael McCandless added a comment - +1 for the patch (correctly detecting format changes & throwing TooNew/Old) and for moving AppendingCodec to test-framework.
        Hide
        Adrien Grand added a comment -

        I think we just need to look at how we can refactor this thing so it works the way we want.

        What about having a public subclass of PostingsFormat that would enforce the decoupling of fields/terms and postings? Pulsing could only wrap instances of this subclass and would throw an exception when loading the segment if the declared delegate postings format does not extend this class. (Just thinking out loud...)

        Show
        Adrien Grand added a comment - I think we just need to look at how we can refactor this thing so it works the way we want. What about having a public subclass of PostingsFormat that would enforce the decoupling of fields/terms and postings? Pulsing could only wrap instances of this subclass and would throw an exception when loading the segment if the declared delegate postings format does not extend this class. (Just thinking out loud...)
        Hide
        Adrien Grand added a comment -

        Robert Muir Maybe your last patch could/should be committed to 4.0?

        Show
        Adrien Grand added a comment - Robert Muir Maybe your last patch could/should be committed to 4.0?
        Hide
        Robert Muir added a comment -

        I feel like the safety might be a good idea (even though I hate the idea of doing it now).

        Would be good to see what anyone else thinks about this.

        I don't like that the Codec check is currently "false sense of security" if we change BlockTree.

        Show
        Robert Muir added a comment - I feel like the safety might be a good idea (even though I hate the idea of doing it now). Would be good to see what anyone else thinks about this. I don't like that the Codec check is currently "false sense of security" if we change BlockTree.
        Hide
        Robert Muir added a comment -

        Updated patch for 4.1:

        • fixes blocktree not to seek on write
        • changes are backwards compatible
        • we can deprecate/remove appending codec, we should just keep its test.

        Personally I think i like this better.

        Show
        Robert Muir added a comment - Updated patch for 4.1: fixes blocktree not to seek on write changes are backwards compatible we can deprecate/remove appending codec, we should just keep its test. Personally I think i like this better.
        Hide
        Adrien Grand added a comment -
        +        if (indexVersion != version) {
        +          throw new CorruptIndexException("mixmatched version files: " + in + "=" + version + "," + indexIn + "=" + indexVersion);
        +        }
        

        If we force the terms version and the terms index version to be the same, maybe we should merge TERMS_INDEX_VERSION_CURRENT and TERMS_VERSION_CURRENT into a single constant?

        Show
        Adrien Grand added a comment - + if (indexVersion != version) { + throw new CorruptIndexException("mixmatched version files: " + in + "=" + version + "," + indexIn + "=" + indexVersion); + } If we force the terms version and the terms index version to be the same, maybe we should merge TERMS_INDEX_VERSION_CURRENT and TERMS_VERSION_CURRENT into a single constant?
        Hide
        Robert Muir added a comment -

        We dont force them to be the same. But to date they have been, and its currently still the case, so i check it.

        in the future its possible we might need to bump something about just the terms index file or whatever.

        Show
        Robert Muir added a comment - We dont force them to be the same. But to date they have been, and its currently still the case, so i check it. in the future its possible we might need to bump something about just the terms index file or whatever.
        Hide
        Adrien Grand added a comment -

        If it makes sense to check that the versions of both files are consistent, maybe we should still merge the version numbers into a single one and bump it whenever any of the file formats changes? (similarly to Lucene40StoredFieldsWriter)

        Show
        Adrien Grand added a comment - If it makes sense to check that the versions of both files are consistent, maybe we should still merge the version numbers into a single one and bump it whenever any of the file formats changes? (similarly to Lucene40StoredFieldsWriter)
        Hide
        Robert Muir added a comment -

        I dont its a goal to check that they are really consistent? Really we can remove this check.

        I think every file should have its own name and version number. If XYZ writer has 10 files, then
        it should have 10 independent codec names and version numbers.

        This was a bug in this stuff before (LUCENE-3621).

        For the stored fields writer i'm not pedantic enough to go fix the fact it shares version numbers,
        but I'm just saying we really have to always look at this conceptually as a per-file thing.

        Otherwise we defeat a lot of the purpose of having a codec header at all (verifying this file really
        is what you think it is, and the version you think it is).

        Show
        Robert Muir added a comment - I dont its a goal to check that they are really consistent? Really we can remove this check. I think every file should have its own name and version number. If XYZ writer has 10 files, then it should have 10 independent codec names and version numbers. This was a bug in this stuff before ( LUCENE-3621 ). For the stored fields writer i'm not pedantic enough to go fix the fact it shares version numbers, but I'm just saying we really have to always look at this conceptually as a per-file thing. Otherwise we defeat a lot of the purpose of having a codec header at all (verifying this file really is what you think it is, and the version you think it is).
        Hide
        Adrien Grand added a comment -

        I dont its a goal to check that they are really consistent? Really we can remove this check.

        I actually started to like it. Having two different version numbers in the headers of these files would mean that something went really wrong...

        This was a bug in this stuff before (LUCENE-3621).

        I agree that codec names should never be shared, but I don't think this is a problem for version numbers.

        I started this discussion because I was surprised of the indexVersion == version test, but this is a detail, I think it is nice not to seek when it is not necessary so I am +1 for this patch.

        we can deprecate/remove appending codec, we should just keep its test.

        +1 for moving AppendingCodec from lucene-codecs to test-framework too

        Show
        Adrien Grand added a comment - I dont its a goal to check that they are really consistent? Really we can remove this check. I actually started to like it. Having two different version numbers in the headers of these files would mean that something went really wrong... This was a bug in this stuff before ( LUCENE-3621 ). I agree that codec names should never be shared, but I don't think this is a problem for version numbers. I started this discussion because I was surprised of the indexVersion == version test, but this is a detail, I think it is nice not to seek when it is not necessary so I am +1 for this patch. we can deprecate/remove appending codec, we should just keep its test. +1 for moving AppendingCodec from lucene-codecs to test-framework too
        Hide
        Robert Muir added a comment -

        I started this discussion because I was surprised of the indexVersion == version test, but this is a detail, I think it is nice not to seek when it is not necessary so I am +1 for this patch.

        Yeah, its a little funky, because of how there is only "one" seekDir, and this is protected so Appending can subclass it (we can remove this when Appending goes away).

        So that one seekDir (used for both files) needs to know what should happen: I suppose it could be 'boolean append' or something to be more clear.

        I didnt want to have an assert though, or it might invoke Uwe's wrath!

        Show
        Robert Muir added a comment - I started this discussion because I was surprised of the indexVersion == version test, but this is a detail, I think it is nice not to seek when it is not necessary so I am +1 for this patch. Yeah, its a little funky, because of how there is only "one" seekDir, and this is protected so Appending can subclass it (we can remove this when Appending goes away). So that one seekDir (used for both files) needs to know what should happen: I suppose it could be 'boolean append' or something to be more clear. I didnt want to have an assert though, or it might invoke Uwe's wrath!
        Hide
        Robert Muir added a comment -

        A more thorough patch for trunk. (Needs javadocs and minor cleanups etc etc, but all backwards compat and such).

        I would deprecate this stuff in branch_4x still.

        I would also, in branch_4x, include my original fix on this issue for not including blocktree's versioning. this itself can also be backwards compat (its just an if statement).

        Show
        Robert Muir added a comment - A more thorough patch for trunk. (Needs javadocs and minor cleanups etc etc, but all backwards compat and such). I would deprecate this stuff in branch_4x still. I would also, in branch_4x, include my original fix on this issue for not including blocktree's versioning. this itself can also be backwards compat (its just an if statement).
        Hide
        Michael McCandless added a comment -

        +1

        Show
        Michael McCandless added a comment - +1
        Hide
        Robert Muir added a comment -

        updated patch (just docs-only). If the .tip part looks confusing, its just because the current doc for the terms index format was missing any mention of DirOffset.

        Show
        Robert Muir added a comment - updated patch (just docs-only). If the .tip part looks confusing, its just because the current doc for the terms index format was missing any mention of DirOffset.
        Hide
        Robert Muir added a comment -

        Any other opinions on this patch?

        Show
        Robert Muir added a comment - Any other opinions on this patch?
        Hide
        Adrien Grand added a comment -

        +1 Maybe we should add a note in Lucene40Codec docs saying that this codec works on append-only file-systems?

        Show
        Adrien Grand added a comment - +1 Maybe we should add a note in Lucene40Codec docs saying that this codec works on append-only file-systems?
        Hide
        Michael McCandless added a comment -

        +1

        Show
        Michael McCandless added a comment - +1
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Robert Muir
        http://svn.apache.org/viewvc?view=revision&revision=1396049

        LUCENE-4399: Deprecate AppendingCodec

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1396049 LUCENE-4399 : Deprecate AppendingCodec
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development