Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff of LUCENE-2621. (Hoping we can do some of the renaming etc here in a rote way to make progress).

      Currently Codec.java only represents a portion of the index, but there are other parts of the index
      (stored fields, term vectors, fieldinfos, ...) that we want under codec control. There is also some
      inconsistency about what a Codec is currently, for example Memory and Pulsing are really just
      PostingsFormats, you might just apply them to a specific field. On the other hand, PreFlex actually
      is a Codec: it represents the Lucene 3.x index format (just not all parts yet). I imagine we would
      like SimpleText to be the same way.

      So, I propose restructuring the classes so that we have something like:

      • CodecProvider <-- dead, replaced by java ServiceProvider mechanism. All indexes are 'readable' if codecs are in classpath.
      • Codec <-- represents the index format (PostingsFormat + FieldsFormat + ...)
      • PostingsFormat: this is what Codec controls today, and Codec will return one of these for a field.
      • FieldsFormat: Stored Fields + Term Vectors + FieldInfos?

      I think for PreFlex, it doesnt make sense to expose its PostingsFormat as a 'public' class, because preflex
      can never be per-field so there is no use in allowing you to configure PreFlex for a specific field.
      Similarly, I think in the future we should do the same thing for SimpleText. Nobody needs SimpleText for production, it should
      just be a Codec where we try to make as much of the index as plain text and simple as possible for debugging/learning/etc.
      So we don't need to expose its PostingsFormat. On the other hand, I don't think we need Pulsing or Memory codecs,
      because its pretty silly to make your entire index use one of their PostingsFormats. To parallel with analysis:
      PostingsFormat is like Tokenizer and Codec is like Analyzer, and we don't need Analyzers to "show off" every Tokenizer.

      we can also move the baked in PerFieldCodecWrapper out (it would basically be PerFieldPostingsFormat). Privately it would
      write the ids to the file like it does today. in the future, all 3.x hairy backwards code would move to PreflexCodec.
      SimpleTextCodec would get a plain text fieldinfos impl, etc.

      1. LUCENE-3490_SPI.patch
        23 kB
        Uwe Schindler
      2. LUCENE-3490.patch
        1.75 MB
        Robert Muir
      3. lucene2621-trunk.patch
        534 kB
        selckin
      4. lucene2621-trunk-2.patch
        527 kB
        selckin
      5. lucene2621-trunk-3.patch
        559 kB
        selckin
      6. LUCENE-3490_reintegrate.patch
        328 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          thanks again for the help here guys.

          Show
          Robert Muir added a comment - thanks again for the help here guys.
          Hide
          Robert Muir added a comment -

          Thanks for the policework Uwe... I think I would like to merge to trunk later today.

          If anyone has objections just please list them, or even afterwards.

          I would rather back it out later if needed than maintain this fork much longer.

          Show
          Robert Muir added a comment - Thanks for the policework Uwe... I think I would like to merge to trunk later today. If anyone has objections just please list them, or even afterwards. I would rather back it out later if needed than maintain this fork much longer.
          Hide
          Uwe Schindler added a comment -

          +1 I see no merge problems, so no unrelated changes caused by missing merges. This is not a code review, of course.

          Show
          Uwe Schindler added a comment - +1 I see no merge problems, so no unrelated changes caused by missing merges. This is not a code review, of course.
          Hide
          Robert Muir added a comment -

          here is svn diff --no-diff-deleted after a merge --reintegrate.

          (so uwe can police)

          Show
          Robert Muir added a comment - here is svn diff --no-diff-deleted after a merge --reintegrate. (so uwe can police)
          Hide
          Robert Muir added a comment -

          ok this one is pretty good for review, thanks!

          instead before where it connected unrelated files across a rename, this one just shows
          the complete new file (as if it was created from scratch)... but thats fine for review here

          Show
          Robert Muir added a comment - ok this one is pretty good for review, thanks! instead before where it connected unrelated files across a rename, this one just shows the complete new file (as if it was created from scratch)... but thats fine for review here
          Hide
          selckin added a comment -

          Strange, i did check a hunk at one point with the svn diff to make sure it was in the right order (i always mess it up),

          Show
          selckin added a comment - Strange, i did check a hunk at one point with the svn diff to make sure it was in the right order (i always mess it up),
          Hide
          Robert Muir added a comment -

          ahh thanks... but this patch is 'reversed' (can it be changed to trunk->2621) ?

          Show
          Robert Muir added a comment - ahh thanks... but this patch is 'reversed' (can it be changed to trunk->2621) ?
          Hide
          selckin added a comment -

          yeah i turned on really aggressive rename detection, only 20% of the file needs to be the same, can tone it down a bit

          Show
          selckin added a comment - yeah i turned on really aggressive rename detection, only 20% of the file needs to be the same, can tone it down a bit
          Hide
          Robert Muir added a comment -

          thanks selckin, the problem is that patch is a little misleading in some ways:
          e.g.

          diff --git a/lucene/src/java/org/apache/lucene/index/codecs/lucene3x/Lucene3xCodec.java b/solr/core/src/java/org/apache/solr/core/SchemaCodecProvider.java
          

          These two files are totally unrelated! which one of those options to git (die git, die) is short for --pull-shit-out-of-my-ass?!

          Show
          Robert Muir added a comment - thanks selckin, the problem is that patch is a little misleading in some ways: e.g. diff --git a/lucene/src/java/org/apache/lucene/index/codecs/lucene3x/Lucene3xCodec.java b/solr/core/src/java/org/apache/solr/core/SchemaCodecProvider.java These two files are totally unrelated! which one of those options to git (die git, die) is short for --pull-shit-out-of-my-ass?!
          Hide
          selckin added a comment -

          seems mostly readable to me

          (git diff -D -B --minimal -C -M20 origin/lucene2621..trunk > lucene2621-trunk.patch)

          Show
          selckin added a comment - seems mostly readable to me (git diff -D -B --minimal -C -M20 origin/lucene2621..trunk > lucene2621-trunk.patch)
          Hide
          Uwe Schindler added a comment -

          Does not look better. It puts all moves into the patch, but still thousands of files ----- and files ++++.

          Useless for review.

          Show
          Uwe Schindler added a comment - Does not look better. It puts all moves into the patch, but still thousands of files ----- and files ++++. Useless for review.
          Hide
          Uwe Schindler added a comment -

          I'll merge and try, give me some time...

          Show
          Uwe Schindler added a comment - I'll merge and try, give me some time...
          Hide
          Robert Muir added a comment -

          I have Svn 1.7.1, there its much better. Should I create one after merging?

          If it can make a readable patch that would be great? the problem is moving src/test-framework/* to src/test-framework/java, and renaming preflex to lucene3x, etc, etc.

          Show
          Robert Muir added a comment - I have Svn 1.7.1, there its much better. Should I create one after merging? If it can make a readable patch that would be great? the problem is moving src/test-framework/* to src/test-framework/java, and renaming preflex to lucene3x, etc, etc.
          Hide
          Robert Muir added a comment -

          patch from dev-tools/scripts/diff-sources.py

          Show
          Robert Muir added a comment - patch from dev-tools/scripts/diff-sources.py
          Hide
          Uwe Schindler added a comment -

          I have Svn 1.7.1, there its much better. Should I create one after merging?

          Show
          Uwe Schindler added a comment - I have Svn 1.7.1, there its much better. Should I create one after merging?
          Hide
          Robert Muir added a comment -

          I'll create a diff. it will look ugly though, because of some svn moves we did.

          Show
          Robert Muir added a comment - I'll create a diff. it will look ugly though, because of some svn moves we did.
          Hide
          Uwe Schindler added a comment -

          +1 to merge back, maybe as always create a diff for review.

          I was only doing the SPI stuff so most work was done by Robert and Mike. I am mostly happy about the fact that you can plugins any codec to the classpath and any index can be read after that. This is how Java should work.

          Show
          Uwe Schindler added a comment - +1 to merge back, maybe as always create a diff for review. I was only doing the SPI stuff so most work was done by Robert and Mike. I am mostly happy about the fact that you can plugins any codec to the classpath and any index can be read after that. This is how Java should work.
          Hide
          Andrzej Bialecki added a comment -

          Awesome work, guys! +1 to merging it to trunk.

          Show
          Andrzej Bialecki added a comment - Awesome work, guys! +1 to merging it to trunk.
          Hide
          Robert Muir added a comment -

          tweak description to reality

          Show
          Robert Muir added a comment - tweak description to reality
          Hide
          Robert Muir added a comment -

          All the nocommits have been cleared up, the branch is synced up with trunk, and I think is ready for merging back.

          Thanks for all the help here Mike and Uwe!

          In my opinion we should merge this refactoring back, and then continue on with issues like LUCENE-2621 (making stored fields more extensible), etc.

          Show
          Robert Muir added a comment - All the nocommits have been cleared up, the branch is synced up with trunk, and I think is ready for merging back. Thanks for all the help here Mike and Uwe! In my opinion we should merge this refactoring back, and then continue on with issues like LUCENE-2621 (making stored fields more extensible), etc.
          Hide
          Uwe Schindler added a comment -

          I removed CodecProvider again in revision: 1195963

          The new code uses a simple loader mechanism that can load any no-arg ctor based Codec or PostingsFormat using a new interface NamedSPI that simply defines the getName() method.

          Show
          Uwe Schindler added a comment - I removed CodecProvider again in revision: 1195963 The new code uses a simple loader mechanism that can load any no-arg ctor based Codec or PostingsFormat using a new interface NamedSPI that simply defines the getName() method.
          Hide
          Uwe Schindler added a comment -

          remains to factor out all these little embedded codecs from tests and move them to test-framework, register them, etc.

          You can also leave them inside the tests and add a meta-inf there.

          Thanks for doing all this work Uwe... its a really good step forward... if you feel like doing the simplification I am all for it

          I think I will do it and require all codecs to have no-arg Ctor. The whole spi.* package/folder can then go away. This reduces the amount of code to 20% and makes it easier for users implementing own codecs. They just have to list the class names in META-INF. We will have 2 meta-inf files: META-INF/services/o.a.l.index.codecs.Codec and META-INF/services/o.a.l.index.codecs.PostingsFormat that list all classes - done. No more magic needed.

          I will change the branch to use this and add a Lucene40Pulsing codec (named "Lucene40Pulsing") based on abstract PulsingCodec.

          Show
          Uwe Schindler added a comment - remains to factor out all these little embedded codecs from tests and move them to test-framework, register them, etc. You can also leave them inside the tests and add a meta-inf there. Thanks for doing all this work Uwe... its a really good step forward... if you feel like doing the simplification I am all for it I think I will do it and require all codecs to have no-arg Ctor. The whole spi.* package/folder can then go away. This reduces the amount of code to 20% and makes it easier for users implementing own codecs. They just have to list the class names in META-INF. We will have 2 meta-inf files: META-INF/services/o.a.l.index.codecs.Codec and META-INF/services/o.a.l.index.codecs.PostingsFormat that list all classes - done. No more magic needed. I will change the branch to use this and add a Lucene40Pulsing codec (named "Lucene40Pulsing") based on abstract PulsingCodec.
          Hide
          Robert Muir added a comment -

          Robert: It's your turn now for test-framework, you must maybe move test-framework classpath before the core classpath to override the default 3x codec to be rw

          This works! I added TestImpersonation, and it passes.

          remains to factor out all these little embedded codecs from tests and move them to test-framework, register them, etc.

          Show
          Robert Muir added a comment - Robert: It's your turn now for test-framework, you must maybe move test-framework classpath before the core classpath to override the default 3x codec to be rw This works! I added TestImpersonation, and it passes. remains to factor out all these little embedded codecs from tests and move them to test-framework, register them, etc.
          Hide
          Robert Muir added a comment -

          If we decide for Codecs, that the part that can read an index is always implemented by only one class (PulsingPostingsFormat is currently the only problematic one) with a default ctor,

          +1

          In my opinion (i added nocommit already), its fine for Pulsing(PostingsFormat wrapped) to be abstract. But we should have Pulsing40 or something that is a concrete implementation.

          All codecs/postings formats should be able to be instantiated with a no-arg ctor for reading. If they took parameters for writing, they should also store these in the index so that the index documents what its format is... this is really important. Currently we do this with all parameters, we just need to fix pulsing. If someone wants to make PulsingHuperCodec we allow them to do this, they just make a concrete one and they are done. I think we need only provide Pulsing40 out of box and just leave the abstract wrapper API available for anyone to use themselves.

          Thanks for doing all this work Uwe... its a really good step forward... if you feel like doing the simplification I am all for it

          Show
          Robert Muir added a comment - If we decide for Codecs, that the part that can read an index is always implemented by only one class (PulsingPostingsFormat is currently the only problematic one) with a default ctor, +1 In my opinion (i added nocommit already), its fine for Pulsing(PostingsFormat wrapped) to be abstract. But we should have Pulsing40 or something that is a concrete implementation. All codecs/postings formats should be able to be instantiated with a no-arg ctor for reading. If they took parameters for writing, they should also store these in the index so that the index documents what its format is... this is really important. Currently we do this with all parameters, we just need to fix pulsing. If someone wants to make PulsingHuperCodec we allow them to do this, they just make a concrete one and they are done. I think we need only provide Pulsing40 out of box and just leave the abstract wrapper API available for anyone to use themselves. Thanks for doing all this work Uwe... its a really good step forward... if you feel like doing the simplification I am all for it
          Hide
          Uwe Schindler added a comment -

          Just some comments:
          The current impl is identical to how nio.Charset and nio.spi.CharsetProvider works in JDK's NIO classes. This differs from TIKA, where no "*Provider" class is used and all Parsers are listed in META-INF.

          For nio.Charsets we generally have lots of different charset names and many of them can be implemented by same classes with different parameters (e.g. there is a single class for all ISO8859 charsets that just gets the table name as ctor param).

          If we decide for Codecs, that the part that can read an index is always implemented by only one class (PulsingPostingsFormat is currently the only problematic one) with a default ctor, we can remove the whole CodecProvider interface and its implementations and instead use ServiceLoader<Codec> and ServiceLoader<PostigsFormat> and list all codec classes in META-INF separately (and not only the provider class). In that case the CodecLoader static helper class would initialize the map name->Codec on init and will do lookup and list of all codec names using that map. This would simplyfy the implementation, but would remove the possibility to e.g. encode parameters into codec names (e.g., "pulsingLucene40" -> new PulsingPostingsFormat(new Lucene40PostingsFormat()) versus "pulsingPFOR" -> new PulsingPostingsFormat(new PFORDeltaPostingsFormat()).

          Any suggestions? If we want to simplify and remove CodecProvider, I can recode this with minimal effort. In that case, the Chicken-And-Egg problem in Lucene40Codec using PostingsFormat.forName() would also be solved [its the same problem like the Java7-ICU-bug]

          Show
          Uwe Schindler added a comment - Just some comments: The current impl is identical to how nio.Charset and nio.spi.CharsetProvider works in JDK's NIO classes. This differs from TIKA, where no "*Provider" class is used and all Parsers are listed in META-INF. For nio.Charsets we generally have lots of different charset names and many of them can be implemented by same classes with different parameters (e.g. there is a single class for all ISO8859 charsets that just gets the table name as ctor param). If we decide for Codecs, that the part that can read an index is always implemented by only one class (PulsingPostingsFormat is currently the only problematic one) with a default ctor, we can remove the whole CodecProvider interface and its implementations and instead use ServiceLoader<Codec> and ServiceLoader<PostigsFormat> and list all codec classes in META-INF separately (and not only the provider class). In that case the CodecLoader static helper class would initialize the map name->Codec on init and will do lookup and list of all codec names using that map. This would simplyfy the implementation, but would remove the possibility to e.g. encode parameters into codec names (e.g., "pulsingLucene40" -> new PulsingPostingsFormat(new Lucene40PostingsFormat()) versus "pulsingPFOR" -> new PulsingPostingsFormat(new PFORDeltaPostingsFormat()). Any suggestions? If we want to simplify and remove CodecProvider, I can recode this with minimal effort. In that case, the Chicken-And-Egg problem in Lucene40Codec using PostingsFormat.forName() would also be solved [its the same problem like the Java7-ICU-bug]
          Hide
          Uwe Schindler added a comment -

          Committed lucene2621-branch revision 1195725.

          Robert: It's your turn now for test-framework, you must maybe move test-framework classpath before the core classpath to override the default 3x codec to be rw

          Show
          Uwe Schindler added a comment - Committed lucene2621-branch revision 1195725. Robert: It's your turn now for test-framework, you must maybe move test-framework classpath before the core classpath to override the default 3x codec to be rw
          Hide
          Uwe Schindler added a comment -

          SPI version of CodecProvider. I cannot commit this to the branch as SVN is not working... trying...

          Show
          Uwe Schindler added a comment - SPI version of CodecProvider. I cannot commit this to the branch as SVN is not working... trying...
          Hide
          Uwe Schindler added a comment -

          We should make the API like for Charsets, just replace Charset by Codec

          The Codec class gets a static method "Codec.forName(String name)", that is invoked by SegmentReader and returns the Codec instance with that name. Internally it uses java.util.ServiceLoader<org.apache.lucene.index.codec.spi.CodecProvider> to lookup codecs. Every JAR file could implement one or more org.apache.lucene.index.codecs.spi.CodecProvider that supply a lookup method and iterator (like spi.CharsetProvider). CodecProvider is just an internal class (expert) that needs to be implemented by JAR file manufacturers. For Lucene there would be one impl named CoreCodecProvider, the resource file would be located at src/resources/META-INF/services/org.apache.lucene.index.codecs.spi.CodecProvider containing only the binary class name of CoreCodecProvider.

          Contrib/misc would provide one of those files at same location, but listing ContribMiscCodecProvider as impl. We can do the same for PostingsFormats and other things in the index that needs to be looked up by name. This could be done all by CodecProvider, it could also provide lookup methods for postings,.... (this minimizes implementation cost).

          If we want to keep Preflex codec read-only in core, but RW in test-framework, the trick is simple: Just list test-framework classes/JAR before the core classes/JAR in classpath -> the test-framework CodecProvider would be seen first by ServiceLoader and take care of name "preflex", core's CodecProvider would not even be asked (as it comes later in classpath).

          Show
          Uwe Schindler added a comment - We should make the API like for Charsets, just replace Charset by Codec The Codec class gets a static method "Codec.forName(String name)", that is invoked by SegmentReader and returns the Codec instance with that name. Internally it uses java.util.ServiceLoader<org.apache.lucene.index.codec.spi.CodecProvider> to lookup codecs. Every JAR file could implement one or more org.apache.lucene.index.codecs.spi.CodecProvider that supply a lookup method and iterator (like spi.CharsetProvider). CodecProvider is just an internal class (expert) that needs to be implemented by JAR file manufacturers. For Lucene there would be one impl named CoreCodecProvider, the resource file would be located at src/resources/META-INF/services/org.apache.lucene.index.codecs.spi.CodecProvider containing only the binary class name of CoreCodecProvider. Contrib/misc would provide one of those files at same location, but listing ContribMiscCodecProvider as impl. We can do the same for PostingsFormats and other things in the index that needs to be looked up by name. This could be done all by CodecProvider, it could also provide lookup methods for postings,.... (this minimizes implementation cost). If we want to keep Preflex codec read-only in core, but RW in test-framework, the trick is simple: Just list test-framework classes/JAR before the core classes/JAR in classpath -> the test-framework CodecProvider would be seen first by ServiceLoader and take care of name "preflex", core's CodecProvider would not even be asked (as it comes later in classpath).
          Hide
          Robert Muir added a comment -

          Thanks Uwe, I think once we get tests passing in the branch we should look at this. Maybe we merge it, and just recreate the branch to fix it, or
          maybe we fix it before merging... doesn't matter to me.

          I think this would simplify a lot of stuff in lucene too:

          • tests-framework currently contains additional codecs that are used only in testing, this would work much nicer
            because when running tests the test codecs would get loaded.
          • luke would be able to read your index, as long as you had the right stuff in classpath. e.g. if i am debugging
            a test fail and I want to inspect the index with luke, it should all just work as the tests codecs would be available.
          • we could remove codecprovider arguments to all these reading apis (e.g. IndexReader). This is dumb right now,
            its almost like fixing default charset problems trying to track down how many places are using CodecProvider.getDefault(),
            when you have some custom codecs and you only want the String->Codec mappings to be visible so that your index can be read!
          Show
          Robert Muir added a comment - Thanks Uwe, I think once we get tests passing in the branch we should look at this. Maybe we merge it, and just recreate the branch to fix it, or maybe we fix it before merging... doesn't matter to me. I think this would simplify a lot of stuff in lucene too: tests-framework currently contains additional codecs that are used only in testing, this would work much nicer because when running tests the test codecs would get loaded. luke would be able to read your index, as long as you had the right stuff in classpath. e.g. if i am debugging a test fail and I want to inspect the index with luke, it should all just work as the tests codecs would be available. we could remove codecprovider arguments to all these reading apis (e.g. IndexReader). This is dumb right now, its almost like fixing default charset problems trying to track down how many places are using CodecProvider.getDefault(), when you have some custom codecs and you only want the String->Codec mappings to be visible so that your index can be read!
          Hide
          Uwe Schindler added a comment - - edited

          A good example how this could work is Unicode Policeman's java.nio.charset.Charset and java.nio.charset.spi.CharsetProvider. We can copy lots of stuff from there. The lookup of a codec by string-name is exactly like looking up a charset by name. Also for Luke we can provide the iterator of all codecs available, that appends all registered SPIs codecs. And so on.

          Once Robert have cleaned up the branch I will happily help here. The code is already in my head, I just need a working base to implement it.

          NOTE: As we are on Java 1.6 for trunk, we can use the now public java.util.ServiceLoader discovery and don't need to use sun.misc or java.awt.image.spi classes.

          Show
          Uwe Schindler added a comment - - edited A good example how this could work is Unicode Policeman's java.nio.charset.Charset and java.nio.charset.spi.CharsetProvider. We can copy lots of stuff from there. The lookup of a codec by string-name is exactly like looking up a charset by name. Also for Luke we can provide the iterator of all codecs available, that appends all registered SPIs codecs. And so on. Once Robert have cleaned up the branch I will happily help here. The code is already in my head, I just need a working base to implement it. NOTE: As we are on Java 1.6 for trunk, we can use the now public java.util.ServiceLoader discovery and don't need to use sun.misc or java.awt.image.spi classes.
          Hide
          Robert Muir added a comment -

          Another thing, I would like to remove CodecProvider completely.

          I think Uwe has a good idea in LUCENE-2915.

          The idea would be that we use this for index reading, when we read the
          codec name from the segments file, we use the ServiceLoader mechanism
          to map that back to a Codec. Parameters should be no problem, when reading
          all codecs should be able to use a no-arg ctor and any necessary parameters
          should really be written into the index (otherwise the index does not document
          how it should be read!)

          For writing, I think IndexWriter would just optionally take Codec, what is to
          be used for writing new segments.

          Show
          Robert Muir added a comment - Another thing, I would like to remove CodecProvider completely. I think Uwe has a good idea in LUCENE-2915 . The idea would be that we use this for index reading, when we read the codec name from the segments file, we use the ServiceLoader mechanism to map that back to a Codec. Parameters should be no problem, when reading all codecs should be able to use a no-arg ctor and any necessary parameters should really be written into the index (otherwise the index does not document how it should be read!) For writing, I think IndexWriter would just optionally take Codec, what is to be used for writing new segments.

            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development