Lucene - Core / LUCENE-2915

make CoreCodecProvider convenience class so apps can easily pick per-field codecs

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      We already have DefaultCodecProvider, which simply registers all core codecs and uses Standard for all fields, but it's package private.

      We should make this public, and name it CoreCodecProvider.

      1. LUCENE-2915.patch
        4 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Simple patch.. I'll commit shortly.

        Uwe Schindler added a comment (edited) -

        Hi Mike,

        I would prefer to remove the hardcoded core providers from the source code. Java has a standard mechanism (the so-called Service Provider framework) that can be used to find all codecs that ship with the JAR files on the classpath. This makes it easy to add custom codecs: you just add the JAR file to the classpath and the codec is available.

        If you like, I could write the lookup code. Unfortunately it has only been "standardized" since Java 6, but it has been available in a different public class since Java 1.2. It is mainly used by:

        • XML, XSLT (all of javax.xml)
        • image formats (png, gif,..)
        • ...

        In general it works very simply:
        The JAR file contains a MANIFEST that lists all classes that implement a codec under a key that is the class name of the abstract base class. A simple example: if you put xercesImpl.jar on your classpath, its manifest contains an entry like javax.xml.dom.DocumentBuilder=someClass. Based on this information, DocumentBuilderFactory returns a suitable implementation of a DOM parser. The same would work for Lucene: the MANIFEST of lucene-core.jar would contain a simple list of classes (all of them are returned to the provider!). If you then add the contrib-misc JAR file, the AppendingOnlyCodec would automatically be available as well.

        Implementation is quite simple: you ask the service provider API for the above key (in our case an oal.index.Codec-like one) and the codec provider returns an Iterator of implementation classes. Those would get registered on startup of DefaultCodecProvider.
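
        The lookup described above could be sketched roughly as follows with java.util.ServiceLoader, the Java 6 API. This is only an illustration: Codec and CodecRegistry here are hypothetical stand-ins, not Lucene's actual classes. With ServiceLoader the list of implementations lives in a provider-configuration file under META-INF/services/, named after the service type, and every listed class needs a public no-arg constructor.

```java
import java.util.Map;
import java.util.ServiceLoader;
import java.util.TreeMap;

public class CodecDiscoverySketch {

    /** Hypothetical stand-in for the abstract codec base class; not Lucene's real API. */
    public abstract static class Codec {
        /** Name under which the codec would be registered. */
        public abstract String name();
    }

    /** Hypothetical registry playing the role of the codec provider. */
    public static final class CodecRegistry {
        private final Map<String, Codec> byName = new TreeMap<>();

        public void register(Codec codec) {
            byName.put(codec.name(), codec);
        }

        public Codec lookup(String name) {
            Codec codec = byName.get(name);
            if (codec == null) {
                throw new IllegalArgumentException(
                    "No codec named '" + name + "' found on the classpath");
            }
            return codec;
        }

        public int size() {
            return byName.size();
        }

        /**
         * Registers every implementation listed in a provider-configuration file
         * (META-INF/services/<binary name of Codec>) in any JAR on the classpath.
         */
        public static CodecRegistry discover() {
            CodecRegistry registry = new CodecRegistry();
            // ServiceLoader instantiates each provider lazily via its no-arg constructor.
            for (Codec codec : ServiceLoader.load(Codec.class)) {
                registry.register(codec);
            }
            return registry;
        }
    }

    public static void main(String[] args) {
        CodecRegistry registry = CodecRegistry.discover();
        System.out.println("Discovered codecs: " + registry.size());
    }
}
```

        Because the providers are created through a no-arg constructor, this approach alone gives no place to pass constructor arguments to a codec.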

        Uwe Schindler added a comment -

        By the way, Apache TIKA also uses this mechanism for plugging in its list of document parsers. We could, e.g., copy their implementation.

        Uwe Schindler added a comment -

        See this commit from TIKA-317: http://svn.apache.org/viewvc?view=revision&revision=911225

        It also uses the Java ImageIO service provider helper classes available since Java 1.2 (and yes, also available in Android and Harmony!)

        Michael McCandless added a comment -

        This sounds like a great idea! (Automatically discovering external CPs contained in JARs in the classpath).

        If we can get that working, then I suppose we wouldn't need to hard-code our core codecs? I.e., they'd "naturally" be discovered since they are in the core JAR.

        How would it work for codecs that take args to their ctor? E.g., pulsing takes an int cutoff (terms with <= that many positions are inlined into the terms dict).

        I think this should be a new issue?

        Uwe Schindler added a comment -

        Parameters to codecs are a problem. The instantiation is done by the Java SPI API (see the TIKA commit). In general, maybe we should only register the codecs that are needed to open existing indexes. E.g., if you have an index that requires an ExternalSmartHuperDuperCodec, IndexWriter/Reader should complain when opening this index. The user should then simply add the required JAR file, after which it should be possible to open the index.

        In my opinion IndexReader should not take any config for this, but IndexWriter should maybe take a per-field config (which codec for which field). The whole codec configuration would then suddenly get much easier. Parameters like the pulsing parameter should be given to IndexWriter in this configuration, but IndexReader should be able to read any index, as long as all referenced codecs are on the classpath.

        Am I missing something? But I agree, maybe we should open another issue. I can help with the SPI impl and the ANT manifest changes.
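
        To make the proposed split concrete, here is a minimal sketch under stated assumptions: WriterCodecConfig and requireCodecs are hypothetical names invented for illustration, not Lucene API. The writer side holds the per-field choice and any codec parameters. The reader side takes no configuration and only checks that every codec name recorded in the index is available.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class PerFieldCodecSketch {

    /** Hypothetical writer-side configuration: which codec name to use for which field. */
    public static final class WriterCodecConfig {
        private final String defaultCodec;
        private final Map<String, String> codecByField = new HashMap<>();
        private final Map<String, Integer> intParams = new HashMap<>();

        public WriterCodecConfig(String defaultCodec) {
            this.defaultCodec = defaultCodec;
        }

        public WriterCodecConfig setFieldCodec(String field, String codecName) {
            codecByField.put(field, codecName);
            return this;
        }

        /** Codec parameters (e.g. a pulsing cutoff) are supplied on the writer side only. */
        public WriterCodecConfig setIntParam(String key, int value) {
            intParams.put(key, value);
            return this;
        }

        public String codecFor(String field) {
            String name = codecByField.get(field);
            return name != null ? name : defaultCodec;
        }

        public Integer intParam(String key) {
            return intParams.get(key);
        }
    }

    /** Reader side: fail with a clear message if the index references an unavailable codec. */
    public static void requireCodecs(Set<String> namesInIndex, Set<String> onClasspath) {
        for (String name : namesInIndex) {
            if (!onClasspath.contains(name)) {
                throw new IllegalStateException(
                    "Index requires codec '" + name + "'; add its JAR to the classpath");
            }
        }
    }

    public static void main(String[] args) {
        WriterCodecConfig config = new WriterCodecConfig("Standard")
            .setFieldCodec("id", "Pulsing")
            .setIntParam("Pulsing.cutoff", 1);
        System.out.println("id -> " + config.codecFor("id"));
        System.out.println("body -> " + config.codecFor("body"));

        Set<String> available = new HashSet<>(Arrays.asList("Standard", "Pulsing"));
        requireCodecs(new HashSet<>(Arrays.asList("Standard", "Pulsing")), available);
    }
}
```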

        Uwe Schindler added a comment -

        Ah, by the way, in javax.xml you can pass key-value pairs to your XML parser to change its functionality. If the parser that was loaded by SPI does not support a given key, it throws an exception.
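
        As an illustration of that javax.xml behaviour, the following sketch passes a key-value pair to whichever DocumentBuilderFactory implementation the lookup returns. DocumentBuilderFactory.setAttribute is documented to throw IllegalArgumentException when the underlying implementation does not recognize the attribute. The key used in main is deliberately made up.

```java
import javax.xml.parsers.DocumentBuilderFactory;

public class ParserAttributeSketch {

    /** Returns true if the discovered factory implementation accepts the given key-value pair. */
    public static boolean supports(String key, Object value) {
        // newInstance() resolves the implementation through the provider lookup.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        try {
            factory.setAttribute(key, value);
            return true;
        } catch (IllegalArgumentException e) {
            // The implementation does not recognize this attribute.
            return false;
        }
    }

    public static void main(String[] args) {
        // Made-up key, used only to show the rejection path.
        String key = "http://example.invalid/made-up-option";
        System.out.println(key + " supported: " + supports(key, Boolean.TRUE));
    }
}
```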

        Robert Muir added a comment -

        We could create an alternative, more user-friendly codec provider impl as you describe, but still keep the old one?

        Michael McCandless added a comment -

        PerFieldPostingsFormat solved this.

        Uwe Schindler added a comment -

        Closed after release.


          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless