Lucene - Core
  1. Lucene - Core
  2. LUCENE-3560

add extra safety to concrete codec implementations

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      In LUCENE-3490, we reorganized the codec model, and a key part of this is that Codecs are "safer"
      and don't rely upon client-side configuration: IndexReader doesn't take Codec or anything of that
      nature, only IndexWriter.

      Instead for "read" all codecs are initialized from the classpath via a no-arg ctor from Java's
      Service Provider Mechanism.

      So, although Codecs can still take parameters in the constructors, be subclassable, etc (for passing
      to IndexWriter), this enforces that they must write any configuration information they need into
      the index, so that we don't have a flimsy API.

      I think we should go even further, for additional safety. Any methods on our concrete codecs that
      are not intended to be subclassed should be final, and we should add assertions to verify this.

      For example, SimpleText's files() implementation should be final. If you want to make an extension
      of simpletext that has additional files, then this is a different index format and should have a
      different name!

      Note: This doesn't stop extensibility, only stupid mistakes.
      For example, this means that Lucene40Codec's postingsFormat() implementation is final, even though
      it offers a configurable "hook" (getPostingsFormatForField) for you to specify per-field postings
      formats (which it writes into a .per file into the index, so that it knows how to read each field).

      private final PostingsFormat postingsFormat = new PerFieldPostingsFormat() {
        @Override
        public PostingsFormat getPostingsFormatForField(String field) {
          return Lucene40Codec.this.getPostingsFormatForField(field);
        }
      };
      
      ...
      
      @Override
      public final PostingsFormat postingsFormat() {
        return postingsFormat;
      }
      
      ...
      
        /** Returns the postings format that should be used for writing 
         *  new segments of <code>field</code>.
         *  
         *  The default implementation always returns "Lucene40"
         */
        public PostingsFormat getPostingsFormatForField(String field) {
          return defaultFormat;
        }
      

        Activity

        Hide
        Simon Willnauer added a comment -

        +1

        Show
        Simon Willnauer added a comment - +1
        Hide
        Uwe Schindler added a comment -

        Heavy reflecting in the good old TokenStream assertFinal (I was so unhappy when Analyzer was restructured that it had to go there... g).

        Some comments:

        • We should maybe also add a check that there is at least a default constructor available. this.getClass().getConstructor() does not throw exception
        • In general, subclassing a Codec or a PostingsFormat is wrong (except the Lucene3x hack). If you subclass a codec/PF, you can no longer change it's name. So anybody who subclasses a codec will produce a clone with the same name but perhaps another index format. This is prevented by Robert's finalness on the format hooks, but what else could a codec do different if its not final without breaking index format?
        • I think even 3x Codec should be final and not subclassed by the RW codec. The RW Preflex codec in tests should subclass abstract Codec, and simply delegate all "read" methods to the RO-Codec [I am not sure if this all works as its very complicated... *g* - I only mention: new Exception().getStackTrace() to inspect call stack... highly sophisticated!].
        Show
        Uwe Schindler added a comment - Heavy reflecting in the good old TokenStream assertFinal (I was so unhappy when Analyzer was restructured that it had to go there... g ). Some comments: We should maybe also add a check that there is at least a default constructor available. this.getClass().getConstructor() does not throw exception In general, subclassing a Codec or a PostingsFormat is wrong (except the Lucene3x hack). If you subclass a codec/PF, you can no longer change it's name. So anybody who subclasses a codec will produce a clone with the same name but perhaps another index format. This is prevented by Robert's finalness on the format hooks, but what else could a codec do different if its not final without breaking index format? I think even 3x Codec should be final and not subclassed by the RW codec. The RW Preflex codec in tests should subclass abstract Codec, and simply delegate all "read" methods to the RO-Codec [I am not sure if this all works as its very complicated... *g* - I only mention: new Exception().getStackTrace() to inspect call stack... highly sophisticated!] .
        Hide
        Robert Muir added a comment -

        This is prevented by Robert's finalness on the format hooks, but what else could a codec do different if its not final without breaking index format?

        A codec can take parameters or have a non-final "hook": just like the perfield stuff does... (and it records this stuff into the index, so its fine). This is similar to CharTokenizer

        I think even 3x Codec should be final and not subclassed by the RW codec. The RW Preflex codec in tests should subclass abstract Codec, and simply delegate all "read" methods to the RO-Codec [I am not sure if this all works as its very complicated... *g* - I only mention: new Exception().getStackTrace() to inspect call stack... highly sophisticated!].

        Yeah, it would be nice to clean up the preflex hack...

        Show
        Robert Muir added a comment - This is prevented by Robert's finalness on the format hooks, but what else could a codec do different if its not final without breaking index format? A codec can take parameters or have a non-final "hook": just like the perfield stuff does... (and it records this stuff into the index, so its fine). This is similar to CharTokenizer I think even 3x Codec should be final and not subclassed by the RW codec. The RW Preflex codec in tests should subclass abstract Codec, and simply delegate all "read" methods to the RO-Codec [I am not sure if this all works as its very complicated... *g* - I only mention: new Exception().getStackTrace() to inspect call stack... highly sophisticated!] . Yeah, it would be nice to clean up the preflex hack...
        Hide
        Andrzej Bialecki added a comment -

        Let's not go too far either... we want people to write and contribute new codecs, if we make the api too onerous to use then we risk a lot of copy&paste-ing (e.g. I'd like to extend an existing codec to add one file to files() - bummer, I have to reimplement the whole codec now). So let's leave extensibility where it's clear that stuff can be extended with no harm (or "no harm if you read the instructions").

        Show
        Andrzej Bialecki added a comment - Let's not go too far either... we want people to write and contribute new codecs, if we make the api too onerous to use then we risk a lot of copy&paste-ing (e.g. I'd like to extend an existing codec to add one file to files() - bummer, I have to reimplement the whole codec now). So let's leave extensibility where it's clear that stuff can be extended with no harm (or "no harm if you read the instructions").
        Hide
        Robert Muir added a comment -

        e.g. I'd like to extend an existing codec to add one file to files() - bummer, I have to reimplement the whole codec now

        But this is a must: you made a new index format, it needs a name to differentiate itself from the existing codec, otherwise it won't work.

        Show
        Robert Muir added a comment - e.g. I'd like to extend an existing codec to add one file to files() - bummer, I have to reimplement the whole codec now But this is a must: you made a new index format, it needs a name to differentiate itself from the existing codec, otherwise it won't work.
        Hide
        Robert Muir added a comment -

        So let's leave extensibility where it's clear that stuff can be extended with no harm (or "no harm if you read the instructions").

        And thats exactly what this patch does: it only allows extensibility where there is no harm. The problem is, in trunk today, no methods on any codecs are final (and some should be!).

        The places where there is no harm: e.g. getPostingsFormatForField, are still left open. This patch doesn't stop you from doing anything you can't already do today.

        It only stops code that was already doomed to fail at runtime from even compiling.

        Show
        Robert Muir added a comment - So let's leave extensibility where it's clear that stuff can be extended with no harm (or "no harm if you read the instructions"). And thats exactly what this patch does: it only allows extensibility where there is no harm. The problem is, in trunk today, no methods on any codecs are final (and some should be!). The places where there is no harm: e.g. getPostingsFormatForField, are still left open. This patch doesn't stop you from doing anything you can't already do today. It only stops code that was already doomed to fail at runtime from even compiling.
        Hide
        Uwe Schindler added a comment - - edited

        I'd like to extend an existing codec to add one file to files() - bummer, I have to reimplement the whole codec now

        The abstract base class Codec is as stupid simple as Analyzer. There is no logic in it, it just defines the following:

        • name of codec (which cannot be changed by subclassing!!!)
        • factory methods for the format readers/writers of the different parts of an index (postings, stored fields, segments file,...)

        If you want to create a new codec, you have to simply write this wrapper with a new name, otherwise SPI won't work.

        The first point in the above list is the real bummer in your "I only want to add one file" approach. If you would subclass the codec, the name cannot change anymore. This name is written to the index format. When you open IndexReader, it reads the name and uses Codec.forName() to lookup the codec. As the name was not changed in your subclass it would then use the superclass to read the index -> #fail

        Show
        Uwe Schindler added a comment - - edited I'd like to extend an existing codec to add one file to files() - bummer, I have to reimplement the whole codec now The abstract base class Codec is as stupid simple as Analyzer. There is no logic in it, it just defines the following: name of codec (which cannot be changed by subclassing!!!) factory methods for the format readers/writers of the different parts of an index (postings, stored fields, segments file,...) If you want to create a new codec, you have to simply write this wrapper with a new name, otherwise SPI won't work. The first point in the above list is the real bummer in your "I only want to add one file" approach. If you would subclass the codec, the name cannot change anymore. This name is written to the index format. When you open IndexReader, it reads the name and uses Codec.forName() to lookup the codec. As the name was not changed in your subclass it would then use the superclass to read the index -> #fail

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:

              Development