Solr
  1. Solr
  2. SOLR-752

Allow better Field Compression options

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      See http://lucene.markmail.org/message/sd4mgwud6caevb35?q=compression

      It would be good if Solr handled field compression outside of Lucene's Field.COMPRESS capabilities, since those capabilities are less than ideal when it comes to control over compression.

      1. compressedtextfield.patch
        15 kB
        Simon Jacob
      2. compressed_field.patch
        18 kB
        Kim Taylor

        Activity

        Hide
        David Smiley added a comment -

        I spent some time today attempting to implement this with my own Solr FieldType that extends TextField. As I tried to implement it, I realized that I couldn't really do it. FieldType has a method createField(...) that is necessary to implement in order to set binary data (i.e. byte[]) on a Field. This method demands I return a org.apache.lucene.document.Field which is final. If I create the field with binary data, by default it's not indexed or tokenized. I can get those booleans to flip by simply invoking f.setTokenStream(null). However, I can't set omitNorms() to false, nor can I set booleans for the term vector fields. There may be other issues but at this point I gave up to work on other more important priorities of mine.

        Show
        David Smiley added a comment - I spent some time today attempting to implement this with my own Solr FieldType that extends TextField. As I tried to implement it, I realized that I couldn't really do it. FieldType has a method createField(...) that is necessary to implement in order to set binary data (i.e. byte[]) on a Field. This method demands I return a org.apache.lucene.document.Field which is final. If I create the field with binary data, by default it's not indexed or tokenized. I can get those booleans to flip by simply invoking f.setTokenStream(null). However, I can't set omitNorms() to false, nor can I set booleans for the term vector fields. There may be other issues but at this point I gave up to work on other more important priorities of mine.
        Hide
        Yonik Seeley added a comment -

        We don't need to maintain strict back-compat on trunk... all methods that take and return Field objects should be changed to accept Fieldable. That might help.

        You might also want to look at how TrieField does things, which stores the numbers in binary while still indexing it.

        Show
        Yonik Seeley added a comment - We don't need to maintain strict back-compat on trunk... all methods that take and return Field objects should be changed to accept Fieldable. That might help. You might also want to look at how TrieField does things, which stores the numbers in binary while still indexing it.
        Hide
        David Smiley added a comment -

        I already looked at BinaryField and TrieField for inspiration. BinaryField assumes you're not going to index the data. And TrieField doesn't set binary data value on the Field.

        Yes, I think the next step is to make createField() return Fieldable. But I'm not a committer...

        Instead or in addition... I have to wonder, why not modify Lucene's Field class to allow me to set the Index, Store, and TermVecotr enums AND specify binary data on a suitable constructor? Arguably an existing constructor taking String would be hijaced to take Object and then do the right thing. That would be a small change, whereas implementing another subclass of AbstractField is more complex and would likely reproduce much of what's in Field already.

        Show
        David Smiley added a comment - I already looked at BinaryField and TrieField for inspiration. BinaryField assumes you're not going to index the data. And TrieField doesn't set binary data value on the Field. Yes, I think the next step is to make createField() return Fieldable. But I'm not a committer... Instead or in addition... I have to wonder, why not modify Lucene's Field class to allow me to set the Index, Store, and TermVecotr enums AND specify binary data on a suitable constructor? Arguably an existing constructor taking String would be hijaced to take Object and then do the right thing. That would be a small change, whereas implementing another subclass of AbstractField is more complex and would likely reproduce much of what's in Field already.
        Hide
        Simon Jacob added a comment -

        I have attempted to add support for compressed stored fields by implementing a new FieldType. ie:

        <fieldtype name="compressed_text" class="solr.CompressedTextField" compressionlevel="5" compressThreshold="100">

        It supports setting both the compressionLevel and a compressionThreshold. The implementation does have a couple of limitations at the moment. First, it is not compatible with Field.Store.COMPRESS so existing indexes will have to be rebuilt. Second, highlighting on this fieldtype does not appear to work. Possibly because DefaultSolrHighlighter using doc.getValues(fieldName) to retrieve the stored value but I haven't fully checked.

        Show
        Simon Jacob added a comment - I have attempted to add support for compressed stored fields by implementing a new FieldType. ie: <fieldtype name="compressed_text" class="solr.CompressedTextField" compressionlevel="5" compressThreshold="100"> It supports setting both the compressionLevel and a compressionThreshold. The implementation does have a couple of limitations at the moment. First, it is not compatible with Field.Store.COMPRESS so existing indexes will have to be rebuilt. Second, highlighting on this fieldtype does not appear to work. Possibly because DefaultSolrHighlighter using doc.getValues(fieldName) to retrieve the stored value but I haven't fully checked.
        Hide
        Kim Taylor added a comment -

        I've had a look into this. Simon is right, DefaultSolrHighlighter uses doc.getValues(fieldName) to retrieve the field. lucene.document.Document.getValues() calls stringValue on the appropriate field. The problem is that when FieldsWriter/Reader read fields from segments, the supplied CompressedField gets converted into a Field, which does not know how to interpret fieldsData. I've added another patch that alters DefaultSolrHighlighter to use the schema FieldType (in this case CompressedField) to properly interpred fieldsData.

        Show
        Kim Taylor added a comment - I've had a look into this. Simon is right, DefaultSolrHighlighter uses doc.getValues(fieldName) to retrieve the field. lucene.document.Document.getValues() calls stringValue on the appropriate field. The problem is that when FieldsWriter/Reader read fields from segments, the supplied CompressedField gets converted into a Field, which does not know how to interpret fieldsData. I've added another patch that alters DefaultSolrHighlighter to use the schema FieldType (in this case CompressedField) to properly interpred fieldsData.
        Hide
        Kim Taylor added a comment -

        Patch updated to modify DefaultSolrHighlighter

        Show
        Kim Taylor added a comment - Patch updated to modify DefaultSolrHighlighter
        Hide
        Pieter added a comment -

        Doesn't the new Lucene 4.1 field compression (LUCENE-4226 if I am right) tackle this?

        Show
        Pieter added a comment - Doesn't the new Lucene 4.1 field compression ( LUCENE-4226 if I am right) tackle this?
        Hide
        David Smiley added a comment -

        LUCENE-4226 basically does this but you can't configure codecs; you pick a codec in its default mode. The Compressing codec defaults to "fast" and yields ~50% savings based on Adrien's tests of a "small to medium" sized index: https://issues.apache.org/jira/browse/LUCENE-4226?focusedCommentId=13451708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13451708

        But what I'd like to see is the ability to compress a large text field (alone), for the purposes of highlighting, and much more than 50% compression. It might not be able to handle that many concurrent requests to meet response time SLAs, but some search apps aren't under high load.

        Show
        David Smiley added a comment - LUCENE-4226 basically does this but you can't configure codecs; you pick a codec in its default mode. The Compressing codec defaults to "fast" and yields ~50% savings based on Adrien's tests of a "small to medium" sized index: https://issues.apache.org/jira/browse/LUCENE-4226?focusedCommentId=13451708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13451708 But what I'd like to see is the ability to compress a large text field (alone), for the purposes of highlighting, and much more than 50% compression. It might not be able to handle that many concurrent requests to meet response time SLAs, but some search apps aren't under high load.

          People

          • Assignee:
            Unassigned
            Reporter:
            Grant Ingersoll
          • Votes:
            5 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

            • Created:
              Updated:

              Development