Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: update
    • Labels:
      None

      Description

      As discussed shortly on the mailing list (http://www.mail-archive.com/solr-user@lucene.apache.org/msg09807.html), the objective of this task is to add a maxLength property to the CopyField "command". This property simply limits the number of characters that are copied.

      This is particularly useful to avoid very slow highlighting when the index contains big documents.

      Example :
      <copyField source="text" dest="highlight" maxLength="30000" />

      This approach has also the advantage of limiting the index size for large documents (the original text field does not need to be stored and to have term vectors). However, the index is bigger for small documents...

      1. SOLR-538-for-1.3.patch
        42 kB
        Koji Sekiguchi
      2. SOLR-538.patch
        43 kB
        Lars Kotthoff
      3. SOLR-538.patch
        43 kB
        Lars Kotthoff
      4. SOLR-538.patch
        43 kB
        Lars Kotthoff
      5. SOLR-538.patch
        43 kB
        Lars Kotthoff
      6. SOLR-538.patch
        43 kB
        Chris Harris
      7. SOLR-538.patch
        44 kB
        Lars Kotthoff
      8. SOLR-538.patch
        43 kB
        Chris Harris
      9. SOLR-538.patch
        43 kB
        Chris Harris
      10. CopyFieldMaxLength.patch
        57 kB
        Georgios Stamatis
      11. CopyFieldMaxLength.patch
        42 kB
        Georgios Stamatis

        Activity

        Hide
        Georgios Stamatis added a comment -

        Should be applied on svn trunk (tested on revision 647855)

        Show
        Georgios Stamatis added a comment - Should be applied on svn trunk (tested on revision 647855)
        Hide
        Georgios Stamatis added a comment -

        Updated for latest trunk (tested on revision 669140)

        Show
        Georgios Stamatis added a comment - Updated for latest trunk (tested on revision 669140)
        Hide
        Otis Gospodnetic added a comment -

        This no longer applies. Could you bring it up to date? I also noticed new code that introduces new coding style (e.g. tab indentation instead of 2-space indentation) - could you please make it look like the existing Solr code style?

        $ patch -p1 -i CopyFieldMaxLength.patch --dry-run
        patching file src/java/org/apache/solr/schema/CopyField.java
        patching file src/java/org/apache/solr/schema/IndexSchema.java
        Hunk #1 succeeded at 601 with fuzz 2 (offset 44 lines).
        Hunk #2 FAILED at 615.
        Hunk #3 FAILED at 636.
        Hunk #4 FAILED at 650.
        Hunk #5 succeeded at 859 (offset 67 lines).
        Hunk #6 succeeded at 884 (offset 67 lines).
        Hunk #7 succeeded at 1065 (offset 67 lines).
        Hunk #8 FAILED at 1086.
        Hunk #9 succeeded at 1099 (offset 67 lines).
        Hunk #10 succeeded at 1112 (offset 67 lines).
        4 out of 10 hunks FAILED – saving rejects to file src/java/org/apache/solr/schema/IndexSchema.java.rej
        patching file src/java/org/apache/solr/update/DocumentBuilder.java
        Hunk #2 succeeded at 121 (offset 29 lines).
        Hunk #3 succeeded at 129 (offset 29 lines).
        Hunk #4 succeeded at 219 (offset 29 lines).
        Hunk #5 succeeded at 247 (offset 29 lines).
        Hunk #6 succeeded at 259 (offset 29 lines).
        patching file src/test/org/apache/solr/schema/CopyFieldTest.java
        patching file src/test/test-files/solr/conf/schema-copyfield-test.xml

        Show
        Otis Gospodnetic added a comment - This no longer applies. Could you bring it up to date? I also noticed new code that introduces new coding style (e.g. tab indentation instead of 2-space indentation) - could you please make it look like the existing Solr code style? $ patch -p1 -i CopyFieldMaxLength.patch --dry-run patching file src/java/org/apache/solr/schema/CopyField.java patching file src/java/org/apache/solr/schema/IndexSchema.java Hunk #1 succeeded at 601 with fuzz 2 (offset 44 lines). Hunk #2 FAILED at 615. Hunk #3 FAILED at 636. Hunk #4 FAILED at 650. Hunk #5 succeeded at 859 (offset 67 lines). Hunk #6 succeeded at 884 (offset 67 lines). Hunk #7 succeeded at 1065 (offset 67 lines). Hunk #8 FAILED at 1086. Hunk #9 succeeded at 1099 (offset 67 lines). Hunk #10 succeeded at 1112 (offset 67 lines). 4 out of 10 hunks FAILED – saving rejects to file src/java/org/apache/solr/schema/IndexSchema.java.rej patching file src/java/org/apache/solr/update/DocumentBuilder.java Hunk #2 succeeded at 121 (offset 29 lines). Hunk #3 succeeded at 129 (offset 29 lines). Hunk #4 succeeded at 219 (offset 29 lines). Hunk #5 succeeded at 247 (offset 29 lines). Hunk #6 succeeded at 259 (offset 29 lines). patching file src/test/org/apache/solr/schema/CopyFieldTest.java patching file src/test/test-files/solr/conf/schema-copyfield-test.xml
        Hide
        Lars Kotthoff added a comment -

        Attaching new patch which applies to current trunk and changes some minor things wrt error handling and testing.

        Show
        Lars Kotthoff added a comment - Attaching new patch which applies to current trunk and changes some minor things wrt error handling and testing.
        Hide
        Otis Gospodnetic added a comment -

        Lars, do you think you could change those tabs into 2 spaces to follow the code style convention?

        In CopyFieldTest I see things like:

        + try

        { + new CopyField(null, + new SchemaField("destination", new TextField()), 1000); + fail("CopyField failed with null SchemaField argument."); + }

        catch (IllegalArgumentException e) {
        + assertTrue(e.getLocalizedMessage().contains("can't be NULL"));

        Does the assertTrue message really accurately describe the exception? What is null in the above case? What happens when maxLength="0"? The previous/default copy-everything behaviour, right? Maybe I missed it, but I didn't see that in any javadocs or comments or the example schema.xml.

        Show
        Otis Gospodnetic added a comment - Lars, do you think you could change those tabs into 2 spaces to follow the code style convention? In CopyFieldTest I see things like: + try { + new CopyField(null, + new SchemaField("destination", new TextField()), 1000); + fail("CopyField failed with null SchemaField argument."); + } catch (IllegalArgumentException e) { + assertTrue(e.getLocalizedMessage().contains("can't be NULL")); Does the assertTrue message really accurately describe the exception? What is null in the above case? What happens when maxLength="0"? The previous/default copy-everything behaviour, right? Maybe I missed it, but I didn't see that in any javadocs or comments or the example schema.xml.
        Hide
        Lars Kotthoff added a comment -

        Attaching new patch which fixes tabs.

        Does the assertTrue message really accurately describe the exception?

        That's actually not the assertTrue message, that's asserting that the exception message contains that string.

        The previous/default copy-everything behaviour, right? Maybe I missed it, but I didn't see that in any javadocs or comments or the example schema.xml.

        I've added a note to the schema.xml. It is also mentioned in the javadocs for the CopyField constructor.

        Show
        Lars Kotthoff added a comment - Attaching new patch which fixes tabs. Does the assertTrue message really accurately describe the exception? That's actually not the assertTrue message, that's asserting that the exception message contains that string. The previous/default copy-everything behaviour, right? Maybe I missed it, but I didn't see that in any javadocs or comments or the example schema.xml. I've added a note to the schema.xml. It is also mentioned in the javadocs for the CopyField constructor.
        Hide
        Chris Harris added a comment -

        When I apply this patch, the last line of test-files/solr/conf/schema-copyfield-test.xml seems to get cut off. That is, rather than

        [...]
        <!-- Similarity is the scoring routine for each document vs a query.
        A custom similarity may be specified here, but the default is fine
        for most applications.
        -->
        <!-- <similarity class="org.apache.lucene.search.DefaultSimilarity"/> -->
        blank line
        </schema>
        EOF

        I get

        [...]
        <!-- Similarity is the scoring routine for each document vs a query.
        A custom similarity may be specified here, but the default is fine
        for most applications.
        -->
        <!-- <similarity class="org.apache.lucene.search.DefaultSimilarity"/> -->
        blank line
        EOF

        This makes the XML invalid, and makes "ant test" fail.

        I thought I was just being inept with Windows/TortoiseSVN, but now I've had the same thing happen when applying the patch with the patch command on the OS X command line. This makes me suspicious that there might be something wrong with the patch – though I can't find anything wrong by looking at it manually. Any thoughts?

        Show
        Chris Harris added a comment - When I apply this patch, the last line of test-files/solr/conf/schema-copyfield-test.xml seems to get cut off. That is, rather than [...] <!-- Similarity is the scoring routine for each document vs a query. A custom similarity may be specified here, but the default is fine for most applications. --> <!-- <similarity class="org.apache.lucene.search.DefaultSimilarity"/> --> blank line </schema> EOF I get [...] <!-- Similarity is the scoring routine for each document vs a query. A custom similarity may be specified here, but the default is fine for most applications. --> <!-- <similarity class="org.apache.lucene.search.DefaultSimilarity"/> --> blank line EOF This makes the XML invalid, and makes "ant test" fail. I thought I was just being inept with Windows/TortoiseSVN, but now I've had the same thing happen when applying the patch with the patch command on the OS X command line. This makes me suspicious that there might be something wrong with the patch – though I can't find anything wrong by looking at it manually. Any thoughts?
        Hide
        Lars Kotthoff added a comment -

        When I apply this patch, the last line of test-files/solr/conf/schema-copyfield-test.xml seems to get cut off.

        Sorry, my bad. That's what you get for modifying diffs manually. I'm attaching a new version which should fix this problem.

        Show
        Lars Kotthoff added a comment - When I apply this patch, the last line of test-files/solr/conf/schema-copyfield-test.xml seems to get cut off. Sorry, my bad. That's what you get for modifying diffs manually. I'm attaching a new version which should fix this problem.
        Hide
        Chris Harris added a comment -

        Thanks, Lars; that was fast. I think this patch is going to be handy.

        I'm wondering what people thought about an alternative approach to keeping stored fields from being too large, which would require mucking around with Lucene. In particular, the idea would be to allow field definitions like this:

        <field name="body" type="text" indexed="true" stored="true"
        omitNorms="false" compressed="true"
        maxFieldLength="2000" storeOnlyAnalyzedText="true"
        />

        Here we've made the normal Lucene maxFieldLength (i.e. # tokens to analyze) configurable a field-by-field basis. And in this declaration we've also made it so that what is stored is a function of what is analyzed. (Here if the first 2,000 tokens correspond to the first, say, 8,000 characters, then those 8,000 characters are what's going to be actually stored in the stored field.) This seems a little more natural than lopping off the text after a fixed number of characters.

        If I could do the above, I'm thinking I would use that single field for both searching and highlighting. But if you wanted a separate field for highlighting (and were willing to have things run slower than with the current patch), then you could do this:

        <field name="body" type="text" indexed="true" stored="false" omitNorms="false" />
        <field name="highlighting" type="text" indexed="false" stored="true"
        compressed="true" maxFieldLength="2000" storeOnlyAnalyzedText="true" />
        <copyField src="body" dest="highlighting" />

        Show
        Chris Harris added a comment - Thanks, Lars; that was fast. I think this patch is going to be handy. I'm wondering what people thought about an alternative approach to keeping stored fields from being too large, which would require mucking around with Lucene. In particular, the idea would be to allow field definitions like this: <field name="body" type="text" indexed="true" stored="true" omitNorms="false" compressed="true" maxFieldLength="2000" storeOnlyAnalyzedText="true" /> Here we've made the normal Lucene maxFieldLength (i.e. # tokens to analyze) configurable a field-by-field basis. And in this declaration we've also made it so that what is stored is a function of what is analyzed. (Here if the first 2,000 tokens correspond to the first, say, 8,000 characters, then those 8,000 characters are what's going to be actually stored in the stored field.) This seems a little more natural than lopping off the text after a fixed number of characters. If I could do the above, I'm thinking I would use that single field for both searching and highlighting. But if you wanted a separate field for highlighting (and were willing to have things run slower than with the current patch), then you could do this: <field name="body" type="text" indexed="true" stored="false" omitNorms="false" /> <field name="highlighting" type="text" indexed="false" stored="true" compressed="true" maxFieldLength="2000" storeOnlyAnalyzedText="true" /> <copyField src="body" dest="highlighting" />
        Hide
        Lars Kotthoff added a comment -

        Interesting idea, but this should probably be a separate issue. It would require more significant changes, for example the update handler should probably warn when the value for a field is truncated etc.

        Show
        Lars Kotthoff added a comment - Interesting idea, but this should probably be a separate issue. It would require more significant changes, for example the update handler should probably warn when the value for a field is truncated etc.
        Hide
        Lars Kotthoff added a comment -

        Syncing patch with trunk.

        Show
        Lars Kotthoff added a comment - Syncing patch with trunk.
        Hide
        Chris Harris added a comment -

        Syncing w/ trunk r712067

        Show
        Chris Harris added a comment - Syncing w/ trunk r712067
        Hide
        Koji Sekiguchi added a comment - - edited

        Thank you for the updated patch. I have some comments.

        • I think this breaks back-compat:
        Index: src/java/org/apache/solr/schema/IndexSchema.java
        ===================================================================
        --- src/java/org/apache/solr/schema/IndexSchema.java	(revision 712067)
        +++ src/java/org/apache/solr/schema/IndexSchema.java	(working copy)
          :
        @@ -654,12 +664,12 @@
            * 
            * @see SolrCoreAware
            */
        -  public void registerCopyField( String source, String dest )
        +  public void registerCopyField( String source, String dest, Integer maxLength )
        

        and the type of maxLength is int (not Integer) right?

        • Can you use 2-space indentation instead of tab indentation?
        • Can you change "@since 1.3" javadoc comments to "@since 1.4"?
        • The new attribute "maxLength" may make users confused (number of chars or tokens?). How about using "maxChars" or something?
        Show
        Koji Sekiguchi added a comment - - edited Thank you for the updated patch. I have some comments. I think this breaks back-compat: Index: src/java/org/apache/solr/schema/IndexSchema.java =================================================================== --- src/java/org/apache/solr/schema/IndexSchema.java (revision 712067) +++ src/java/org/apache/solr/schema/IndexSchema.java (working copy) : @@ -654,12 +664,12 @@ * * @see SolrCoreAware */ - public void registerCopyField( String source, String dest ) + public void registerCopyField( String source, String dest, Integer maxLength ) and the type of maxLength is int (not Integer) right? Can you use 2-space indentation instead of tab indentation? Can you change "@since 1.3" javadoc comments to "@since 1.4"? The new attribute "maxLength" may make users confused (number of chars or tokens?). How about using "maxChars" or something?
        Hide
        Koji Sekiguchi added a comment -

        Attached a patch for branch-1.3. I have a performance issue on a highlight field and I'll see if this helps for my project in next week (I know that hl.maxAnalyzedChars helps, but I'd like to reduce index size with SOLR-538).

        Show
        Koji Sekiguchi added a comment - Attached a patch for branch-1.3. I have a performance issue on a highlight field and I'll see if this helps for my project in next week (I know that hl.maxAnalyzedChars helps, but I'd like to reduce index size with SOLR-538 ).
        Hide
        Lars Kotthoff added a comment -

        Attaching new patch which

        • adds method for backwards compatibility
        • replaces tabs with 2 spaces
        • fixes javadoc
        • renames maxLength to maxChars
          as per Koji's suggestions.
        Show
        Lars Kotthoff added a comment - Attaching new patch which adds method for backwards compatibility replaces tabs with 2 spaces fixes javadoc renames maxLength to maxChars as per Koji's suggestions.
        Hide
        Chris Harris added a comment -

        Small change to bring Lars' 2008-11-08 version of SOLR-538.patch in sync with trunk r719187

        Show
        Chris Harris added a comment - Small change to bring Lars' 2008-11-08 version of SOLR-538 .patch in sync with trunk r719187
        Hide
        Chris Harris added a comment -

        Same as prev SOLR-538.patch (2008-11-21 02:21 PM), except with some unneeded carriage return characters removed. (This may be overly cautious, but I don't want those to cause problems for anyone.)

        Show
        Chris Harris added a comment - Same as prev SOLR-538 .patch (2008-11-21 02:21 PM), except with some unneeded carriage return characters removed. (This may be overly cautious, but I don't want those to cause problems for anyone.)
        Hide
        Shalin Shekhar Mangar added a comment -

        Looking at the watchers, voters and the comments, a lot of people seem to be interested in this issue. Is there any objection on committing it to trunk?

        Koji, since you have used the patch yourself, would you like to take this up?

        Show
        Shalin Shekhar Mangar added a comment - Looking at the watchers, voters and the comments, a lot of people seem to be interested in this issue. Is there any objection on committing it to trunk? Koji, since you have used the patch yourself, would you like to take this up?
        Hide
        Koji Sekiguchi added a comment -

        Committed revision 721758. Thanks everyone!

        Show
        Koji Sekiguchi added a comment - Committed revision 721758. Thanks everyone!
        Hide
        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4

        Show
        Grant Ingersoll added a comment - Bulk close for Solr 1.4

          People

          • Assignee:
            Koji Sekiguchi
            Reporter:
            Nicolas Dessaigne
          • Votes:
            2 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development