Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.2
    • Fix Version/s: 2.4
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      To my understanding, Lucene 2.3 will easily index large documents. So shouldn't we get rid of the 10,000 default limit for the field length? 10,000 isn't that much and as Lucene doesn't have any error logging by default, this is a common problem for users that is difficult to debug if you don't know where to look.

      A better new default might be Integer.MAX_VALUE.

      1. LUCENE-1084.part3.patch
        4 kB
        Steve Rowe
      2. LUCENE-1084.part2.take2.patch
        7 kB
        Steve Rowe
      3. LUCENE-1084.part2.patch
        2 kB
        Steve Rowe
      4. LUCENE-1084.patch
        64 kB
        Steve Rowe

        Activity

        Hide
        Michael McCandless added a comment -

        +1

        Users frequently trip up on this one, and the fact that we silently truncate at 10,000 tokens is "unexpected". I think the default should be Integer.MAX_VALUE.

        How about doing this for 2.4?

        Show
        Michael McCandless added a comment - +1 Users frequently trip up on this one, and the fact that we silently truncate at 10,000 tokens is "unexpected". I think the default should be Integer.MAX_VALUE. How about doing this for 2.4?
        Hide
        Grant Ingersoll added a comment -

        Does this break back-compatibility if we change this in a minor version?

        Show
        Grant Ingersoll added a comment - Does this break back-compatibility if we change this in a minor version?
        Hide
        Michael McCandless added a comment -

        Does this break back-compatibility if we change this in a minor version?

        Technically it does, but I think (guesssing?) most of the time when
        this limit kicks in, the application (if they knew about it) would
        consider it a bug? Ie, fixing this would almost always improve
        applications?

        I think most Lucene users are not aware that this truncation is
        occurring (it's not something you'd expect).

        Or, are we thinking that there are quite a few applications where
        truly enormous documents are indexed and this change is going to cause
        massive slowdown for such applications?

        Show
        Michael McCandless added a comment - Does this break back-compatibility if we change this in a minor version? Technically it does, but I think (guesssing?) most of the time when this limit kicks in, the application (if they knew about it) would consider it a bug? Ie, fixing this would almost always improve applications? I think most Lucene users are not aware that this truncation is occurring (it's not something you'd expect). Or, are we thinking that there are quite a few applications where truly enormous documents are indexed and this change is going to cause massive slowdown for such applications?
        Hide
        Grant Ingersoll added a comment -

        I don't know the answer to those questions, I just know we faced similar issues with changing the StandardTokenizer. It was clearly doing something wrong (incorrectly marking acronyms), yet we had to do the deprecation dance to handle it b/c there may be applications that rely on that functionality. This one seems a little less problematic, I agree. There may, however, be applications out there that occasionally hit a really large document and don't want it to throw off their indexing, so they truncate it.

        I think we ought to have a separate discussion on our back compatibility standards on java-dev, as I, personally, think we have some room for improvement when it comes to these kinds of things. I will kick it off, as it has been something I have been mulling for a while.

        Show
        Grant Ingersoll added a comment - I don't know the answer to those questions, I just know we faced similar issues with changing the StandardTokenizer. It was clearly doing something wrong (incorrectly marking acronyms), yet we had to do the deprecation dance to handle it b/c there may be applications that rely on that functionality. This one seems a little less problematic, I agree. There may, however, be applications out there that occasionally hit a really large document and don't want it to throw off their indexing, so they truncate it. I think we ought to have a separate discussion on our back compatibility standards on java-dev, as I, personally, think we have some room for improvement when it comes to these kinds of things. I will kick it off, as it has been something I have been mulling for a while.
        Hide
        Doug Cutting added a comment -

        This kind of limit is common on web search engines. It prevents really big pages that crawlers find causing indexing and search from blowing up (think a 100MB PDF that claims it is a text file). So changing it might indeed hurt folks who're indexing uncontrolled web content.

        Show
        Doug Cutting added a comment - This kind of limit is common on web search engines. It prevents really big pages that crawlers find causing indexing and search from blowing up (think a 100MB PDF that claims it is a text file). So changing it might indeed hurt folks who're indexing uncontrolled web content.
        Hide
        Steve Rowe added a comment -

        An alternative to changing the default setting would be to not have a default - make it a required parameter to the IndexWriter constructor. That way, there is no silent loss (or gain) of content - the user must specify.

        Maybe the type of this extra parameter could be an enumeration, along the lines of BooleanClause.Occur, e.g. IndexWriter.FieldLength.LIMITED and IndexWriter.FieldLength.UNLIMITED. LIMITED would set the max to the current default, setMaxFieldLimit() would operate as it does currently, and UNLIMITED would set the max to Integer.MAX_VALUE.

        This solution would definitely be a backward-incompatible change, and would have to wait for the 3.0 release.

        Show
        Steve Rowe added a comment - An alternative to changing the default setting would be to not have a default - make it a required parameter to the IndexWriter constructor. That way, there is no silent loss (or gain) of content - the user must specify. Maybe the type of this extra parameter could be an enumeration, along the lines of BooleanClause.Occur, e.g. IndexWriter.FieldLength.LIMITED and IndexWriter.FieldLength.UNLIMITED. LIMITED would set the max to the current default, setMaxFieldLimit() would operate as it does currently, and UNLIMITED would set the max to Integer.MAX_VALUE. This solution would definitely be a backward-incompatible change, and would have to wait for the 3.0 release.
        Hide
        Michael McCandless added a comment -

        This kind of limit is common on web search engines. It prevents really big pages that crawlers find causing indexing and search from blowing up (think a 100MB PDF that claims it is a text file). So changing it might indeed hurt folks who're indexing uncontrolled web content.

        OK, it seems like it's an important safeguard, and risky to change, so
        let's wait for 3.0.

        Maybe we could increase it from 10K --> 100K to reduce the times when
        a legit document is truncated?

        An alternative to changing the default setting would be to not have a default - make it a required parameter to the IndexWriter constructor. That way, there is no silent loss (or gain) of content - the user must specify.

        I think this is a good idea; it basically forces the user to confront
        the truncation issue up front.

        Show
        Michael McCandless added a comment - This kind of limit is common on web search engines. It prevents really big pages that crawlers find causing indexing and search from blowing up (think a 100MB PDF that claims it is a text file). So changing it might indeed hurt folks who're indexing uncontrolled web content. OK, it seems like it's an important safeguard, and risky to change, so let's wait for 3.0. Maybe we could increase it from 10K --> 100K to reduce the times when a legit document is truncated? An alternative to changing the default setting would be to not have a default - make it a required parameter to the IndexWriter constructor. That way, there is no silent loss (or gain) of content - the user must specify. I think this is a good idea; it basically forces the user to confront the truncation issue up front.
        Hide
        Steve Rowe added a comment -

        Attaching a patch implementing my suggestion to add an explicit maximum field length parameter to IndexWriter constructors. I named the inner class MaxFieldLength instead of FieldLength.

        The patch deprecates the pre-existing constructors, and adds a new constructor with the added parameter for each of the pre-existing constructors. As a result, the patch could be applied to the trunk as soon as 2.3 has been released; the deprecated pre-existing constructors would then be removed as part of the 3.0 release.

        The patch also replaces all IndexWriter constructor calls in the tree with their equivalents taking the explicit max field length parameter. (I excluded IndexModifier, because it is already scheduled for removal as part of the 3.0 release.)

        I see that in o.a.l.demo.IndexHTML.java, the max field length is set significantly above the current default limit:

        71: writer = new IndexWriter(index, new StandardAnalyzer(), create);
        72: writer.setMaxFieldLength(1000000);

        Should the above two lines be replaced with the following? (I did not do this in the attached patch):

        writer = new IndexWriter(index, new StandardAnalyzer(), create,
        IndexWriter.MaxFieldLength.UNLIMITED);

        That is, should the demo use Integer.MAX_VALUE instead of 1,000,000 for the maximum field length?

        Show
        Steve Rowe added a comment - Attaching a patch implementing my suggestion to add an explicit maximum field length parameter to IndexWriter constructors. I named the inner class MaxFieldLength instead of FieldLength. The patch deprecates the pre-existing constructors, and adds a new constructor with the added parameter for each of the pre-existing constructors. As a result, the patch could be applied to the trunk as soon as 2.3 has been released; the deprecated pre-existing constructors would then be removed as part of the 3.0 release. The patch also replaces all IndexWriter constructor calls in the tree with their equivalents taking the explicit max field length parameter. (I excluded IndexModifier, because it is already scheduled for removal as part of the 3.0 release.) I see that in o.a.l.demo.IndexHTML.java, the max field length is set significantly above the current default limit: 71: writer = new IndexWriter(index, new StandardAnalyzer(), create); 72: writer.setMaxFieldLength(1000000); Should the above two lines be replaced with the following? (I did not do this in the attached patch): writer = new IndexWriter(index, new StandardAnalyzer(), create, IndexWriter.MaxFieldLength.UNLIMITED); That is, should the demo use Integer.MAX_VALUE instead of 1,000,000 for the maximum field length?
        Hide
        Michael McCandless added a comment -

        This patch looks great; thanks Steven!

        This forces users to confront, up front, when creating IndexWriter,
        whether they want to truncate the fields.

        It deprecates all existing IndexWriter ctors, noting that in 3.0 they
        will be removed, in favor of the version that adds the MaxFieldLength
        argument.

        We can commit this now, and then removed the deprecated APIs in 3.0.
        I think with this change it's OK to leave the default truncation limit
        at 10,000?

        I think we can leave the IndexHTML.java with that current large limit?

        I plan to commit in a day or two!

        Show
        Michael McCandless added a comment - This patch looks great; thanks Steven! This forces users to confront, up front, when creating IndexWriter, whether they want to truncate the fields. It deprecates all existing IndexWriter ctors, noting that in 3.0 they will be removed, in favor of the version that adds the MaxFieldLength argument. We can commit this now, and then removed the deprecated APIs in 3.0. I think with this change it's OK to leave the default truncation limit at 10,000? I think we can leave the IndexHTML.java with that current large limit? I plan to commit in a day or two!
        Hide
        Michael McCandless added a comment -

        I just committed this. Thanks Daniel for opening it, and thanks Steven for the patch!

        Show
        Michael McCandless added a comment - I just committed this. Thanks Daniel for opening it, and thanks Steven for the patch!
        Hide
        Grant Ingersoll added a comment -

        Sorry for the lateness on this, was just looking at another patch that had the deprecated constructor and was going to replace it...

        How, exactly, is one supposed to specify an alternate value for the MaxFieldLength in the constructor when using the IndexWriter.MaxFieldLength constructor when it is private and not extendable? Or is it that I pass in a LIMITED or UNLIMITED and then immediately go call indexWriter.setMaxFieldLength()? That doesn't make sense to me from a user's point of view of trying to understand how to set the maxFieldLength, so I feel like I must be missing something.

        Show
        Grant Ingersoll added a comment - Sorry for the lateness on this, was just looking at another patch that had the deprecated constructor and was going to replace it... How, exactly, is one supposed to specify an alternate value for the MaxFieldLength in the constructor when using the IndexWriter.MaxFieldLength constructor when it is private and not extendable? Or is it that I pass in a LIMITED or UNLIMITED and then immediately go call indexWriter.setMaxFieldLength()? That doesn't make sense to me from a user's point of view of trying to understand how to set the maxFieldLength, so I feel like I must be missing something.
        Hide
        Michael McCandless added a comment -

        Yeah, the intention here was that you specify LIMITED or UNLIMITED enum parameter on creating the IW, and then change the actual limit using setMaxFieldLength.

        But I agree it'd be cleaner if you could actually just set the limit (once) on construction.

        I think if we stopped subclassing from Parameter (we don't really need Serializable here?), and made the ctor public, then you could pass in "new MaxFieldLength(100000)" to make your own limit?

        Show
        Michael McCandless added a comment - Yeah, the intention here was that you specify LIMITED or UNLIMITED enum parameter on creating the IW, and then change the actual limit using setMaxFieldLength. But I agree it'd be cleaner if you could actually just set the limit (once) on construction. I think if we stopped subclassing from Parameter (we don't really need Serializable here?), and made the ctor public, then you could pass in "new MaxFieldLength(100000)" to make your own limit?
        Hide
        Steve Rowe added a comment -

        Reopening to attach a patch to allow for (variable) user specification of maximum field length in IndexWriter constructors.

        Show
        Steve Rowe added a comment - Reopening to attach a patch to allow for (variable) user specification of maximum field length in IndexWriter constructors.
        Hide
        Steve Rowe added a comment -

        I think if we stopped subclassing from Parameter (we don't really need Serializable here?), and made the ctor public, then you could pass in "new MaxFieldLength(100000)" to make your own limit?

        +1

        Attaching a patch implementing this idea.

        Show
        Steve Rowe added a comment - I think if we stopped subclassing from Parameter (we don't really need Serializable here?), and made the ctor public, then you could pass in "new MaxFieldLength(100000)" to make your own limit? +1 Attaching a patch implementing this idea.
        Hide
        Steve Rowe added a comment -

        The javadoc description of the MaxFieldLength parameter (mfl) in the IndexWriter constructors is now "whether or not to limit field lengths".

        This should probably be rewritten in light of the change allowing user specification of values other than LIMITED or UNLIMITED. How's this?:

        /** ...
         * @param mfl Maximum field length: LIMITED, UNLIMITED, or user-specified
         *   via the MaxFieldLength constructor.
         * ...
         */
        

        Attaching part2.take2 patch including this javadoc change and the MaxFieldLength extension and constructor changes.

        Show
        Steve Rowe added a comment - The javadoc description of the MaxFieldLength parameter (mfl) in the IndexWriter constructors is now "whether or not to limit field lengths". This should probably be rewritten in light of the change allowing user specification of values other than LIMITED or UNLIMITED. How's this?: /** ... * @param mfl Maximum field length: LIMITED, UNLIMITED, or user-specified * via the MaxFieldLength constructor. * ... */ Attaching part2.take2 patch including this javadoc change and the MaxFieldLength extension and constructor changes.
        Hide
        Michael McCandless added a comment -

        This looks good! Thanks Steven. I'll commit.

        Show
        Michael McCandless added a comment - This looks good! Thanks Steven. I'll commit.
        Hide
        Steve Rowe added a comment -

        Mike, I see you added a test for the user-specified max field length - cool.

        It made me think of the IndexHTML.java usage, which should probably be changed to conform to the new style:

              writer = new IndexWriter(index, new StandardAnalyzer(), create, 
                                       IndexWriter.MaxFieldLength.LIMITED);
              writer.setMaxFieldLength(1000000);
        

        should be:

              writer = new IndexWriter(index, new StandardAnalyzer(), create, 
                                       new IndexWriter.MaxFieldLength(1000000));
        

        Hmm, now that I look, I can see several other new IndexWriter() / setMaxFieldLength() sequences that should be changed ... I'll submit a patch shortly.

        Show
        Steve Rowe added a comment - Mike, I see you added a test for the user-specified max field length - cool. It made me think of the IndexHTML.java usage, which should probably be changed to conform to the new style: writer = new IndexWriter(index, new StandardAnalyzer(), create, IndexWriter.MaxFieldLength.LIMITED); writer.setMaxFieldLength(1000000); should be: writer = new IndexWriter(index, new StandardAnalyzer(), create, new IndexWriter.MaxFieldLength(1000000)); Hmm, now that I look, I can see several other new IndexWriter() / setMaxFieldLength() sequences that should be changed ... I'll submit a patch shortly.
        Hide
        Steve Rowe added a comment -

        Patch replacing [IW constructor, MFL setter] sequences with IW constructor taking new'd MFL.

        Show
        Steve Rowe added a comment - Patch replacing [IW constructor, MFL setter] sequences with IW constructor taking new'd MFL.
        Hide
        Michael McCandless added a comment -

        Good – I'll commit. Thanks!

        Show
        Michael McCandless added a comment - Good – I'll commit. Thanks!

          People

          • Assignee:
            Steve Rowe
            Reporter:
            Daniel Naber
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development