Details

    • Type: Task
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: master (7.0), 6.5
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Follow-up of this comment: https://issues.apache.org/jira/browse/LUCENE-6818?focusedCommentId=14934801&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14934801

      Index-time boosts are a very expert feature whose behaviour is tight to the Similarity impl. Additionally users have often be confused by the poor precision due to the fact that we encode values on a single byte. But now we have doc values that allow you to encode any values the way you want with as much precision as you need so maybe we should deprecate index-time boosts and recommend to encode index-time scoring factors into doc values fields instead.

      1. LUCENE-6819.patch
        210 kB
        Adrien Grand
      2. LUCENE-6819.patch
        205 kB
        Adrien Grand
      3. LUCENE-6819-deprecation.patch
        12 kB
        Adrien Grand
      4. LUCENE-6819-deprecation.patch
        11 kB
        Adrien Grand
      5. LUCENE-6819-deprecation.patch
        6 kB
        Adrien Grand
      6. LUCENE-6819-wip.patch
        3 kB
        Adrien Grand

        Issue Links

          Activity

          Hide
          jpountz Adrien Grand added a comment -

          An interesting side-effect of such a change is that we could make the length normalization factors more accurate. If we remove index-time boosts, then length normalization factors would always be less than or equal to 1 so we could take one bit from the exponent and use it for the mantissa instead.

          Show
          jpountz Adrien Grand added a comment - An interesting side-effect of such a change is that we could make the length normalization factors more accurate. If we remove index-time boosts, then length normalization factors would always be less than or equal to 1 so we could take one bit from the exponent and use it for the mantissa instead.
          Hide
          dsmiley David Smiley added a comment -

          Index time boosts are valuable. I've found that tuning index time boosts is the easiest way to boost documents that has the most predictable effect, relative to query time boosts. Of course it's inflexible but it's a trade-off.

          Show
          dsmiley David Smiley added a comment - Index time boosts are valuable. I've found that tuning index time boosts is the easiest way to boost documents that has the most predictable effect, relative to query time boosts. Of course it's inflexible but it's a trade-off.
          Hide
          jpountz Adrien Grand added a comment -

          I agree index-time and search-time boosting have different trade-offs that may both be interesting. The problem I have is that supporting index-time boosts means that length norm is less accurate for everyone. Right now if you do not use index-time boosts, which I think is the case for a majority of users, you end up with a length norm that is between 0 and 1 (1/sqrt(fieldLen)). The length norm may only be greater than 1 if you use a boost that is greater than 1. Out of the 256 values that SmallFloat.byte315ToFloat supports, only 125 of them are less than or equal to 1, the other 131 values are all greater than 1. Said otherwise, more than half the norm values we support are wasted if you do not use index-time boosts.

          If instead we could assume that norms were always between 0 and 1, we could take one bit from the exponent and spend it on the mantissa instead to improve accuracy. For instance I rebuilt the table that had been built for LUCENE-5005 and expanded it with a couple more length values, as well as what the rounded norm would be if we spent 1 more bit on the mantissa (while still being able to encode the norm on a single byte, see the float415 column):

          numTerms 1/sqrt(numTerms) 1/sqrt(numTerms) to float315 1/sqrt(numTerms) to float415
          1 1.0 1.0 1.0
          2 0.70710677 0.625 0.6875
          3 0.57735026 0.5 0.5625
          4 0.5 0.5 0.5
          5 0.4472136 0.4375 0.4375
          6 0.4082483 0.375 0.40625
          7 0.37796447 0.375 0.375
          8 0.35355338 0.3125 0.34375
          9 0.33333334 0.3125 0.3125
          10 0.31622776 0.3125 0.3125
          11 0.30151135 0.25 0.28125
          12 0.28867513 0.25 0.28125
          13 0.2773501 0.25 0.25
          14 0.26726124 0.25 0.25
          15 0.2581989 0.25 0.25
          16 0.25 0.25 0.25
          17 0.24253562 0.21875 0.234375
          18 0.23570226 0.21875 0.234375
          19 0.22941573 0.21875 0.21875
          20 0.2236068 0.21875 0.21875

          Something I really like about it is that for all length values between 1 and 9 included, you get different values for the rounded norms. I have seen several users asking why "A B C D" would score as well as "A B C" when the query is eg. "A" in spite of being longer, and if we could get this addressed for short fields (think eg. product names), I think that would be a great win.

          Show
          jpountz Adrien Grand added a comment - I agree index-time and search-time boosting have different trade-offs that may both be interesting. The problem I have is that supporting index-time boosts means that length norm is less accurate for everyone . Right now if you do not use index-time boosts, which I think is the case for a majority of users, you end up with a length norm that is between 0 and 1 ( 1/sqrt(fieldLen) ). The length norm may only be greater than 1 if you use a boost that is greater than 1. Out of the 256 values that SmallFloat.byte315ToFloat supports, only 125 of them are less than or equal to 1, the other 131 values are all greater than 1. Said otherwise, more than half the norm values we support are wasted if you do not use index-time boosts. If instead we could assume that norms were always between 0 and 1, we could take one bit from the exponent and spend it on the mantissa instead to improve accuracy. For instance I rebuilt the table that had been built for LUCENE-5005 and expanded it with a couple more length values, as well as what the rounded norm would be if we spent 1 more bit on the mantissa (while still being able to encode the norm on a single byte, see the float415 column): numTerms 1/sqrt(numTerms) 1/sqrt(numTerms) to float315 1/sqrt(numTerms) to float415 1 1.0 1.0 1.0 2 0.70710677 0.625 0.6875 3 0.57735026 0.5 0.5625 4 0.5 0.5 0.5 5 0.4472136 0.4375 0.4375 6 0.4082483 0.375 0.40625 7 0.37796447 0.375 0.375 8 0.35355338 0.3125 0.34375 9 0.33333334 0.3125 0.3125 10 0.31622776 0.3125 0.3125 11 0.30151135 0.25 0.28125 12 0.28867513 0.25 0.28125 13 0.2773501 0.25 0.25 14 0.26726124 0.25 0.25 15 0.2581989 0.25 0.25 16 0.25 0.25 0.25 17 0.24253562 0.21875 0.234375 18 0.23570226 0.21875 0.234375 19 0.22941573 0.21875 0.21875 20 0.2236068 0.21875 0.21875 Something I really like about it is that for all length values between 1 and 9 included, you get different values for the rounded norms. I have seen several users asking why "A B C D" would score as well as "A B C" when the query is eg. "A" in spite of being longer, and if we could get this addressed for short fields (think eg. product names), I think that would be a great win.
          Hide
          dsmiley David Smiley added a comment - - edited

          I get your point. It's a shame that the particular use of the bits right now was decided to have both 3 terms and 4 terms produce the same norm when, IMO, there should be more fidelity for for them for the same reason you mentioned. Maybe this specifically could be rectified instead of removal of index time boosts?

          (edited/removed paragraph I reconsidered)

          On the other hand, I appreciate that removing this feature would be the simplest route to take and reduce overall complexity in Lucene. And it's not like index time boosts is a must-have; users can emulate it, albeit with some work. Maybe that could be made easier... hmmm.

          Any way; I'm not standing in your way. I'm curious what others think.

          Show
          dsmiley David Smiley added a comment - - edited I get your point. It's a shame that the particular use of the bits right now was decided to have both 3 terms and 4 terms produce the same norm when, IMO, there should be more fidelity for for them for the same reason you mentioned. Maybe this specifically could be rectified instead of removal of index time boosts? (edited/removed paragraph I reconsidered) On the other hand, I appreciate that removing this feature would be the simplest route to take and reduce overall complexity in Lucene. And it's not like index time boosts is a must-have; users can emulate it, albeit with some work. Maybe that could be made easier... hmmm. Any way; I'm not standing in your way. I'm curious what others think.
          Hide
          jpountz Adrien Grand added a comment -

          Here's a patch in case someone would like to run some relevancy tests. I goes even further and uses a completely different encoding that stores lengths in a byte. It is fully accurate up to 40 and then accuracy degrades linearly with the log of the length. It has a restriction that it does not support index boosts, but on the other hand, making assumptions that index boosts are not used allows it to make the 256 values useful, while with the current encoding, if index boosts are not used, only 63 values represent valid lengths: other values are either less than 1 or greater than MAX_VALUE.

          The patch is just a proof of concept and does not try to tackle the removal of index-time boosts or backward compatibility, which are the hard problems here.

          Show
          jpountz Adrien Grand added a comment - Here's a patch in case someone would like to run some relevancy tests. I goes even further and uses a completely different encoding that stores lengths in a byte. It is fully accurate up to 40 and then accuracy degrades linearly with the log of the length. It has a restriction that it does not support index boosts, but on the other hand, making assumptions that index boosts are not used allows it to make the 256 values useful, while with the current encoding, if index boosts are not used, only 63 values represent valid lengths: other values are either less than 1 or greater than MAX_VALUE. The patch is just a proof of concept and does not try to tackle the removal of index-time boosts or backward compatibility, which are the hard problems here.
          Hide
          thetaphi Uwe Schindler added a comment -

          +1 to remove index time boost. I always recommend to user to add doc values fields and use a function query (its just wrapping the query, very easy anyways!). About Solr users: I don't even know if it is at all possible with Solr to add index time boosts?

          Show
          thetaphi Uwe Schindler added a comment - +1 to remove index time boost. I always recommend to user to add doc values fields and use a function query (its just wrapping the query, very easy anyways!). About Solr users: I don't even know if it is at all possible with Solr to add index time boosts?
          Hide
          janhoy Jan Høydahl added a comment -

          Index time boosts are documented in https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-AddingDocuments and added like this

          <doc boost="2.0">
            <field name="foo" boost="3.0">bar</field>
          </doc>
          

          I won't be sorry if they disappear, never recommend their usage anyway

          Show
          janhoy Jan Høydahl added a comment - Index time boosts are documented in https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-AddingDocuments and added like this <doc boost= "2.0" > <field name= "foo" boost= "3.0" > bar </field> </doc> I won't be sorry if they disappear, never recommend their usage anyway
          Hide
          janhoy Jan Høydahl added a comment -

          If removal is plannend for 7.0, we must make sure at least one minor 6.x release contains deprecation and documentation heads-up, also for Solr, else I'll be -1 on committing this.

          Show
          janhoy Jan Høydahl added a comment - If removal is plannend for 7.0, we must make sure at least one minor 6.x release contains deprecation and documentation heads-up, also for Solr, else I'll be -1 on committing this.
          Hide
          jpountz Adrien Grand added a comment -

          I think it is a fair requirement to make the upgrade easier. Here is a patch that removes index-time boosts. It makes Solr ignore index-time boosts, and log a debug message when a value that is different from 1.0 is supplied, in order to not break hard for users who would go to 7.0 without making sure to never set boosts. I can change it to a hard exception if you would like it better.

          As far as 6.x is concerned, I think it would be enough to deprecate IndexableField.boost(), Field.setBoost(float) and methods in the Solr client API that take a boost parameter (most of them already have a variant that does not take a boost and implicitly sets it to 1.0, which we would keep)?

          Show
          jpountz Adrien Grand added a comment - I think it is a fair requirement to make the upgrade easier. Here is a patch that removes index-time boosts. It makes Solr ignore index-time boosts, and log a debug message when a value that is different from 1.0 is supplied, in order to not break hard for users who would go to 7.0 without making sure to never set boosts. I can change it to a hard exception if you would like it better. As far as 6.x is concerned, I think it would be enough to deprecate IndexableField.boost() , Field.setBoost(float) and methods in the Solr client API that take a boost parameter (most of them already have a variant that does not take a boost and implicitly sets it to 1.0, which we would keep)?
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          It makes Solr ignore index-time boosts, and log a debug message when a value that is different from 1.0 is supplied, in order to not break hard for users who would go to 7.0 without making sure to never set boosts.

          +1, I think this is the right approach.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - It makes Solr ignore index-time boosts, and log a debug message when a value that is different from 1.0 is supplied, in order to not break hard for users who would go to 7.0 without making sure to never set boosts. +1, I think this is the right approach.
          Hide
          janhoy Jan Høydahl added a comment -

          Wow, there is a lot of Solr code handling index time boosts
          Will you attach a separate 6x deprecation patch?

          Deprecating the relevant solrj methods sounds fine. And for the deprecation warning, perhaps that should be on WARN level but log only the first occurrence? That way people would spot this in production on 6.5, even if they did not fine-read the release notes

          Show
          janhoy Jan Høydahl added a comment - Wow, there is a lot of Solr code handling index time boosts Will you attach a separate 6x deprecation patch? Deprecating the relevant solrj methods sounds fine. And for the deprecation warning, perhaps that should be on WARN level but log only the first occurrence? That way people would spot this in production on 6.5, even if they did not fine-read the release notes
          Hide
          jpountz Adrien Grand added a comment -

          Sure, I can attach a patch for the deprecation separately.

          perhaps that should be on WARN level but log only the first occurrence?

          Is there an easy way to do this?

          Show
          jpountz Adrien Grand added a comment - Sure, I can attach a patch for the deprecation separately. perhaps that should be on WARN level but log only the first occurrence? Is there an easy way to do this?
          Hide
          thetaphi Uwe Schindler added a comment -

          "correct" way would be to use an AtomicBoolean, but a conventional, static boolean next to the LOGGER declaration should be fine, too. On concurrent use, you may get mutliple warnings for a short time until cache flush, but who cares?

          Show
          thetaphi Uwe Schindler added a comment - "correct" way would be to use an AtomicBoolean, but a conventional, static boolean next to the LOGGER declaration should be fine, too. On concurrent use, you may get mutliple warnings for a short time until cache flush, but who cares?
          Hide
          jpountz Adrien Grand added a comment -

          Here is a proposed patch for 6.x.

          Show
          jpountz Adrien Grand added a comment - Here is a proposed patch for 6.x.
          Hide
          jpountz Adrien Grand added a comment -

          I updated the master patch to use the warning level for the first message, and them debug.

          Show
          jpountz Adrien Grand added a comment - I updated the master patch to use the warning level for the first message, and them debug.
          Hide
          jpountz Adrien Grand added a comment -

          And the 6.x patch now also logs messages to warn users that this feature will go away.

          Show
          jpountz Adrien Grand added a comment - And the 6.x patch now also logs messages to warn users that this feature will go away.
          Hide
          janhoy Jan Høydahl added a comment -

          I like the Atomic part and alternating between WARN on first-time and DEBUG thereafter!

          Should the deprecation patch have a solr/CHANGES.txt entry even if it is a LUCENE issue? Well, it's a hybrid issue, and we have explicitly referenced LUCENE issues in solr CHANGES earlier.

          Show
          janhoy Jan Høydahl added a comment - I like the Atomic part and alternating between WARN on first-time and DEBUG thereafter! Should the deprecation patch have a solr/CHANGES.txt entry even if it is a LUCENE issue? Well, it's a hybrid issue, and we have explicitly referenced LUCENE issues in solr CHANGES earlier.
          Hide
          thetaphi Uwe Schindler added a comment -

          Atomic part is fine. When proposing not to use atomics I did not notice that the getAndSet does not happen for the common case (without boost), so there is no performance issue.

          Show
          thetaphi Uwe Schindler added a comment - Atomic part is fine. When proposing not to use atomics I did not notice that the getAndSet does not happen for the common case (without boost), so there is no performance issue.
          Hide
          jpountz Adrien Grand added a comment -

          Here is an updated 6.x deprecation patch with CHANGES entries. If there are no objections, I will merge shortly as these large patches do not age well and can become hard to apply in case of conflicting changes.

          Show
          jpountz Adrien Grand added a comment - Here is an updated 6.x deprecation patch with CHANGES entries. If there are no objections, I will merge shortly as these large patches do not age well and can become hard to apply in case of conflicting changes.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 5f8a6dfff65599d961b99c6ff03b70b79e2ccaf4 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5f8a6df ]

          LUCENE-6819: Deprecate index-time boosts.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 5f8a6dfff65599d961b99c6ff03b70b79e2ccaf4 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5f8a6df ] LUCENE-6819 : Deprecate index-time boosts.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 8ed2b764ed4d4d5203b5df1e16fdc1ffd640322c in lucene-solr's branch refs/heads/master from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8ed2b76 ]

          LUCENE-6819: Remove index-time boosts.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 8ed2b764ed4d4d5203b5df1e16fdc1ffd640322c in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8ed2b76 ] LUCENE-6819 : Remove index-time boosts.
          Hide
          steve_rowe Steve Rowe added a comment -

          ExtractingRequestHandlerTest.testExtraction() has been failing since the master commit on this issue https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8ed2b76, e.g. http://jenkins.sarowe.net/job/Lucene-Solr-tests-master/10364/:

             [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=ExtractingRequestHandlerTest -Dtests.method=testExtraction -Dtests.seed=F122EB76205DC618 -Dtests.slow=true -Dtests.locale=es-HN -Dtests.timezone=America/Inuvik -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
             [junit4] ERROR   0.12s J1 | ExtractingRequestHandlerTest.testExtraction <<<
             [junit4]    > Throwable #1: java.lang.RuntimeException: Exception during query
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([F122EB76205DC618:48519F085C7516ED]:0)
             [junit4]    > 	at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:919)
             [junit4]    > 	at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:886)
             [junit4]    > 	at org.apache.solr.handler.extraction.ExtractingRequestHandlerTest.testExtraction(ExtractingRequestHandlerTest.java:128)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
             [junit4]    > Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//doc[1]/str[.='simple3']
             [junit4]    > 	xml response was: <?xml version="1.0" encoding="UTF-8"?>
             [junit4]    > <response>
             [junit4]    > <lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int></lst><result name="response" numFound="2" start="0"><doc><arr name="t_meta"><str>stream_size</str><str>365</str><str>X-Parsed-By</str><str>org.apache.tika.parser.DefaultParser</str><str>X-Parsed-By</str><str>org.apache.tika.parser.html.HtmlParser</str><str>stream_content_type</str><str>application/xml</str><str>stream_name</str><str>simple.html</str><str>stream_source_info</str><str>file:/var/lib/jenkins/jobs/Lucene-Solr-tests-master/workspace/solr/contrib/extraction/src/test-files/extraction/simple.html</str><str>dc:title</str><str>Welcome to Solr</str><str>Content-Encoding</str><str>ISO-8859-1</str><str>Content-Type</str><str>text/html; charset=ISO-8859-1</str></arr><arr name="t_href"><str>rect</str><str>http://www.apache.org</str></arr><str name="id">simple2</str><arr name="stream_size"><str>365</str></arr><arr name="t_x_parsed_by"><str>org.apache.tika.parser.DefaultParser</str><str>org.apache.tika.parser.html.HtmlParser</str></arr><arr name="stream_content_type"><str>application/xml</str></arr><arr name="stream_name"><str>simple.html</str></arr><arr name="stream_source_info"><str>file:/var/lib/jenkins/jobs/Lucene-Solr-tests-master/workspace/solr/contrib/extraction/src/test-files/extraction/simple.html</str></arr><arr name="t_dc_title"><str>Welcome to Solr</str></arr><arr name="t_content_encoding"><str>ISO-8859-1</str></arr><arr name="title"><str>Welcome to Solr</str></arr><arr name="t_abcxyz"><str>text/html; charset=ISO-8859-1</str></arr><arr name="t_content"><str> 
             [junit4]    >  
             [junit4]    >   
             [junit4]    >   
             [junit4]    >   
             [junit4]    >   
             [junit4]    >   
             [junit4]    >   
             [junit4]    >   
             [junit4]    >   
             [junit4]    >   
             [junit4]    >  Welcome to Solr 
             [junit4]    >  
             [junit4]    >  
             [junit4]    >  
             [junit4]    >   Here is some text
             [junit4]    >  
             [junit4]    >  distinct
             [junit4]    > words 
             [junit4]    >  Here is some text in a div 
             [junit4]    >  This has a  link . 
             [junit4]    >   </str></arr><arr name="multiDefault"><str>muLti-Default</str></arr><int name="intDefault">42</int><date name="timestamp">2017-03-03T09:02:52.622Z</date></doc><doc><arr name="t_meta"><str>stream_size</str><str>365</str><str>X-Parsed-By</str><str>org.apache.tika.parser.DefaultParser</str><str>X-Parsed-By</str><str>org.apache.tika.parser.html.HtmlParser</str><str>stream_content_type</str><str>application/xml</str><str>stream_name</str><str>simple.html</str><str>stream_source_info</str><str>file:/var/lib/jenkins/jobs/Lucene-Solr-tests-master/workspace/solr/contrib/extraction/src/test-files/extraction/simple.html</str><str>dc:title</str><str>Welcome to Solr</str><str>Content-Encoding</str><str>ISO-8859-1</str><str>Content-Type</str><str>text/html; charset=ISO-8859-1</str></arr><arr name="t_href"><str>rect</str><str>http://www.apache.org</str></arr><str name="id">simple3</str><arr name="stream_size"><str>365</str></arr><arr name="t_x_parsed_by"><str>org.apache.tika.parser.DefaultParser</str><str>org.apache.tika.parser.html.HtmlParser</str></arr><arr name="stream_content_type"><str>application/xml</str></arr><arr name="stream_name"><str>simple.html</str></arr><arr name="stream_source_info"><str>file:/var/lib/jenkins/jobs/Lucene-Solr-tests-master/workspace/solr/contrib/extraction/src/test-files/extraction/simple.html</str></arr><arr name="t_dc_title"><str>Welcome to Solr</str></arr><arr name="t_content_encoding"><str>ISO-8859-1</str></arr><arr name="title"><str>Welcome to Solr</str></arr><arr name="t_content_type"><str>text/html; charset=ISO-8859-1</str></arr><arr name="t_content"><str> 
             [junit4]    >  
             [junit4]    >   
             [junit4]    >   
             [junit4]    >   
             [junit4]    >   
             [junit4]    >   
             [junit4]    >   
             [junit4]    >   
             [junit4]    >   
             [junit4]    >   
             [junit4]    >  Welcome to Solr 
             [junit4]    >  
             [junit4]    >  
             [junit4]    >  
             [junit4]    >   Here is some text
             [junit4]    >  
             [junit4]    >  distinct
             [junit4]    > words 
             [junit4]    >  Here is some text in a div 
             [junit4]    >  This has a  link . 
             [junit4]    >   </str></arr><arr name="multiDefault"><str>muLti-Default</str></arr><int name="intDefault">42</int><date name="timestamp">2017-03-03T09:02:52.660Z</date></doc></result>
             [junit4]    > </response>
             [junit4]    > 	request was:q=t_href:http&qt=standard&start=0&rows=20&version=2.2
             [junit4]    > 	at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:912)
          [...]
             [junit4]   2> NOTE: test params are: codec=FastCompressingStoredFields(storedFieldsFormat=CompressingStoredFieldsFormat(compressionMode=FAST, chunkSize=6, maxDocsPerChunk=8, blockSize=301), termVectorsFormat=CompressingTermVectorsFormat(compressionMode=FAST, chunkSize=6, blockSize=301)), sim=RandomSimilarity(queryNorm=true): {}, locale=es-HN, timezone=America/Inuvik
             [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_77 (64-bit)/cpus=16,threads=1,free=423346232,total=485490688
          

          I got 10/10 failures when I beasted the test using Miller's beasting script on master HEAD.

          Show
          steve_rowe Steve Rowe added a comment - ExtractingRequestHandlerTest.testExtraction() has been failing since the master commit on this issue https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8ed2b76 , e.g. http://jenkins.sarowe.net/job/Lucene-Solr-tests-master/10364/ : [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=ExtractingRequestHandlerTest -Dtests.method=testExtraction -Dtests.seed=F122EB76205DC618 -Dtests.slow=true -Dtests.locale=es-HN -Dtests.timezone=America/Inuvik -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1 [junit4] ERROR 0.12s J1 | ExtractingRequestHandlerTest.testExtraction <<< [junit4] > Throwable #1: java.lang.RuntimeException: Exception during query [junit4] > at __randomizedtesting.SeedInfo.seed([F122EB76205DC618:48519F085C7516ED]:0) [junit4] > at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:919) [junit4] > at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:886) [junit4] > at org.apache.solr.handler.extraction.ExtractingRequestHandlerTest.testExtraction(ExtractingRequestHandlerTest.java:128) [junit4] > at java.lang.Thread.run(Thread.java:745) [junit4] > Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//doc[1]/str[.='simple3'] [junit4] > xml response was: <?xml version="1.0" encoding="UTF-8"?> [junit4] > <response> [junit4] > <lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int></lst><result name="response" numFound="2" start="0"><doc><arr name="t_meta"><str>stream_size</str><str>365</str><str>X-Parsed-By</str><str>org.apache.tika.parser.DefaultParser</str><str>X-Parsed-By</str><str>org.apache.tika.parser.html.HtmlParser</str><str>stream_content_type</str><str>application/xml</str><str>stream_name</str><str>simple.html</str><str>stream_source_info</str><str>file:/var/lib/jenkins/jobs/Lucene-Solr-tests-master/workspace/solr/contrib/extraction/src/test-files/extraction/simple.html</str><str>dc:title</str><str>Welcome to Solr</str><str>Content-Encoding</str><str>ISO-8859-1</str><str>Content-Type</str><str>text/html; charset=ISO-8859-1</str></arr><arr name="t_href"><str>rect</str><str>http://www.apache.org</str></arr><str name="id">simple2</str><arr name="stream_size"><str>365</str></arr><arr name="t_x_parsed_by"><str>org.apache.tika.parser.DefaultParser</str><str>org.apache.tika.parser.html.HtmlParser</str></arr><arr name="stream_content_type"><str>application/xml</str></arr><arr name="stream_name"><str>simple.html</str></arr><arr name="stream_source_info"><str>file:/var/lib/jenkins/jobs/Lucene-Solr-tests-master/workspace/solr/contrib/extraction/src/test-files/extraction/simple.html</str></arr><arr name="t_dc_title"><str>Welcome to Solr</str></arr><arr name="t_content_encoding"><str>ISO-8859-1</str></arr><arr name="title"><str>Welcome to Solr</str></arr><arr name="t_abcxyz"><str>text/html; charset=ISO-8859-1</str></arr><arr name="t_content"><str> [junit4] > [junit4] > [junit4] > [junit4] > [junit4] > [junit4] > [junit4] > [junit4] > [junit4] > [junit4] > [junit4] > Welcome to Solr [junit4] > [junit4] > [junit4] > [junit4] > Here is some text [junit4] > [junit4] > distinct [junit4] > words [junit4] > Here is some text in a div [junit4] > This has a link . [junit4] > </str></arr><arr name="multiDefault"><str>muLti-Default</str></arr><int name="intDefault">42</int><date name="timestamp">2017-03-03T09:02:52.622Z</date></doc><doc><arr name="t_meta"><str>stream_size</str><str>365</str><str>X-Parsed-By</str><str>org.apache.tika.parser.DefaultParser</str><str>X-Parsed-By</str><str>org.apache.tika.parser.html.HtmlParser</str><str>stream_content_type</str><str>application/xml</str><str>stream_name</str><str>simple.html</str><str>stream_source_info</str><str>file:/var/lib/jenkins/jobs/Lucene-Solr-tests-master/workspace/solr/contrib/extraction/src/test-files/extraction/simple.html</str><str>dc:title</str><str>Welcome to Solr</str><str>Content-Encoding</str><str>ISO-8859-1</str><str>Content-Type</str><str>text/html; charset=ISO-8859-1</str></arr><arr name="t_href"><str>rect</str><str>http://www.apache.org</str></arr><str name="id">simple3</str><arr name="stream_size"><str>365</str></arr><arr name="t_x_parsed_by"><str>org.apache.tika.parser.DefaultParser</str><str>org.apache.tika.parser.html.HtmlParser</str></arr><arr name="stream_content_type"><str>application/xml</str></arr><arr name="stream_name"><str>simple.html</str></arr><arr name="stream_source_info"><str>file:/var/lib/jenkins/jobs/Lucene-Solr-tests-master/workspace/solr/contrib/extraction/src/test-files/extraction/simple.html</str></arr><arr name="t_dc_title"><str>Welcome to Solr</str></arr><arr name="t_content_encoding"><str>ISO-8859-1</str></arr><arr name="title"><str>Welcome to Solr</str></arr><arr name="t_content_type"><str>text/html; charset=ISO-8859-1</str></arr><arr name="t_content"><str> [junit4] > [junit4] > [junit4] > [junit4] > [junit4] > [junit4] > [junit4] > [junit4] > [junit4] > [junit4] > [junit4] > Welcome to Solr [junit4] > [junit4] > [junit4] > [junit4] > Here is some text [junit4] > [junit4] > distinct [junit4] > words [junit4] > Here is some text in a div [junit4] > This has a link . [junit4] > </str></arr><arr name="multiDefault"><str>muLti-Default</str></arr><int name="intDefault">42</int><date name="timestamp">2017-03-03T09:02:52.660Z</date></doc></result> [junit4] > </response> [junit4] > request was:q=t_href:http&qt=standard&start=0&rows=20&version=2.2 [junit4] > at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:912) [...] [junit4] 2> NOTE: test params are: codec=FastCompressingStoredFields(storedFieldsFormat=CompressingStoredFieldsFormat(compressionMode=FAST, chunkSize=6, maxDocsPerChunk=8, blockSize=301), termVectorsFormat=CompressingTermVectorsFormat(compressionMode=FAST, chunkSize=6, blockSize=301)), sim=RandomSimilarity(queryNorm=true): {}, locale=es-HN, timezone=America/Inuvik [junit4] 2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_77 (64-bit)/cpus=16,threads=1,free=423346232,total=485490688 I got 10/10 failures when I beasted the test using Miller's beasting script on master HEAD.
          Hide
          jpountz Adrien Grand added a comment -

          Thanks Steve, I just pushed a fix. Sorry for the noise.

          Show
          jpountz Adrien Grand added a comment - Thanks Steve, I just pushed a fix. Sorry for the noise.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 7453f78b3539c7f4f5fa6e5324b251467ca50644 in lucene-solr's branch refs/heads/master from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7453f78 ]

          LUCENE-6819: Make ExtractingRequestHandlerTest not rely on index-time boosts.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 7453f78b3539c7f4f5fa6e5324b251467ca50644 in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7453f78 ] LUCENE-6819 : Make ExtractingRequestHandlerTest not rely on index-time boosts.

            People

            • Assignee:
              Unassigned
              Reporter:
              jpountz Adrien Grand
            • Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development