Solr / SOLR-1856

In Solr Cell, literals should override Tika-parsed values

    Details

      Description

      I propose that ExtractingRequestHandler / SolrCell literals should take precedence over Tika-parsed metadata in all situations, including where multiValued="true". (Compare SOLR-1633?)

      My personal motivation is that I have several fields (e.g. "title", "date") where my own metadata is much superior to what Tika offers, and I want to throw those Tika values away. (I actually wouldn't mind throwing away all Tika-parsed values, but let's set that aside.) SOLR-1634 is one potential approach to this, but the fix here might be simpler.

      I'll attach a patch shortly.

      1. SOLR-1856.patch
        8 kB
        Jan Høydahl
      2. SOLR-1856.patch
        5 kB
        Chris Harris

          Activity

          Chris Harris added a comment -

          Initial patch. Notes:

          • We allow literal values to override all other Tika/SolrCell stuff, including 1) fields in the Tika metadata object, 2) the Tika content field, and 3) any "captured content" fields
          • Currently literalValuesOverrideOtherValues is always true. This could be made a config option, but my intuition so far is that it's not worth the complication.
          • Includes an initial unit test
          • Interestingly, all the old (and unmodified) unit tests still pass.
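          The precedence rule described in these notes can be sketched roughly as follows. This is an illustration only, not the patch itself: the function name `merge_fields`, the dict-of-lists representation, and the `literals_override` flag are hypothetical stand-ins, and the ordering of values in the non-override case is an assumption.

```python
def merge_fields(extracted, literals, literals_override=True):
    """Combine extracted values with user-supplied literals, field by field.

    Both arguments map a field name to a list of values. This is a
    hypothetical model of the behaviour discussed in the issue, not
    Solr's actual data structures.
    """
    merged = {}
    for name in extracted.keys() | literals.keys():
        extracted_values = list(extracted.get(name, []))
        literal_values = list(literals.get(name, []))
        if literal_values and literals_override:
            # A literal exists for this field: discard the extracted values.
            merged[name] = literal_values
        else:
            # No literal, or overriding is disabled: keep everything
            # (literal-first ordering is an assumption of this sketch).
            merged[name] = literal_values + extracted_values
    return merged


tika = {"title": ["Title guessed by Tika"], "author": ["Someone"]}
mine = {"title": ["My curated title"]}
result = merge_fields(tika, mine)
```

          With overriding enabled, `result["title"]` holds only the literal, while `author`, which has no literal, keeps its extracted value. With overriding disabled, `title` would end up with two values, which is the situation that fails on a field that is not multiValued.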
          Lance Norskog added a comment -

          Drop 1803 also

          Ravish Bhagdev added a comment -

          This will be very useful.

          Jan Høydahl added a comment -

          Updated patch for trunk, with /trunk as base, not /solr.

          I added the request param literalsOverride=true|false which defaults to true, and documented it at http://wiki.apache.org/solr/ExtractingRequestHandler
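          As a rough sketch of how a client might pass this parameter alongside `literal.*` values, the snippet below only builds the query string. The host, the `/update/extract` handler path, and the helper name `build_extract_url` are assumptions for illustration; check your own solrconfig.xml for the actual handler path.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, literals, literals_override=True):
    """Build an extract-request URL carrying literal.* values.

    base_url is assumed to point at an ExtractingRequestHandler endpoint;
    literals maps field names to single string values.
    """
    # Each literal becomes a "literal.<field>" request parameter.
    params = [("literal." + name, value) for name, value in sorted(literals.items())]
    # literalsOverride defaults to true on the server side; it is sent
    # explicitly here only to make the intent visible.
    params.append(("literalsOverride", "true" if literals_override else "false"))
    return base_url + "?" + urlencode(params)


url = build_extract_url(
    "http://localhost:8983/solr/update/extract",  # assumed host and path
    {"id": "doc1", "title": "My title"},
)
print(url)
```

          Passing `literals_override=False` in this sketch sends `literalsOverride=false`, which turns the override off so that literal and Tika-parsed values are both kept.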

          Think this is ready for commit, will then backport to 4.x

          Jan Høydahl added a comment -

          Committed to trunk r1354455 and branch_4x r1354460

          Hoss Man added a comment -

          hoss20120711-manual-post-40alpha-change

          Simon Endele added a comment -

          While debugging the code (Solr 4.4.0), I found that the "lowernames" parameter is not taken into account.
          The request "lowernames=true&literalsOverride=true&literal.url=myurl" still raises an org.apache.solr.common.SolrException: "ERROR: multiple values encountered for non multiValued field url: [.., ..]" if a URL is extracted from the metadata of the binary.

          Jan Høydahl added a comment -

          Please file a new bug report if you believe this is something that should be fixed.

          Simon Endele added a comment -

          Did so, see SOLR-5375.


            People

            • Assignee:
              Jan Høydahl
              Reporter:
              Chris Harris
            • Votes:
              2
              Watchers:
              2