Solr / SOLR-1856

In Solr Cell, literals should override Tika-parsed values

    Details

      Description

      I propose that ExtractingRequestHandler / SolrCell literals should take precedence over Tika-parsed metadata in all situations, including where multiValued="true". (Compare SOLR-1633?)

      My personal motivation is that I have several fields (e.g. "title", "date") where my own metadata is much superior to what Tika offers, and I want to throw those Tika values away. (I actually wouldn't mind throwing away all Tika-parsed values, but let's set that aside.) SOLR-1634 is one potential approach to this, but the fix here might be simpler.

      I'll attach a patch shortly.

      1. SOLR-1856.patch (5 kB, Chris Harris)
      2. SOLR-1856.patch (8 kB, Jan Høydahl)

          Activity

          simon.endele Simon Endele added a comment -

          Did so, see SOLR-5375.

          janhoy Jan Høydahl added a comment -

          Please file a new bug report if you believe this is something that should be fixed.

          simon.endele Simon Endele added a comment -

          While debugging the code (Solr 4.4.0), I found that the "lowernames" parameter is not taken into account.
          The request "lowernames=true&literalsOverride=true&literal.url=myurl" still raises an org.apache.solr.common.SolrException, "ERROR: multiple values encountered for non multiValued field url: [.., ..]", when a URL is extracted from the metadata of the binary.
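
One way to picture the report above is a toy model in which the override check looks at the raw Tika metadata name while the stored field name has already been lowercased. This is only a sketch of a possible mechanism, not Solr's code: the function `build_fields`, its `normalize_before_check` flag, and the sample metadata are all hypothetical, and the real lowernames option does more than lowercase.

```python
def build_fields(tika_metadata, literals, lowernames, normalize_before_check):
    """Toy model of combining literals with Tika metadata (hypothetical)."""
    # Literals are always added to the document first.
    fields = {name: [value] for name, value in literals.items()}
    for raw_name, value in tika_metadata.items():
        # Simplification: only lowercasing is modelled here.
        name = raw_name.lower() if lowernames else raw_name
        # Which name is compared against the literal names?
        checked = name if normalize_before_check else raw_name
        if checked in literals:
            continue  # the literal wins, the Tika value is dropped
        fields.setdefault(name, []).append(value)
    return fields


tika = {"URL": "http://example.com/from-binary"}  # made-up metadata
literals = {"url": "myurl"}

# Check against the raw name: "URL" is not a literal name, so both values land in "url".
mismatch = build_fields(tika, literals, lowernames=True, normalize_before_check=False)

# Check against the normalized name: only the literal survives.
consistent = build_fields(tika, literals, lowernames=True, normalize_before_check=True)
```

In this model the first call yields two values for "url", which is the shape of input that would trigger a "multiple values encountered for non multiValued field" error on a single-valued field.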

          hossman Hoss Man added a comment -

          hoss20120711-manual-post-40alpha-change

          janhoy Jan Høydahl added a comment -

          Committed to trunk r1354455 and branch_4x r1354460

          janhoy Jan Høydahl added a comment -

          Updated patch for trunk, with /trunk as base, not /solr.

          I added the request param literalsOverride=true|false which defaults to true, and documented it at http://wiki.apache.org/solr/ExtractingRequestHandler

          I think this is ready for commit; I will then backport to 4.x.
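
For readers who want to try the parameter, here is a minimal sketch of building an extract request URL that passes literals together with literalsOverride. The base URL, the `/update/extract` handler path, the helper name `extract_url`, and the field values are assumptions for illustration; only the `literal.*` and `literalsOverride` parameter names come from this issue. The sketch builds the URL and does not send anything.

```python
from urllib.parse import urlencode

# Assumed local Solr instance; adjust to your own setup.
SOLR_BASE = "http://localhost:8983/solr"


def extract_url(literals, literals_override=True, **extra_params):
    """Build a query string for the extracting handler (hypothetical helper)."""
    # Each literal is passed as literal.<fieldname>=<value>.
    params = {f"literal.{name}": value for name, value in literals.items()}
    # The issue states the parameter defaults to true; it is spelled out here for clarity.
    params["literalsOverride"] = "true" if literals_override else "false"
    params.update(extra_params)
    return f"{SOLR_BASE}/update/extract?{urlencode(params)}"


url = extract_url({"id": "doc1", "title": "My curated title"})
print(url)
```

The document itself would still have to be posted to that URL as the request body by whatever HTTP client you use.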

          ravish Ravish Bhagdev added a comment -

          This will be very useful.

          lancenorskog Lance Norskog added a comment -

          Drop 1803 also

          ryguasu Chris Harris added a comment -

          Initial patch. Notes:

          • We allow literal values to override all other Tika/SolrCell stuff, including 1) fields in the Tika metadata object, 2) the Tika content field, and 3) any "captured content" fields
          • Currently literalValuesOverrideOtherValues is always true. This could be made a config option, but my intuition so far is that it's not worth the complication.
          • Includes an initial unit test
          • Interestingly, all the old (and unmodified) unit tests still pass.
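
The override behaviour described in these notes can be sketched as a small merge over field maps. This is an illustration of the intended semantics only, not code from the patch: `merge_fields`, the sample field names, and the ordering of values in the non-override case are assumptions.

```python
def merge_fields(extracted, literals, literals_override=True):
    """Combine extracted values with caller-supplied literals (toy model).

    Both arguments map a field name to a list of values.
    """
    merged = {name: list(values) for name, values in extracted.items()}
    for name, values in literals.items():
        if literals_override or name not in merged:
            # The literal replaces whatever was extracted for this field.
            merged[name] = list(values)
        else:
            # Without override, literal and extracted values are both kept.
            merged[name] = list(values) + merged[name]
    return merged


extracted = {
    "title": ["Untitled-1"],       # stands in for a Tika metadata field
    "content": ["body text ..."],  # stands in for the Tika content field
    "h1": ["Captured heading"],    # stands in for a "captured content" field
}
literals = {"title": ["My curated title"]}

overridden = merge_fields(extracted, literals)
combined = merge_fields(extracted, literals, literals_override=False)
```

With override on, "title" holds only the literal; with it off, "title" holds two values, which a field that is not multiValued cannot accept.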

            People

            • Assignee: Jan Høydahl (janhoy)
            • Reporter: Chris Harris (ryguasu)
            • Votes: 2
            • Watchers: 2
