SOLR-4530

DIH: Provide configuration to use Tika's IdentityHtmlMapper

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.1
    • Fix Version/s: 4.3
    • Labels:
      None

      Description

      When using TikaEntityProcessor in DIH, the default HTML Mapper strips out most of the HTML. That makes sense when the expectation is just to store the extracted content as a text blob, but DIH allows more fine-tuned content extraction (e.g. with a nested XPathEntityProcessor).

      Recent Tika versions allow setting an alternative HTML Mapper implementation that passes all the HTML through. It would be useful to be able to set that implementation from the DIH configuration.

      1. SOLR-4530.patch
        5 kB
        Alexandre Rafalovitch

        Activity

        Alexandre Rafalovitch added a comment - Proposed implementation and tests: https://github.com/arafalov/lucene-solr/commit/bef2f84fd6943241c0f720f17011e5e42d919914
        Hoss Man added a comment -

        Hmmm...

        I applied the patch from this URL to trunk and got a failure in the modified tests...
        https://github.com/arafalov/lucene-solr/commit/bef2f84fd6943241c0f720f17011e5e42d919914.patch

        [junit4:junit4]   2> 2321 T10 oas.SolrTestCaseJ4.tearDown ###Ending testTikaHTMLMapperIdentity
        [junit4:junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestTikaEntityProcessor -Dtests.method=testTikaHTMLMapperIdentity -Dtests.seed=699D812F169C4A5E -Dtests.slow=true -Dtests.locale=el -Dtests.timezone=America/Noronha -Dtests.file.encoding=UTF-8
        [junit4:junit4] ERROR   0.11s J0 | TestTikaEntityProcessor.testTikaHTMLMapperIdentity <<<
        [junit4:junit4]    > Throwable #1: java.lang.RuntimeException: Exception during query
        [junit4:junit4]    > 	at __randomizedtesting.SeedInfo.seed([699D812F169C4A5E:39E205BEDFA8BFA3]:0)
        [junit4:junit4]    > 	at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:524)
        [junit4:junit4]    > 	at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:491)
        [junit4:junit4]    > 	at org.apache.solr.handler.dataimport.TestTikaEntityProcessor.testTikaHTMLMapperIdentity(TestTikaEntityProcessor.java:101)
        ...
        [junit4:junit4]    > Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//str[@name='text'][contains(.,'<H1>')]
        [junit4:junit4]    > 	xml response was: <?xml version="1.0" encoding="UTF-8"?>
        [junit4:junit4]    > <response>
        [junit4:junit4]    > <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="start">0</str><str name="q">*:*</str><str name="qt">standard</str><str name="rows">20</str><str name="version">2.2</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="text">&lt;?xml version="1.0" encoding="UTF-8"?&gt;&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
        [junit4:junit4]    > &lt;head&gt;
        [junit4:junit4]    > &lt;meta name="Content-Encoding" content="ISO-8859-1"/&gt;
        [junit4:junit4]    > &lt;meta name="Content-Type" content="text/html; charset=ISO-8859-1"/&gt;
        [junit4:junit4]    > &lt;meta name="dc:title" content="Title in the header"/&gt;
        [junit4:junit4]    > &lt;title&gt;Title in the header&lt;/title&gt;
        [junit4:junit4]    > &lt;/head&gt;
        [junit4:junit4]    > &lt;body&gt;
        [junit4:junit4]    > &lt;h1&gt;H1 Header&lt;/h1&gt;
        [junit4:junit4]    > 
        [junit4:junit4]    > &lt;div&gt;Basic div&lt;/div&gt;
        [junit4:junit4]    > 
        [junit4:junit4]    > &lt;div class="classAttribute"&gt;Div with attribute&lt;/div&gt;
        [junit4:junit4]    > 
        [junit4:junit4]    > &lt;/body&gt;&lt;/html&gt;</str></doc></result>
        [junit4:junit4]    > </response>
        [junit4:junit4]    > 
        [junit4:junit4]    > 	request was:start=0&q=*:*&qt=standard&rows=20&version=2.2
        [junit4:junit4]    > 	at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:517)
        [junit4:junit4]    > 	... 42 more
        

        ...suggesting maybe the comment about uppercasing/lowercasing tags in Tika isn't consistent across platforms? (Or maybe you previously tested against a slightly different version of Tika and the behavior has changed?)
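
To illustrate why the assertion in the log above fails, here is a minimal sketch using only the JDK's XPath API. It assumes a trimmed-down response document (not the full Solr response), and the class and method names are hypothetical. XPath `contains()` is case-sensitive, so a test that expects `<H1>` cannot match content in which the tag was emitted as `<h1>`.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Solr or the patch.
class XPathCaseCheck {

    // Returns true if the XPath expression selects at least one node in the XML.
    static boolean matches(String xml, String expr) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            InputSource source = new InputSource(new StringReader(xml));
            return (Boolean) xpath.evaluate(expr, source, XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or XML: " + expr, e);
        }
    }

    public static void main(String[] args) {
        // Simplified stand-in for the response: the HTML is stored as escaped text.
        String xml = "<doc><str name=\"text\">"
                + "&lt;body&gt;&lt;h1&gt;H1 Header&lt;/h1&gt;&lt;/body&gt;"
                + "</str></doc>";

        // The expression from the failing test looks for an uppercase tag.
        System.out.println(matches(xml, "//str[@name='text'][contains(.,'<H1>')]"));
        // The same expression with a lowercase tag.
        System.out.println(matches(xml, "//str[@name='text'][contains(.,'<h1>')]"));
    }
}
```

With this input the first check prints `false` and the second prints `true`, which is consistent with the tags arriving in lowercase in the logged response.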

        Alexandre Rafalovitch added a comment -

        Could be a different version of Tika, as I tested it against Solr 4.1 originally. I will retest. Should I be retesting against trunk or against 4.2 (4.2.1? 4.3?) if I want this to make it into a 4.x sub-release?

        Hoss Man added a comment -

        Features are always added to trunk first, and then backported to 4 based on feasibility & stability.

        Alexandre Rafalovitch added a comment -

        The case issue was apparently a bug, fixed in TIKA-869.

        I fixed that and applied changes to trunk. Patch is included, tests seem to pass.

        Alexandre Rafalovitch added a comment -

        Patch against trunk.

        Shalin Shekhar Mangar added a comment -

        Committed to trunk and branch_4x

        Uwe Schindler added a comment -

        Closed after release.


          People

          • Assignee:
            Shalin Shekhar Mangar
            Reporter:
            Alexandre Rafalovitch