Solr / SOLR-7174

DIH should reset TikaEntityProcessor so that it is capable of re-use.

Description

      Downloaded Solr 5.0.0, on a Windows 7 PC. I ran "solr start" and then "solr create -c hn2" to create a new core.

      I want to index a load of epub files that I've got in a directory. So I created a data-import.xml (in solr\hn2\conf):

      <dataConfig>
        <dataSource type="BinFileDataSource" name="bin" />
        <document>
          <entity name="files" dataSource="null" rootEntity="false"
                  processor="FileListEntityProcessor"
                  baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
                  onError="skip"
                  recursive="true">
            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <entity name="documentImport" processor="TikaEntityProcessor"
                    url="${files.fileAbsolutePath}" format="text" dataSource="bin" onError="skip">
              <field column="file" name="fileName"/>
              <field column="Author" name="author" meta="true"/>
              <field column="title" name="title" meta="true"/>
              <field column="text" name="content"/>
            </entity>
          </entity>
        </document>
      </dataConfig>

      In my solrconfig.xml, I added a requestHandler entry to reference my data-import.xml:

      <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
        <lst name="defaults">
          <str name="config">data-import.xml</str>
        </lst>
      </requestHandler>

      I renamed managed-schema to schema.xml, and ensured the following doc fields were set up:

      <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
      <field name="fileName" type="string" indexed="true" stored="true" />
      <field name="author" type="string" indexed="true" stored="true" />
      <field name="title" type="string" indexed="true" stored="true" />

      <field name="size" type="long" indexed="true" stored="true" />
      <field name="lastModified" type="date" indexed="true" stored="true" />

      <field name="content" type="text_en" indexed="false" stored="true" multiValued="false"/>
      <field name="text" type="text_en" indexed="true" stored="false" multiValued="true"/>

      <copyField source="content" dest="text"/>

      I copied all the jars from dist and contrib* into server\solr\lib.

      Stopping and restarting solr then creates a new managed-schema file and renames schema.xml to schema.xml.back.

      All good so far.

      Now I go to the web admin for dataimport (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try to execute a full import.

      But the results show "Requests: 0, Fetched: 58, Skipped: 0, Processed: 1" - i.e. it only adds one document (the very first one) even though it has iterated over 58!

      No errors are reported in the logs.

      I can repeat this on Ubuntu 14.04 using the same steps, so it's not Windows specific.

      -----------------

      If I change the data-import.xml to use FileDataSource and PlainTextEntityProcessor and parse txt files, eg:

      <dataConfig>
        <dataSource type="FileDataSource" name="bin" />
        <document>
          <entity name="files" dataSource="null" rootEntity="false"
                  processor="FileListEntityProcessor"
                  baseDir="c:/Users/gt/Documents/epub" fileName=".*txt">
            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <entity name="documentImport" processor="PlainTextEntityProcessor"
                    url="${files.fileAbsolutePath}" format="text" dataSource="bin">
              <field column="plainText" name="content"/>
            </entity>
          </entity>
        </document>
      </dataConfig>

      This works. So it's a combo of BinFileDataSource and TikaEntityProcessor that is failing.

      On Windows, I ran Process Monitor, and spotted that only the very first epub file is actually being read (repeatedly).

      With verbose and debug on when running the DIH, I get the following response:

      ....
      "verbose-output": [
        "entity:files",
        [
          null,
          "----------- row #1-------------",
          "fileSize", 2609004,
          "fileLastModified", "2015-02-25T11:37:25.217Z",
          "fileAbsolutePath", "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
          "fileDir", "c:\\Users\\gt\\Documents\\epub",
          "file", "issue018.epub",
          null,
          "---------------------------------------------",
          "entity:documentImport",
          [
            "document#1",
            [
              "query", "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
              "time-taken", "0:0:0.0",
              null,
              "----------- row #1-------------",
              "text", "< ... parsed epub text - snip ... >",
              "title", "Issue 18 title",
              "Author", "Author text",
              null,
              "---------------------------------------------"
            ],
            "document#2",
            []
          ],
          null,
          "----------- row #2-------------",
          "fileSize", 4428804,
          "fileLastModified", "2015-02-25T11:37:36.399Z",
          "fileAbsolutePath", "c:\\Users\\gt\\Documents\\epub\\issue019.epub",
          "fileDir", "c:\\Users\\gt\\Documents\\epub",
          "file", "issue019.epub",
          null,
          "---------------------------------------------",
          "entity:documentImport",
          [
            "document#2",
            []
          ],
          null,
          "----------- row #3-------------",
          "fileSize", 2580266,
          "fileLastModified", "2015-02-25T11:37:41.188Z",
          "fileAbsolutePath", "c:\\Users\\gt\\Documents\\epub\\issue020.epub",
          "fileDir", "c:\\Users\\gt\\Documents\\epub",
          "file", "issue020.epub",
          null,
          "---------------------------------------------",
          "entity:documentImport",
          [
            "document#2",
            []
          ],
      ....

      Attachments

      1. SOLR-7174.patch (0.8 kB, Alexandre Rafalovitch)

      Activity

          Alexandre Rafalovitch added a comment -

          It looks like the TikaEntityProcessor is not capable of re-entry. This is only triggered when it is an inner entity. The title of the JIRA should probably be renamed.

          The cause is a flag *done* that is set to false in firstInit, set to true at the end of the first run, and never reset before a second (reused) run.

          One solution is to override the init method (and not just firstInit) and move the resetting of the flag there.
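The re-entry bug and the proposed fix can be illustrated with a minimal, self-contained sketch. The class and method names below loosely mirror the DIH EntityProcessorBase API (init, firstInit, nextRow, the *done* flag), but they are simplified stand-ins, not the actual Solr source:

```java
// Minimal stand-ins illustrating the re-entry bug: firstInit runs only once
// per processor instance, while init runs once per outer row when the entity
// is nested. A flag reset only in firstInit therefore survives across reuse.

class Context {}

abstract class EntityProcessorBase {
    private boolean initialized = false;

    // Called once per run of the entity (once per outer row when nested).
    public void init(Context context) {
        if (!initialized) {
            initialized = true;
            firstInit(context);
        }
    }

    // Called only on the very first init of this processor instance.
    protected void firstInit(Context context) {}

    // Returns the next parsed row, or null when this run is exhausted.
    public abstract String nextRow();
}

class BuggyTikaProcessor extends EntityProcessorBase {
    protected boolean done = false;

    @Override
    protected void firstInit(Context context) {
        done = false; // reset happens only on the first run
    }

    @Override
    public String nextRow() {
        if (done) return null; // second and later runs bail out here
        done = true;
        return "parsed-document";
    }
}

class FixedTikaProcessor extends BuggyTikaProcessor {
    @Override
    public void init(Context context) {
        super.init(context);
        done = false; // the fix: reset the flag on every (re)init
    }
}
```

Driving each processor through two init/nextRow cycles shows the difference: the buggy version returns a row only for the first outer file because *done* stays true, while the fixed version yields a row on every reuse.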
          Alexandre Rafalovitch added a comment -

          Proposed patch that allows TikaEntityProcessor to reset correctly on reuse.
          Gary Taylor added a comment -

          Patch tested OK. I can now use the above DIH and schema config to index multiple epub docs via TikaEntityProcessor. Thanks!
          Noble Paul added a comment -

          The patch looks fine to me. I shall commit this soon.
          ASF subversion and git services added a comment -

          Commit 1663857 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1663857 ]

          SOLR-7174: DIH should reset TikaEntityProcessor so that it is capable of re-use
          ASF subversion and git services added a comment -

          Commit 1663858 from Noble Paul in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1663858 ]

          SOLR-7174: DIH should reset TikaEntityProcessor so that it is capable of re-use
          Shalin Shekhar Mangar added a comment -

          This issue has been added in the "Other Changes" section of CHANGES.txt. It should be put under the "Bug Fixes" section.
          Noble Paul added a comment -

          It's not a bug.
          Gary Taylor added a comment -

          Patch verified OK.
          Alexandre Rafalovitch added a comment -

          This may actually be a regression; see SOLR-7222. Which means we need to change CHANGES.txt, but also that something else may be affected.

          So it is either the Tika upgrade that did it, or something in DIH. Possibly related to the RecursiveParserWrapper mentioned in SOLR-7189.
          Tim Allison added a comment -

          Could be Tika, but it isn't RecursiveParserWrapper. That has to be called in the invoking code (e.g. it isn't under the hood of AutoDetectParser), and it would wrap AutoDetectParser or the user-configured parser.

            People

            • Assignee: Noble Paul
            • Reporter: Gary Taylor
            • Votes: 1
            • Watchers: 7