SOLR-7174

DIH should reset TikaEntityProcessor so that it is capable of re-use.

Details

    Description

      I downloaded Solr 5.0.0 on a Windows 7 PC, ran "solr start", and then ran "solr create -c hn2" to create a new core.

      I want to index a directory full of epub files, so I created a data-import.xml (in solr\hn2\conf):

      <dataConfig>
        <dataSource type="BinFileDataSource" name="bin" />
        <document>
          <entity name="files" dataSource="null" rootEntity="false"
                  processor="FileListEntityProcessor"
                  baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
                  onError="skip"
                  recursive="true">
            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <entity name="documentImport" processor="TikaEntityProcessor"
                    url="${files.fileAbsolutePath}" format="text" dataSource="bin" onError="skip">
              <field column="file" name="fileName"/>
              <field column="Author" name="author" meta="true"/>
              <field column="title" name="title" meta="true"/>
              <field column="text" name="content"/>
            </entity>
          </entity>
        </document>
      </dataConfig>
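
      As a sanity check on what the outer "files" entity should produce, the directory walk can be approximated outside Solr. This is a minimal sketch, not FileListEntityProcessor itself: the function name is hypothetical, and it assumes the fileName regex is applied to the bare file name with search-style matching.

```python
import os
import re


def list_matching_files(base_dir, file_name_regex, recursive=True):
    """Return sorted paths under base_dir whose file name matches the regex.

    Assumption: the regex is matched against the file name only (not the
    full path), using search semantics rather than a full match.
    """
    pattern = re.compile(file_name_regex)
    matches = []
    for root, _dirs, files in os.walk(base_dir):
        matches.extend(os.path.join(root, name) for name in files if pattern.search(name))
        if not recursive:
            # os.walk visits base_dir first, so stopping here skips subdirectories
            break
    return sorted(matches)
```

      Calling list_matching_files("c:/Users/gt/Documents/epub", ".*epub") and counting the result gives a number to compare against the rows the import reports for the "files" entity.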

      In my solrconfig.xml, I added a requestHandler entry to reference my data-import.xml:

      <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
        <lst name="defaults">
          <str name="config">data-import.xml</str>
        </lst>
      </requestHandler>

      I renamed managed-schema to schema.xml, and ensured the following doc fields were set up:

      <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
      <field name="fileName" type="string" indexed="true" stored="true" />
      <field name="author" type="string" indexed="true" stored="true" />
      <field name="title" type="string" indexed="true" stored="true" />

      <field name="size" type="long" indexed="true" stored="true" />
      <field name="lastModified" type="date" indexed="true" stored="true" />

      <field name="content" type="text_en" indexed="false" stored="true" multiValued="false"/>
      <field name="text" type="text_en" indexed="true" stored="false" multiValued="true"/>

      <copyField source="content" dest="text"/>

      I copied all the jars from dist and contrib* into server\solr\lib.

      Stopping and restarting Solr then creates a new managed-schema file and renames schema.xml to schema.xml.bak.

      All good so far.

      Now I go to the web admin for dataimport (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try to execute a full import.
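
      The same full import can also be requested over HTTP rather than through the admin UI. The sketch below only builds the request URL; the helper name is hypothetical, and it assumes the handler is reachable at /solr/hn2/dataimport (matching the requestHandler name and core used here) and that the usual DIH parameters command, clean, commit, verbose and debug apply.

```python
from urllib.parse import urlencode


def dih_url(core, command="full-import", base="http://localhost:8983/solr", **params):
    """Build a DataImportHandler request URL for the given core.

    Assumption: the handler is registered as /dataimport on the core.
    """
    query = {"command": command, "wt": "json"}
    for key, value in params.items():
        # Send booleans as lowercase strings: true / false
        query[key] = str(value).lower() if isinstance(value, bool) else value
    return f"{base}/{core}/dataimport?{urlencode(query)}"


url = dih_url("hn2", clean=True, commit=True, verbose=True, debug=True)
print(url)
```

      Fetching that URL (for example with a browser or curl) should return the same verbose and debug information that the admin UI shows.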

      But the results show "Requests: 0, Fetched: 58, Skipped: 0, Processed:1", i.e. it only adds one document (the very first one) even though it has iterated over 58!
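
      To make the mismatch easy to spot when repeating the import, the summary line can be parsed into counters. This is a small sketch: both function names are hypothetical, and the "silent drop" check is only a heuristic, since rows fetched are not guaranteed to map one-to-one onto documents.

```python
import re


def parse_dih_summary(summary):
    """Turn 'Requests: 0, Fetched: 58, ...' into a dict of integer counters."""
    return {name: int(value) for name, value in re.findall(r"(\w+):\s*(\d+)", summary)}


def looks_like_silent_drop(counts):
    """Heuristic: fewer documents processed than rows fetched, with nothing skipped."""
    return counts["Processed"] < counts["Fetched"] and counts["Skipped"] == 0


counts = parse_dih_summary("Requests: 0, Fetched: 58, Skipped: 0, Processed:1")
print(counts, looks_like_silent_drop(counts))
```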

      No errors are reported in the logs.

      I can repeat this on Ubuntu 14.04 using the same steps, so it's not Windows specific.

      -----------------

      If I change the data-import.xml to use FileDataSource and PlainTextEntityProcessor and parse txt files, e.g.:

      <dataConfig>
        <dataSource type="FileDataSource" name="bin" />
        <document>
          <entity name="files" dataSource="null" rootEntity="false"
                  processor="FileListEntityProcessor"
                  baseDir="c:/Users/gt/Documents/epub" fileName=".*txt">
            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <entity name="documentImport" processor="PlainTextEntityProcessor"
                    url="${files.fileAbsolutePath}" format="text" dataSource="bin">
              <field column="plainText" name="content"/>
            </entity>
          </entity>
        </document>
      </dataConfig>

      This works, so it is the combination of BinFileDataSource and TikaEntityProcessor that is failing.

      On Windows, I ran Process Monitor, and spotted that only the very first epub file is actually being read (repeatedly).

      With verbose and debug on when running the DIH, I get the following response:

      ....
      "verbose-output": [
      "entity:files",
      [
      null,
      "----------- row #1-------------",
      "fileSize",
      2609004,
      "fileLastModified",
      "2015-02-25T11:37:25.217Z",
      "fileAbsolutePath",
      "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
      "fileDir",
      "c:\\Users\\gt\\Documents\\epub",
      "file",
      "issue018.epub",
      null,
      "---------------------------------------------",
      "entity:documentImport",
      [
      "document#1",
      [
      "query",
      "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
      "time-taken",
      "0:0:0.0",
      null,
      "----------- row #1-------------",
      "text",
      "< ... parsed epub text - snip ... >",
      "title",
      "Issue 18 title",
      "Author",
      "Author text",
      null,
      "---------------------------------------------"
      ],
      "document#2",
      []
      ],
      null,
      "----------- row #2-------------",
      "fileSize",
      4428804,
      "fileLastModified",
      "2015-02-25T11:37:36.399Z",
      "fileAbsolutePath",
      "c:\\Users\\gt\\Documents\\epub\\issue019.epub",
      "fileDir",
      "c:\\Users\\gt\\Documents\\epub",
      "file",
      "issue019.epub",
      null,
      "---------------------------------------------",
      "entity:documentImport",
      [
      "document#2",
      []
      ],
      null,
      "----------- row #3-------------",
      "fileSize",
      2580266,
      "fileLastModified",
      "2015-02-25T11:37:41.188Z",
      "fileAbsolutePath",
      "c:\\Users\\gt\\Documents\\epub\\issue020.epub",
      "fileDir",
      "c:\\Users\\gt\\Documents\\epub",
      "file",
      "issue020.epub",
      null,
      "---------------------------------------------",
      "entity:documentImport",
      [
      "document#2",
      []
      ],
      ....
      ....
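
      The output above shows documentImport returning a row for the first file and an empty result for every later one. One pattern that would produce exactly this, and that fits the issue title, is a child processor that marks itself finished after its first row and is never reset when it is re-used for the next parent row. The sketch below is purely illustrative: the class, method names and flag are hypothetical and are not taken from the TikaEntityProcessor source.

```python
class OneShotProcessor:
    """Illustrative child processor that emits exactly one row per input."""

    def __init__(self, reset_on_init):
        self.reset_on_init = reset_on_init
        self.done = False
        self.url = None

    def init(self, url):
        # Called once per parent row with that row's resolved url
        self.url = url
        if self.reset_on_init:
            self.done = False  # makes the instance re-usable

    def next_row(self):
        if self.done:
            return None  # signals "no more rows" for this input
        self.done = True
        return {"text": f"parsed {self.url}"}


def run_import(files, processor):
    """Drive one shared processor instance across all parent rows."""
    docs = []
    for path in files:
        processor.init(path)
        row = processor.next_row()
        while row is not None:
            docs.append(row)
            row = processor.next_row()
    return docs


files = ["issue018.epub", "issue019.epub", "issue020.epub"]
stale = run_import(files, OneShotProcessor(reset_on_init=False))
fresh = run_import(files, OneShotProcessor(reset_on_init=True))
print(len(stale), len(fresh))
```

      Without the reset, only the first file yields a document; with it, each file does. Whether this is the actual cause here would need to be confirmed against the attached patch.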

      Attachments

        1. SOLR-7174.patch
          0.8 kB
          Alexandre Rafalovitch

            People

              noble.paul Noble Paul
              gtinovem Gary Taylor