[SOLR-2116] TikaEntityProcessor does not find parser by default - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.1, 4.0-ALPHA
Component/s: contrib - DataImportHandler, contrib - Solr Cell (Tika extraction)
Labels: None

Description

By default, the TikaEntityProcessor does not find the correct document parser.
This shows up with a two-level DIH config file. I have attached pdflist-data-config.xml and pdflist.xml, the XML file that supplies the list of files to index. To test this, you will need the current 3.x branch or 4.0 trunk.

1. Set up a Tika-enabled Solr.
2. Copy any PDF file to /tmp/testfile.pdf.
3. Copy pdflist-data-config.xml to your solr/conf directory.
4. Add this snippet to your solrconfig.xml:

<requestHandler name="/pdflist"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">pdflist-data-config.xml</str>
  </lst>
</requestHandler>

Requesting http://localhost:8983/solr/pdflist?command=full-import creates one document with the id and text fields populated. If you remove this line:

 parser="org.apache.tika.parser.pdf.PDFParser"

from the TikaEntityProcessor entity, the parser will not be found and you will get a document with the "id" field and nothing else.
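
For readers without the attachments, the following is a minimal sketch of what a two-level DIH config of this kind might look like. It is not the attached pdflist-data-config.xml: the entity names, the data source names, the /tmp/pdflist.xml location, and the assumed layout of pdflist.xml (a <files> root with <file> children, each holding <id> and <path>) are all illustrative assumptions. Only the parser attribute value and the id and text fields come from the report above.

```xml
<!-- Hypothetical sketch, not the attached file. Names and paths are assumptions. -->
<dataConfig>
  <!-- Character data source for the outer XML file list -->
  <dataSource name="list-src" type="FileDataSource" encoding="UTF-8"/>
  <!-- Binary data source that hands raw file streams to Tika -->
  <dataSource name="bin-src" type="BinFileDataSource"/>
  <document>
    <!-- Level 1: one row per <file> element in the assumed pdflist.xml layout -->
    <entity name="list" dataSource="list-src"
            processor="XPathEntityProcessor"
            url="/tmp/pdflist.xml" forEach="/files/file">
      <field column="id" xpath="/files/file/id"/>
      <field column="path" xpath="/files/file/path"/>
      <!-- Level 2: extract text from each listed file.
           Removing the parser attribute below is what triggers the bug. -->
      <entity name="tika" dataSource="bin-src"
              processor="TikaEntityProcessor"
              url="${list.path}" format="text"
              parser="org.apache.tika.parser.pdf.PDFParser">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

With a config shaped like this, the outer entity supplies the id field and the inner entity supplies the text field, which matches the symptom described: when the parser is not found, only the id field reaches the document.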

Attachments

pdflist.xml (0.1 kB), 09/Sep/10 00:23, Lance Norskog
pdflist-data-config.xml (0.9 kB), 09/Sep/10 00:23, Lance Norskog
SOLR-2116.patch (3 kB), 03/Jan/11 22:21, Martijn van Groningen

Issue Links

is duplicated by

SOLR-2101 TikaEntityProcessor does not extract files- does not pick parser correctly (Closed)

People

Assignee: Chris M. Hostetter
Reporter: Lance Norskog
Votes: 1
Watchers: 1

Dates

Created: 09/Sep/10 00:22
Updated: 30/Mar/11 15:45
Resolved: 19/Feb/11 01:57