SOLR-2116: TikaEntityProcessor does not find parser by default


Details

    Description

      By default, the TikaEntityProcessor does not find the correct document parser.
      The problem shows up with a two-level DIH (DataImportHandler) config file. I have attached pdflist-data-config.xml and pdflist.xml, the XML file that supplies the list of files to import. To test this, you will need the current 3.x branch or 4.0 trunk.

      1. Set up a Tika-enabled Solr.
      2. Copy any PDF file to /tmp/testfile.pdf.
      3. Copy pdflist-data-config.xml to your solr/conf.
      4. Add this snippet to your solrconfig.xml:
        <requestHandler name="/pdflist"
                        class="org.apache.solr.handler.dataimport.DataImportHandler">
          <lst name="defaults">
            <str name="config">pdflist-data-config.xml</str>
          </lst>
        </requestHandler>
        

      Requesting http://localhost:8983/solr/pdflist?command=full-import creates one document with the id and text fields populated. If you remove this line:

       parser="org.apache.tika.parser.pdf.PDFParser"
      

      from the TikaEntityProcessor entity, the parser will not be found and you will get a document with the "id" field and nothing else.
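
      For readers without the attachments, here is a minimal sketch of what a two-level DIH config of this kind might look like. It is not the attached pdflist-data-config.xml. The data source names, entity names, the /tmp/pdflist.xml location, and the layout of the file list are all assumptions made for illustration. Only the parser attribute and the id and text fields come from this report.

      ```xml
      <dataConfig>
        <!-- Assumed: one data source for the XML file list, one for binary PDFs -->
        <dataSource name="list" type="FileDataSource" encoding="UTF-8"/>
        <dataSource name="bin" type="BinFileDataSource"/>
        <document>
          <!-- Outer level: reads a hypothetical file list shaped like
               <files><file>/tmp/testfile.pdf</file></files> -->
          <entity name="pdflist" processor="XPathEntityProcessor" dataSource="list"
                  url="/tmp/pdflist.xml" forEach="/files/file" rootEntity="false">
            <field column="id" xpath="/files/file"/>
            <!-- Inner level: hands each listed file to Tika. Removing the
                 parser attribute is the change that triggers this issue. -->
            <entity name="pdf" processor="TikaEntityProcessor" dataSource="bin"
                    url="${pdflist.id}" format="text"
                    parser="org.apache.tika.parser.pdf.PDFParser">
              <field column="text" name="text"/>
            </entity>
          </entity>
        </document>
      </dataConfig>
      ```

      In this sketch the outer entity supplies the file path as the id, and the inner entity is expected to fill in the text field. That matches the symptom described above: when the parser is not found, only the id is populated.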

      Attachments

        1. pdflist.xml
          0.1 kB
          Lance Norskog
        2. pdflist-data-config.xml
          0.9 kB
          Lance Norskog
        3. SOLR-2116.patch
          3 kB
          Martijn van Groningen

            People

              hossman Chris M. Hostetter
              lancenorskog Lance Norskog