ManifoldCF / CONNECTORS-984

Give Tika's metadata some hints

Details

    Description

      Component: Tika connector

      Currently, the trunk code does not set any data on Tika's Metadata object.
      We should probably give the metadata some hints so that Tika can detect the document type and extract its content more reliably:

      • resourceName
      • ContentType
      • stream size
      • charset (new feature)
      • password handling (new feature)
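
      As a rough illustration of the hints listed above, the sketch below collects them under the key names that Tika's Metadata class uses ("resourceName", "Content-Type", "Content-Length", "Content-Encoding"). A plain Map stands in for org.apache.tika.metadata.Metadata so the example runs without Tika on the classpath; the class name TikaHints and the method buildHints are hypothetical and are not part of ManifoldCF or Tika. Password handling is left out, since Tika takes passwords through a PasswordProvider on the ParseContext rather than through metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not ManifoldCF code: gathers parser hints for a document.
final class TikaHints {
    // Key names as used by Tika's Metadata object. The Map below is only a
    // dependency-free stand-in for org.apache.tika.metadata.Metadata.
    static final String RESOURCE_NAME = "resourceName";
    static final String CONTENT_TYPE = "Content-Type";
    static final String CONTENT_LENGTH = "Content-Length";
    static final String CONTENT_ENCODING = "Content-Encoding";

    /** Collects whichever hints are known; null values and negative sizes are skipped. */
    static Map<String, String> buildHints(String fileName, String mimeType,
                                          long size, String charset) {
        Map<String, String> hints = new LinkedHashMap<>();
        if (fileName != null) {
            hints.put(RESOURCE_NAME, fileName);      // helps detection by file extension
        }
        if (mimeType != null) {
            hints.put(CONTENT_TYPE, mimeType);       // declared type from the repository
        }
        if (size >= 0) {
            hints.put(CONTENT_LENGTH, Long.toString(size));
        }
        if (charset != null) {
            hints.put(CONTENT_ENCODING, charset);    // consulted by text and HTML parsers
        }
        return hints;
    }

    public static void main(String[] args) {
        // Charset is unknown here, so no Content-Encoding entry is produced.
        System.out.println(buildHints("report.pdf", "application/pdf", 1024L, null));
    }
}
```

      With Tika available, each entry would be copied onto the real object with Metadata.set(key, value) before the parser is invoked.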

      Also, when a TikaException (e.g. a parsing error in PDFBox/POI) is thrown, we need to decide whether or not to ignore the failure for the document being parsed. Solr Cell has an 'ignoreTikaException' parameter for this. When a TikaException is thrown and the parameter is true, only the metadata is indexed; if it is false, Solr responds with a server error and the document is not indexed.
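
      The two behaviours described above can be sketched as a single flag check around the parse call. This is only an illustration of the decision, not the Solr Cell or ManifoldCF implementation: ParseFailure is a local stand-in for org.apache.tika.exception.TikaException, and ParsePolicy, BodyParser, Result and index are hypothetical names.

```java
import java.util.Map;

// Hypothetical sketch of an "ignore parse failure" switch.
final class ParsePolicy {
    /** Stand-in for TikaException so the example has no dependencies. */
    static class ParseFailure extends Exception {
        ParseFailure(String message) {
            super(message);
        }
    }

    /** Hypothetical callback that extracts the document body. */
    interface BodyParser {
        String parse() throws ParseFailure;
    }

    /** Metadata is always present; body is null when a failure was ignored. */
    static final class Result {
        final Map<String, String> metadata;
        final String body;

        Result(Map<String, String> metadata, String body) {
            this.metadata = metadata;
            this.body = body;
        }
    }

    /**
     * If parsing fails and ignoreParseFailure is true, returns a metadata-only
     * result; otherwise rethrows so the caller can reject the document.
     */
    static Result index(Map<String, String> metadata, BodyParser parser,
                        boolean ignoreParseFailure) throws ParseFailure {
        try {
            return new Result(metadata, parser.parse());
        } catch (ParseFailure e) {
            if (ignoreParseFailure) {
                return new Result(metadata, null);   // index metadata only
            }
            throw e;                                 // document is not indexed
        }
    }
}
```

      Whether the connector should default to ignoring failures is the open question in this issue; the sketch only shows that either choice comes down to one branch in the catch block.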

      Reference (Solr Cell):
      http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/ExtractingDocumentLoader.java?view=markup#l142

          People

            Shinichiro Abe