TIKA-2351

Getting error while parsing documents

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.14
    • Fix Version/s: None
    • Component/s: general
    • Labels:
    • Environment:

      Red Hat Enterprise Linux Server release 7.3
      ElasticSearch 5.2.1
      ingest-attachment 5.2.1

    • Docs Text:
      any docs other than .txt

      Description

      Hi Everyone,

      I am using the ingest-attachment plugin to index documents. I am able to parse text documents (.txt files), but when I try to parse .doc or .pdf files I get the error below.

      FILE = /elastic/files/englishAnalyzer.doc
      ID = 6

      {
        "error" : {
          "root_cause" : [
            {
              "type" : "exception",
              "reason" : "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079]; nested: ArrayIndexOutOfBoundsException[-1];",
              "header" : {
                "processor_type" : "attachment"
              }
            }
          ],
          "type" : "exception",
          "reason" : "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079]; nested: ArrayIndexOutOfBoundsException[-1];",
          "caused_by" : {
            "type" : "illegal_argument_exception",
            "reason" : "ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079]; nested: ArrayIndexOutOfBoundsException[-1];",
            "caused_by" : {
              "type" : "parse_exception",
              "reason" : "Error parsing document in field [data]",
              "caused_by" : {
                "type" : "tika_exception",
                "reason" : "Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079",
                "caused_by" : {
                  "type" : "array_index_out_of_bounds_exception",
                  "reason" : "-1"
                }
              }
            }
          },
          "header" : {
            "processor_type" : "attachment"
          }
        },
        "status" : 500
      }
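
      The response nests each exception inside the previous one's "caused_by" object, so the innermost entry is the one that names the actual failure. As a reading aid, here is a minimal Python sketch that walks that chain. The helper name root_cause is hypothetical, and the sample is a trimmed copy of the response above with the long "reason" strings shortened.

```python
import json


def root_cause(error):
    """Follow nested "caused_by" links and return the innermost (type, reason)."""
    node = error
    while isinstance(node.get("caused_by"), dict):
        node = node["caused_by"]
    return node.get("type"), node.get("reason")


# Trimmed copy of the error response; outer "reason" strings are shortened.
response = json.loads("""
{
  "error": {
    "type": "exception",
    "reason": "java.lang.IllegalArgumentException (shortened)",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "ElasticsearchParseException (shortened)",
      "caused_by": {
        "type": "parse_exception",
        "reason": "Error parsing document in field [data]",
        "caused_by": {
          "type": "tika_exception",
          "reason": "Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079",
          "caused_by": {
            "type": "array_index_out_of_bounds_exception",
            "reason": "-1"
          }
        }
      }
    }
  },
  "status": 500
}
""")

cause_type, cause_reason = root_cause(response["error"])
print(cause_type, cause_reason)
```

      For this response the chain ends at the ArrayIndexOutOfBoundsException raised under Tika's OfficeParser, which is why the issue is filed against Tika rather than against the ingest pipeline itself.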

      Please help me resolve this issue.
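
      One thing that may be worth ruling out while triaging: the attachment processor expects its source field (here "data", per the error message) to hold the file's raw bytes encoded as base64. The sketch below is only an illustration of that encoding step using the Python standard library; it is not the reporter's code from the attached Json_creat_code.txt. The function names are hypothetical, and the sample bytes are just the 8-byte OLE2 header that legacy .doc files start with, standing in for a real file read in binary mode.

```python
import base64
import json


def build_attachment_doc(raw_bytes, field="data"):
    """Return a JSON body whose `field` holds raw_bytes encoded as base64."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return json.dumps({field: encoded})


def decodes_back(body, raw_bytes, field="data"):
    """Check that the base64 payload in `body` decodes to the original bytes."""
    payload = json.loads(body)[field]
    return base64.b64decode(payload, validate=True) == raw_bytes


# Stand-in for open(path, "rb").read(): the OLE2 magic bytes of a .doc file.
sample = bytes.fromhex("d0cf11e0a1b11ae1")

body = build_attachment_doc(sample)
print(body)
print(decodes_back(body, sample))
```

      If the payload round-trips cleanly like this and the parse still fails, the input encoding is probably not the culprit and the problem is more likely in how the parser handles this particular file.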

        Attachments

        1. 04 - stackTrace.txt
          5 kB
          VENU
        2. 03 - Json_creat_code.txt
          1 kB
          VENU
        3. 02 - Pipeline.txt
          0.4 kB
          VENU
        4. 01 - Templete.txt
          4 kB
          VENU
        5. englishAnalyzer.doc
          26 kB
          VENU

            People

            • Assignee:
              Unassigned
              Reporter:
              venuambati VENU
            • Votes:
              0
              Watchers:
              2