  1. Tika
  2. TIKA-2869

Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object truncated by 465479)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.20
    • Fix Version/s: 1.21
    • Component/s: app, cli, parser
    • Labels:
      None
    • Environment:

      Windows 10 (1809 - 17763.437)

      Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
      Java HotSpot(TM) Client VM (build 25.121-b13, mixed mode)

      Description

      I could convert the attached PDF with tika-app-1.19.1.jar, but with tika-app-1.20.jar the same conversion now fails with the exception shown in the output that follows:

      java -jar tika-app-1.20.jar 0001.127_342_5_7955.pdf

      mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
      ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be processed.
      See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
      for optional dependencies.

      mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
      ADVERTÊNCIA: org.xerial's sqlite-jdbc is not loaded.
      Please provide the jar on your classpath to parse sqlite files.
      See tika-parsers/pom.xml for the correct version.
      Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.crypto.Pkcs7Parser@1c43f4e
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
      at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209)
      at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496)
      at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)
      Caused by: org.apache.tika.io.TaggedIOException: DEF length 465542 object truncated by 465479
      at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
      at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:63)
      at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:437)
      at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
      at org.apache.tika.parser.crypto.Pkcs7Parser.parse(Pkcs7Parser.java:86)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      ... 5 more
      Caused by: java.io.EOFException: DEF length 465542 object truncated by 465479
      at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
      at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
      at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
      at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
      at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
      at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
      at java.io.BufferedInputStream.read1(Unknown Source)
      at java.io.BufferedInputStream.read(Unknown Source)
      at org.bouncycastle.util.io.Streams.readFully(Unknown Source)
      at org.bouncycastle.cms.CMSTypedStream$FullReaderStream.read(Unknown Source)
      at java.io.BufferedInputStream.fill(Unknown Source)
      at java.io.BufferedInputStream.read(Unknown Source)
      at java.io.FilterInputStream.read(Unknown Source)
      at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:59)
      ... 10 more
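
      A note on reading the innermost message: it appears to follow the pattern "DEF length <declared> object truncated by <missing>", meaning a definite-length ASN.1 object declared 465542 bytes but the stream ended with 465479 of them still unread (so only 63 bytes were actually available). The sketch below is a minimal stdlib-only illustration of that failure mode. It is not Bouncy Castle's or Tika's code; the class TruncatedDefiniteLength and the helper readDefinite are hypothetical names, and the message format is my assumption based on the trace above.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class TruncatedDefiniteLength {

    /**
     * Hypothetical helper: reads exactly declaredLength bytes from the stream.
     * If the stream ends early, it throws an EOFException whose message mimics
     * the one in the stack trace (assumed format, not taken from any library).
     */
    static byte[] readDefinite(InputStream in, int declaredLength) throws IOException {
        byte[] buf = new byte[declaredLength];
        int remaining = declaredLength;
        while (remaining > 0) {
            int n = in.read(buf, declaredLength - remaining, remaining);
            if (n < 0) {
                // Underlying stream is exhausted before the declared length was consumed.
                throw new EOFException(
                        "DEF length " + declaredLength + " object truncated by " + remaining);
            }
            remaining -= n;
        }
        return buf;
    }

    public static void main(String[] args) throws IOException {
        // The header claims 465542 bytes, but only 63 bytes are really there.
        InputStream in = new ByteArrayInputStream(new byte[63]);
        try {
            readDefinite(in, 465542);
        } catch (EOFException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

      If that reading is right, the failure is about the embedded PKCS#7 content being shorter than its declared length, and the question for 1.20 is why the detector now reads far enough into that stream to hit the end.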

       

       

      java -jar tika-app-1.19.1.jar 0001.127_342_5_7955.pdf

      mai 10, 2019 11:26:28 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
      ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be processed.
      See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
      for optional dependencies.

      mai 10, 2019 11:26:28 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
      ADVERTÊNCIA: org.xerial's sqlite-jdbc is not loaded.
      Please provide the jar on your classpath to parse sqlite files.
      See tika-parsers/pom.xml for the correct version.
      <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
      <head>
      <meta name="date" content="2019-03-15T12:36:08Z"/>...CORRECT XML OUTPUT...

        Attachments

        1. 0001.127_342_5_7955.pdf
          455 kB
          Edans Sandes

            People

            • Assignee:
              tallison Tim Allison
            • Reporter:
              edanssandes Edans Sandes
