Tika / TIKA-2532

Output for PDF file contains X-TIKA:content that is a PDF fragment

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 1.15, 1.16, 1.17
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None
    • Environment: Ubuntu 64 bit, JDK 1.8

    Description

      I have a PDF file for which the recursive JSON output contains two elements. The first element is text, as expected. The second element seems to be a fragment of a PDF file rather than extracted text.

      The start of the second element in the JSON output is:
      {
        "Content-Encoding": "ISO-8859-1",
        "Content-Length": "-1",
        "Content-Type": "text/plain; charset\u003dISO-8859-1",
        "X-Parsed-By": [
          "org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.txt.TXTParser"
        ],
        "X-TIKA:content": "\u003c\u003c\n /ASCII85EncodePages false\n /AllowTransparency false\n /AutoPositionEPSFiles true\n /AutoRotatePages /None\n /Binding /Left\n /CalGrayProfile (Gray Gamma 2.2)\n /CalRGBProfile (sRGB IEC61966-2.1)\n /CalCMYKProfile (U.S. Web Coated \\050SWOP\\051 v2)\n /sRGBProfile (sRGB IEC61966-2.1)\n /CannotEmbedFontPolicy /Warning\n /CompatibilityLevel 1.4\n /CompressObjects /Off\n /CompressPages true\n /ConvertImagesToIndexed true\n /PassThroughJPEGImages true\n /CreateJobTicket false\n /DefaultRenderingIntent /Default\n /DetectBlends true\n /DetectCurves 0.0000\n /ColorConversionStrategy /LeaveColorUnchanged\n /DoThumbnails true\n /EmbedAllFonts true\n /EmbedOpenType false\n /ParseICCProfilesInComments true\n /EmbedJobOptions true\n /DSCReportingLevel 0\n /EmitDSCWarnings false\n /EndPage 1\n /ImageMemory 1048576\n /LockDistillerParams true\n /MaxSubsetPct 100\n /Optimize true\n /OPM 0\n /ParseDSCComments false\n /ParseDSCCommentsForDocInfo false\n /PreserveCopyPage true\n /PreserveDICMYKValues true\n /PreserveEPSInfo false\n /PreserveFlatness true\n /PreserveHalftoneInfo true\n /PreserveOPIComments false\n /PreserveOverprintSettings true\n /StartPage 1\n /SubsetFonts true\n /TransferFunctionInfo /Remove\n /UCRandBGInfo /Preserve\n /UsePrologue false\n /ColorSettingsFile ()\n /AlwaysEmbed [ true\n /AbadiMT-CondensedLight\n /ACaslon-Italic\n /ACaslon
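
      One way to narrow this down is to list, for each element of the recursive JSON output, its content type, the parsers that handled it, and the start of its content. The sketch below uses only the Python standard library. The function name `summarize_entries` and the sample data are hypothetical; the sample is a heavily shortened stand-in for the real output, and the metadata of its first entry is assumed rather than taken from the actual file.

```python
import json


def summarize_entries(recursive_json):
    """Summarize each element of a recursive JSON output (a list of metadata objects)."""
    entries = json.loads(recursive_json)
    summary = []
    for index, entry in enumerate(entries):
        parsers = entry.get("X-Parsed-By", [])
        # A single-valued field may be serialized as a plain string.
        if isinstance(parsers, str):
            parsers = [parsers]
        content = entry.get("X-TIKA:content") or ""
        summary.append({
            "index": index,
            "content_type": entry.get("Content-Type"),
            "parsers": parsers,
            # Heuristic only: content that opens with "<<" looks like a
            # dictionary-style fragment rather than ordinary prose.
            "looks_like_dictionary": content.lstrip().startswith("<<"),
            "preview": content.strip()[:40],
        })
    return summary


# Hypothetical, shortened sample: the first entry is assumed, the second
# mirrors the keys and the opening of the content reported above.
sample = json.dumps([
    {
        "Content-Type": "application/pdf",
        "X-Parsed-By": [
            "org.apache.tika.parser.DefaultParser",
            "org.apache.tika.parser.pdf.PDFParser",
        ],
        "X-TIKA:content": "Extracted page text",
    },
    {
        "Content-Type": "text/plain; charset=ISO-8859-1",
        "X-Parsed-By": [
            "org.apache.tika.parser.DefaultParser",
            "org.apache.tika.parser.txt.TXTParser",
        ],
        "X-TIKA:content": "<<\n /ASCII85EncodePages false\n /AllowTransparency false\n",
    },
])

for row in summarize_entries(sample):
    print(row["index"], row["content_type"], row["parsers"][-1], row["looks_like_dictionary"])
```

      An element handled by `TXTParser` with a `text/plain` content type was treated as plain text, so its content is passed through as-is. That would be consistent with the second element being a separate text item inside the PDF rather than page text, though the output above does not confirm this on its own.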

          People

            Assignee: Unassigned
            Reporter: Trevor Yann (tyann)