[TIKA-2532] Output for PDF file contains X-TIKA:content that is a PDF fragment - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Not A Problem
Affects Version/s: 1.15, 1.16, 1.17
Fix Version/s: None
Component/s: parser
Labels: None
Environment: Ubuntu 64 bit, JDK 1.8

Description

I have a PDF file that produces two elements in the recursive JSON output. The first element is the extracted text, as expected. The second element appears to be a fragment of a PDF file rather than extracted text.

The start of the second element in the JSON output is:
{
"Content-Encoding": "ISO-8859-1",
"Content-Length": "-1",
"Content-Type": "text/plain; charset\u003dISO-8859-1",
"X-Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.txt.TXTParser"
],
"X-TIKA:content": "\u003c\u003c\n /ASCII85EncodePages false\n /AllowTransparency false\n /AutoPositionEPSFiles true\n /AutoRotatePages /None\n /Binding /Left\n /CalGrayProfile (Gray Gamma 2.2)\n /CalRGBProfile (sRGB IEC61966-2.1)\n /CalCMYKProfile (U.S. Web Coated \\050SWOP\\051 v2)\n /sRGBProfile (sRGB IEC61966-2.1)\n /CannotEmbedFontPolicy /Warning\n /CompatibilityLevel 1.4\n /CompressObjects /Off\n /CompressPages true\n /ConvertImagesToIndexed true\n /PassThroughJPEGImages true\n /CreateJobTicket false\n /DefaultRenderingIntent /Default\n /DetectBlends true\n /DetectCurves 0.0000\n /ColorConversionStrategy /LeaveColorUnchanged\n /DoThumbnails true\n /EmbedAllFonts true\n /EmbedOpenType false\n /ParseICCProfilesInComments true\n /EmbedJobOptions true\n /DSCReportingLevel 0\n /EmitDSCWarnings false\n /EndPage 1\n /ImageMemory 1048576\n /LockDistillerParams true\n /MaxSubsetPct 100\n /Optimize true\n /OPM 0\n /ParseDSCComments false\n /ParseDSCCommentsForDocInfo false\n /PreserveCopyPage true\n /PreserveDICMYKValues true\n /PreserveEPSInfo false\n /PreserveFlatness true\n /PreserveHalftoneInfo true\n /PreserveOPIComments false\n /PreserveOverprintSettings true\n /StartPage 1\n /SubsetFonts true\n /TransferFunctionInfo /Remove\n /UCRandBGInfo /Preserve\n /UsePrologue false\n /ColorSettingsFile ()\n /AlwaysEmbed [ true\n /AbadiMT-CondensedLight\n /ACaslon-Italic\n /ACaslon
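
The metadata quoted above reports a Content-Type of text/plain and lists TXTParser under X-Parsed-By, which suggests the second element was handled as a separate plain-text item rather than as the body of the PDF. As a rough way to check this for every element, the sketch below summarizes each entry of a recursive JSON result by content type, last parser, and a short content preview. It assumes the output is a JSON array of metadata objects keyed as in the excerpt; the helper name summarize_entries, the SAMPLE data, and the first sample entry are illustrative and not taken from the attached file.

```python
import json

# Illustrative stand-in for recursive JSON output: the first entry is invented,
# the second is abbreviated from the metadata quoted in this report.
SAMPLE = json.dumps([
    {
        "Content-Type": "application/pdf",
        "X-Parsed-By": ["org.apache.tika.parser.DefaultParser",
                        "org.apache.tika.parser.pdf.PDFParser"],
        "X-TIKA:content": "extracted body text ...",
    },
    {
        "Content-Type": "text/plain; charset=ISO-8859-1",
        "X-Parsed-By": ["org.apache.tika.parser.DefaultParser",
                        "org.apache.tika.parser.txt.TXTParser"],
        "X-TIKA:content": "<<\n /ASCII85EncodePages false\n /AllowTransparency false\n",
    },
])


def summarize_entries(recursive_json, preview_chars=30):
    """Return one (index, content_type, last_parser, preview) tuple per entry."""
    rows = []
    for index, meta in enumerate(json.loads(recursive_json)):
        parsers = meta.get("X-Parsed-By", [])
        if isinstance(parsers, str):  # tolerate a single value stored as a plain string
            parsers = [parsers]
        # Keep only the simple class name of the last parser in the chain.
        last_parser = parsers[-1].rsplit(".", 1)[-1] if parsers else None
        content = meta.get("X-TIKA:content") or ""
        # Collapse whitespace so the preview fits on one line.
        preview = " ".join(content.split())[:preview_chars]
        rows.append((index, meta.get("Content-Type"), last_parser, preview))
    return rows


for row in summarize_entries(SAMPLE):
    print(row)
```

An entry whose last parser differs from the one used for the first entry is a candidate for an item embedded in the container document, and can then be inspected separately.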

Attachments

A_latent_topic_model_for_complete_entity.pdf
20/Dec/17 01:16
956 kB
Trevor Yann

People

Assignee: Unassigned

Reporter: Trevor Yann

Votes: 0

Watchers: 2

Dates

Created: 20/Dec/17 01:15

Updated: 20/Dec/17 15:33

Resolved: 20/Dec/17 15:33