Tika / TIKA-3070

Null bytes in extracted metadata

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.23
    • Fix Version/s: None
    • Component/s: server
    • Labels: None
    • Environment: Docker image: apache/tika:1.23

    Description

      Both the /rmeta/text and /unpack/all endpoints return null bytes (\u0000) in extracted metadata values.

       

      Note the trailing null in "pdf:docinfo:producer": "Adobe PSL 1.2e for Canon\u0000"

       

      $ curl -T Technical_manual.pdf http://localhost:9998/rmeta/text 
      
      [{
        "Content-Type": "application/pdf",
        "Creation-Date": "2018-08-21T09:40:33Z",
        "X-Parsed-By": [
          "org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.pdf.PDFParser"
        ],
        "X-TIKA:embedded_depth": "0",
        "X-TIKA:parse_time_millis": "42",
        "access_permission:assemble_document": "true",
        "access_permission:can_modify": "true",
        "access_permission:can_print": "true",
        "access_permission:can_print_degraded": "true",
        "access_permission:extract_content": "true",
        "access_permission:extract_for_accessibility": "true",
        "access_permission:fill_in_form": "true",
        "access_permission:modify_annotations": "true",
        "dc:format": "application/pdf; version\u003d1.4",
        "dcterms:created": "2018-08-21T09:40:33Z",
        "meta:creation-date": "2018-08-21T09:40:33Z",
        "pdf:PDFVersion": "1.4",
        "pdf:charsPerPage": [
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0"
        ],
        "pdf:docinfo:created": "2018-08-21T09:40:33Z",
        "pdf:docinfo:creator_tool": "Canon iR-ADV C5235  PDF",
        "pdf:docinfo:producer": "Adobe PSL 1.2e for Canon\u0000",
        "pdf:encrypted": "false",
        "pdf:hasXFA": "false",
        "pdf:hasXMP": "true",
        "pdf:unmappedUnicodeCharsPerPage": [
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0"
        ],
        "xmp:CreatorTool": "Canon iR-ADV C5235  PDF",
        "xmpMM:DocumentID": "uuid:03e07b5b-0000-f481-39c4-e94700000000",
        "xmpTPg:NPages": "31"
      }]
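
To make the affected fields easier to spot in long responses like the one above, here is a minimal sketch that scans parsed /rmeta output for values containing U+0000. The helper name `find_null_fields` and the shortened sample record are illustrative assumptions, not part of Tika; the sample reuses three fields from the output above.

```python
import json


def find_null_fields(records):
    """Return (record_index, key) pairs whose value contains U+0000.

    `records` is the parsed JSON list returned by /rmeta; values may be
    plain strings or lists of strings.
    """
    hits = []
    for i, record in enumerate(records):
        for key, value in record.items():
            values = value if isinstance(value, list) else [value]
            if any(isinstance(v, str) and "\u0000" in v for v in values):
                hits.append((i, key))
    return hits


# Shortened sample taken from the response above (hypothetical subset).
# The raw string keeps the JSON escape intact so json.loads decodes it.
sample = json.loads(r'''[{
  "pdf:docinfo:creator_tool": "Canon iR-ADV C5235  PDF",
  "pdf:docinfo:producer": "Adobe PSL 1.2e for Canon\u0000",
  "pdf:charsPerPage": ["0", "0"]
}]''')

print(find_null_fields(sample))  # [(0, 'pdf:docinfo:producer')]
```

The same function could be applied to the full response body after `json.loads`, since it only inspects string and list-of-string values.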
      

       

       

      Another example.

      Note the fields "pdf:docinfo:creator_tool": "DigiPath\u0000", "pdf:docinfo:producer": "DigiPath\u0000" and "xmp:CreatorTool": "DigiPath\u0000"

       

      [{
        "Content-Type": "application/pdf",
        "Last-Modified": "2006-03-02T08:53:15Z",
        "Last-Save-Date": "2006-03-02T08:53:15Z",
        "X-Parsed-By": [
          "org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.pdf.PDFParser"
        ],
        "X-TIKA:embedded_depth": "0",
        "X-TIKA:parse_time_millis": "96",
        "access_permission:assemble_document": "true",
        "access_permission:can_modify": "true",
        "access_permission:can_print": "true",
        "access_permission:can_print_degraded": "true",
        "access_permission:extract_content": "true",
        "access_permission:extract_for_accessibility": "true",
        "access_permission:fill_in_form": "true",
        "access_permission:modify_annotations": "true",
        "date": "2006-03-02T08:53:15Z",
        "dc:format": "application/pdf; version\u003d1.3",
        "dcterms:modified": "2006-03-02T08:53:15Z",
        "meta:save-date": "2006-03-02T08:53:15Z",
        "modified": "2006-03-02T08:53:15Z",
        "pdf:PDFVersion": "1.3",
        "pdf:charsPerPage": [
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0"
        ],
        "pdf:docinfo:creator_tool": "DigiPath\u0000",
        "pdf:docinfo:modified": "2006-03-02T08:53:15Z",
        "pdf:docinfo:producer": "DigiPath\u0000",
        "pdf:encrypted": "false",
        "pdf:hasXFA": "false",
        "pdf:hasXMP": "false",
        "pdf:unmappedUnicodeCharsPerPage": [
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0",
          "0"
        ],
        "xmp:CreatorTool": "DigiPath\u0000",
        "xmpTPg:NPages": "14"
      }]
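
Until this is addressed on the server side, a client could strip the null characters after parsing the response. The following is only a workaround sketch, not a fix in Tika itself; `strip_nulls` is a hypothetical helper and the record is a shortened copy of the output above.

```python
def strip_nulls(value):
    """Recursively remove U+0000 from strings inside JSON-like data."""
    if isinstance(value, str):
        return value.replace("\u0000", "")
    if isinstance(value, list):
        return [strip_nulls(v) for v in value]
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items()}
    return value  # numbers, booleans, None pass through unchanged


# Shortened record from the second example (hypothetical subset).
record = {
    "pdf:docinfo:creator_tool": "DigiPath\u0000",
    "pdf:docinfo:producer": "DigiPath\u0000",
    "pdf:charsPerPage": ["0", "0"],
    "xmpTPg:NPages": "14",
}

cleaned = strip_nulls(record)
print(cleaned["pdf:docinfo:producer"])  # DigiPath
```

This removes every null character, not just trailing ones. That is an assumption about what callers want; a stricter variant could use `rstrip("\u0000")` to touch only the end of each string.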
      

       

       

       

      Attachments

        1. Technical_manual.pdf
          1.40 MB
          Carina Antunes

          People

            Assignee: Unassigned
            Reporter: Carina Antunes (carina.antunes)