Apache NiFi / NIFI-10218

ExtractDocumentText processor does not handle certain characters when extracting from a PDF

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Extensions
    • Labels: None

    Description

      When a PDF contains certain special characters ("", "=", ">", "-"), the text that ExtractDocumentText extracts from the document shows different symbols in their place.

      I've attached two PDFs that illustrate the issue differently:

      • 625006.pdf has multiple pages. When the text is extracted from a table, certain characters show up as a ? symbol.
      • example.pdf is a single page with the same table. When the text is extracted the same characters show up as " or # symbols.
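
      To pin down exactly which characters are being substituted, it may help to compare a known-good line of the table against the extracted text. The snippet below is only a minimal sketch: the helper name `diff_characters` and both sample strings are hypothetical and are not taken from the attached PDFs or from the processor's actual output.

```python
from typing import List, Tuple


def diff_characters(expected: str, extracted: str) -> List[Tuple[int, str, str]]:
    """Return (index, expected_char, extracted_char) for each position that differs.

    Only positions present in both strings are compared.
    """
    return [
        (i, want, got)
        for i, (want, got) in enumerate(zip(expected, extracted))
        if want != got
    ]


# Hypothetical sample row; replace with text copied from the PDF and from the flow output.
expected_row = "pH >= 7 - 9"
extracted_row = "pH ?= 7 ? 9"

for index, want, got in diff_characters(expected_row, extracted_row):
    # Print the code point as well, since visually similar characters can differ.
    print(f"position {index}: expected {want!r} (U+{ord(want):04X}), extracted {got!r}")
```

      Printing the Unicode code point of the expected character can show whether the affected symbols are plain ASCII or look-alike characters from another block, which would help narrow down the cause.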

      Attachments

        1. 625006_results.png
          529 kB
          Andrew M. Lim
        2. 625006.pdf
          617 kB
          Andrew M. Lim
        3. example_results.png
          530 kB
          Andrew M. Lim
        4. example.pdf
          58 kB
          Andrew M. Lim
        5. PDF_flow.json
          12 kB
          Andrew M. Lim

            People

              Assignee: Unassigned
              Reporter: Andrew M. Lim (andrewmlim)
              Votes: 0
              Watchers: 2
