PDFBOX-2532

Text extraction fails due to the usage of the internal font mapping

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0
    • Fix Version/s: 4.0.0
    • Component/s: Text extraction
    • Labels: None

    Description

      If a PDF provides no mapping for a font (neither an encoding nor a ToUnicode mapping), we have to decide ourselves where to get a suitable mapping. We can't rely on the internal font mapping of a Type1C font, as it doesn't work in every case. See PDFBOX-2377, which provides a solution for the 1.8 branch.
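
      The lookup order described above can be sketched roughly as follows. This is a minimal, standalone illustration and not PDFBox code: the class UnicodeLookupSketch, the method toUnicode, and the plain maps standing in for the ToUnicode CMap, the encoding, and the glyph list are all hypothetical. The fallback is left as a parameter because the issue itself leaves open where the mapping should come from when the PDF supplies none.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.IntFunction;

// Hypothetical sketch; none of these names are part of the PDFBox API.
public class UnicodeLookupSketch {

    /**
     * Resolves a character code to Unicode text.
     *
     * @param code         character code taken from the content stream
     * @param toUnicodeMap stand-in for the font's ToUnicode mapping, or null if absent
     * @param encoding     stand-in for the font's encoding (code to glyph name), or null if absent
     * @param glyphList    glyph name to Unicode text
     * @param fallback     strategy used when the PDF supplies no usable mapping
     */
    public static Optional<String> toUnicode(int code,
                                             Map<Integer, String> toUnicodeMap,
                                             Map<Integer, String> encoding,
                                             Map<String, String> glyphList,
                                             IntFunction<Optional<String>> fallback) {
        // 1. An explicit ToUnicode entry wins if the PDF provides one.
        if (toUnicodeMap != null && toUnicodeMap.containsKey(code)) {
            return Optional.of(toUnicodeMap.get(code));
        }
        // 2. Otherwise go through the encoding: code -> glyph name -> Unicode.
        if (encoding != null) {
            String glyphName = encoding.get(code);
            if (glyphName != null && glyphList.containsKey(glyphName)) {
                return Optional.of(glyphList.get(glyphName));
            }
        }
        // 3. The PDF gave us nothing usable. The font's internal mapping is
        //    not reliable in every case, so the choice is delegated to the caller.
        return fallback.apply(code);
    }

    public static void main(String[] args) {
        Map<String, String> glyphList = Map.of("A", "A", "eacute", "\u00E9");
        IntFunction<Optional<String>> noFallback = c -> Optional.empty();

        // Neither mapping is present, so the result depends entirely on the fallback.
        System.out.println(toUnicode(65, null, null, glyphList, noFallback).isPresent());

        // An encoding is present, so the glyph name is resolved via the glyph list.
        System.out.println(toUnicode(65, null, Map.of(65, "A"), glyphList, noFallback).isPresent());
    }
}
```

      Keeping the fallback pluggable makes it easy to compare candidate strategies against the attached reference extractions without touching the first two steps.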

      Attachments

        1. PDFBOX2247-701542_cp_acrobat.txt
          7 kB
          Andreas Lehmkühler
        2. PDFBOX2247-701542_sa_acrobat_osx.txt
          7 kB
          Maruan Sahyoun
        3. PDFBOX2247-701542_sa_acrobat.txt
          7 kB
          Andreas Lehmkühler
        4. PDFBOX2247-701542_sa_reader_osx.txt
          7 kB
          Maruan Sahyoun
        5. PDFBOX2247-701542.pdf
          242 kB
          Andreas Lehmkühler
        6. PDFBOX2247-Debugger.png
          137 kB
          Andreas Lehmkühler

          People

            Assignee: Unassigned
            Reporter: Andreas Lehmkühler (lehmi)