PDFBOX-2532: Text extraction fails due to the usage of the internal font mapping

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0
    • Fix Version/s: 4.0.0
    • Component/s: Text extraction
    • Labels: None

    Description

      If a PDF doesn't provide any mapping (neither an encoding nor a ToUnicode mapping), we have to decide ourselves where to get a suitable mapping from. We can't rely on the internal font mapping of the Type1C font, as it doesn't work in every case. See PDFBOX-2377, which provides a solution for the 1.8 branch.
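
      To make the decision described above concrete, here is a minimal sketch of one possible fallback order. It is not PDFBox code: the class MappingFallbackSketch, the method resolve, and the plain maps standing in for the ToUnicode mapping, the encoding, the font's internal mapping, and a glyph list are all hypothetical names chosen for illustration. The policy it shows (trust the internal mapping only when its glyph name is a known glyph-list entry) is an assumption, and it would still miss fonts whose internal names are valid but wrong.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch only: none of these names are PDFBox API.
class MappingFallbackSketch {

    /**
     * Resolves a character code to Unicode text by trying the mapping
     * sources in a fixed order (an assumed policy, not PDFBox's behaviour).
     */
    static Optional<String> resolve(int code,
                                    Map<Integer, String> toUnicode,
                                    Map<Integer, String> encodingNames,
                                    Map<Integer, String> internalNames,
                                    Map<String, String> glyphList) {
        // 1. An explicit ToUnicode entry wins when the PDF provides one.
        String text = toUnicode.get(code);
        if (text != null) {
            return Optional.of(text);
        }
        // 2. Otherwise use the glyph name from the PDF's encoding, if any.
        text = lookup(encodingNames.get(code), glyphList);
        if (text != null) {
            return Optional.of(text);
        }
        // 3. Last resort: the font's internal mapping, accepted only when
        //    its glyph name is a known entry of the glyph list.
        text = lookup(internalNames.get(code), glyphList);
        return Optional.ofNullable(text);
    }

    // Returns the Unicode text for a glyph name, or null if unknown.
    private static String lookup(String glyphName, Map<String, String> glyphList) {
        return glyphName == null ? null : glyphList.get(glyphName);
    }

    public static void main(String[] args) {
        Map<String, String> glyphList = new HashMap<>();
        glyphList.put("A", "A");

        Map<Integer, String> none = new HashMap<>();   // PDF supplies no mapping
        Map<Integer, String> internal = new HashMap<>();
        internal.put(65, "A");     // standard glyph name: resolvable
        internal.put(66, "G12");   // font-specific name: not resolvable

        System.out.println(resolve(65, none, none, internal, glyphList)); // prints Optional[A]
        System.out.println(resolve(66, none, none, internal, glyphList)); // prints Optional.empty
    }
}
```

      Returning an empty Optional instead of a guessed character leaves the caller free to decide how to report unmapped codes.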

      Attachments

        1. PDFBOX2247-Debugger.png
          137 kB
          Andreas Lehmkühler
        2. PDFBOX2247-701542_sa_reader_osx.txt
          7 kB
          Maruan Sahyoun
        3. PDFBOX2247-701542_sa_acrobat_osx.txt
          7 kB
          Maruan Sahyoun
        4. PDFBOX2247-701542_sa_acrobat.txt
          7 kB
          Andreas Lehmkühler
        5. PDFBOX2247-701542_cp_acrobat.txt
          7 kB
          Andreas Lehmkühler
        6. PDFBOX2247-701542.pdf
          242 kB
          Andreas Lehmkühler

            People

              Assignee: Unassigned
              Reporter: Andreas Lehmkühler (lehmi)
              Votes: 0
              Watchers: 2
