PDFBox / PDFBOX-2377

Apparent regression in character mapping in a few files from govdocs1


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version: 1.8.7
    • Fix Version: 1.8.8
    • Component: Text extraction

    Description

      On a small number of test files in a 50k sample of PDFs from govdocs1, it appears that some characters are no longer extracted correctly in 1.8.7 compared with 1.8.6. I ran PDFBox's app.jar with ExtractText on each file; two examples follow.

      764929.pdf
      1.8.6: Lang, Astrophysical Data: Planets and Stars
      1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
      

      and

      312888.pdf
      1.8.6: Self-Assessment & Capability Description
      1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
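
      A comparison like the one above can be automated across a large sample. The following is only a minimal sketch, assuming the text of each PDF has already been extracted once per PDFBox version (for example with ExtractText) and loaded into strings. The function names `changed_lines` and `similarity` are hypothetical helpers, not part of PDFBox, and this is not necessarily how the comparison in this report was done.

```python
import difflib


def changed_lines(old_text, new_text):
    """Return (old, new) pairs for the line ranges that differ between two extractions."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical runs of lines are not interesting
        pairs.append(("\n".join(old_lines[i1:i2]), "\n".join(new_lines[j1:j2])))
    return pairs


def similarity(old_text, new_text):
    """Character-level similarity ratio in [0, 1]; 1.0 means the texts are identical."""
    return difflib.SequenceMatcher(a=old_text, b=new_text, autojunk=False).ratio()


if __name__ == "__main__":
    # Sample strings taken from the 764929.pdf example in this report.
    old = "Lang, Astrophysical Data: Planets and Stars\n"
    new = "Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,\n"
    for before, after in changed_lines(old, new):
        print("1.8.6:", before)
        print("1.8.7:", after)
    print("similarity:", round(similarity(old, new), 2))
```

      Files whose similarity falls below some chosen threshold could then be flagged for manual inspection; the threshold itself would need tuning, since harmless whitespace changes between versions also lower the ratio.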
      

      Attachments

        1. PDFBOX2247-701542.pdf
          242 kB
          Andreas Lehmkühler
        2. 764929.pdf
          63 kB
          Tim Allison
        3. 357094-1.8.8.txt
          3 kB
          Tilman Hausherr
        4. 357094-1.8.6.txt
          3 kB
          Tilman Hausherr
        5. 357094.pdf
          22 kB
          Tilman Hausherr
        6. 312888.pdf
          58 kB
          Tim Allison
        7. 290991-8.txt
          3 kB
          Tilman Hausherr
        8. 290991-7.txt
          3 kB
          Tilman Hausherr
        9. 290991-6.txt
          3 kB
          Tilman Hausherr
        10. 290991.pdf
          22 kB
          Tilman Hausherr

            People

              Assignee: Andreas Lehmkühler (lehmi)
              Reporter: Tim Allison (tallison)