PDFBOX-970

TeX-created ligatures and umlauts are not recognised

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.5.0
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
    • Environment:
      Mac OS X 10.6.6, Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)

      Description

      Ligatures in a TeX-created document are lost in v. 1.5, although v. 1.4 recognised them, e.g.

      v. 1.4      v. 1.5
      official    ocial
      effort      e ort
      fields      elds
      first       rst

      In addition, German umlauts (ä, ö, ü) are represented as ( a, o, u).
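      The dropped letters in the table above (ffi, ff, fi) are exactly the ones TeX typesets as single ligature glyphs. As a rough illustration only, and assuming the extractor maps those glyphs to the Unicode ligature code points (U+FB00 to U+FB04), they can be expanded back into plain letters with compatibility normalization. The class and method names below are hypothetical, and this is not a description of how PDFBox handles ligatures internally.

```java
import java.text.Normalizer;

public class LigatureDemo {
    /**
     * Expands compatibility characters such as U+FB01 (fi ligature)
     * into their plain-letter equivalents using NFKC normalization.
     */
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Hypothetical extractor output that still contains ligature code points:
        // U+FB03 = ffi, U+FB00 = ff, U+FB01 = fi
        String extracted = "o\uFB03cial e\uFB00ort \uFB01elds \uFB01rst";
        System.out.println(expandLigatures(extracted)); // prints "official effort fields first"
    }
}
```

      Note that NFKC also rewrites other compatibility characters (for example superscript digits), so it may change more than the ligatures in a real document.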

      1. Test2.pdf
        329 kB
        Thomas Fischer
      2. Test2-1.6.txt
        28 kB
        Thomas Fischer
      3. Test2.1.4.txt
        25 kB
        Thomas Fischer
      4. Test.pdf
        58 kB
        Thomas Fischer
      5. Test.pdf
        58 kB
        Thomas Fischer
      6. A Python Library for Provenance Recording and Querying.txt
        28 kB
        Thomas Fischer
      7. A Python Library for Provenance Recording and Querying.txt
        28 kB
        Thomas Fischer

        Activity

        Thomas Fischer added a comment -

        A PDF file and the respective text extractions with v. 1.4 and v. 1.5 from http://www.aero-grid.de/ergebnisse/publikationen/ipaw08-id43-bochner-gude-schreiber.pdf

        Andreas Lehmkühler added a comment -

        I solved the issue in revision 1078518. However, I can only confirm that it works for ligatures, as your example doesn't contain any German umlauts. Can you provide us with another example, or can you confirm that this solution also works for that kind of PDF?

        Thomas Fischer added a comment -

        I downloaded and built revision 1078518 (pdfbox-1.6.0-SNAPSHOT.jar with fontbox and jempbox). While the ligatures seem to be OK, the umlauts are not: ü is represented as u¨ etc. (not a combining ¨). Furthermore, '„', the opening German quote, is represented as '\n”\n' (a line break before and after a closing German quote). I will try to attach a test file, Test.pdf (I didn't succeed yesterday; where do I report JIRA errors?).
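        To make the reported symptom concrete: the output contains the base letter followed by a spacing diaeresis (U+00A8) rather than a combining one (U+0308). A minimal post-processing sketch, assuming that is exactly what the extracted text contains, converts the spacing mark into the combining mark and then composes it with NFC. The class and method names are hypothetical, and this is a workaround on the extracted string, not the fix applied in PDFBox.

```java
import java.text.Normalizer;

public class UmlautDemo {
    /**
     * Turns "letter + spacing diaeresis (U+00A8)" into a precomposed umlaut.
     * Assumption: the spacing diaeresis directly follows its base letter.
     */
    static String composeUmlauts(String text) {
        // Step 1: replace the spacing diaeresis with the combining diaeresis (U+0308).
        String combining = text.replaceAll("(\\p{L})\\u00A8", "$1\u0308");
        // Step 2: NFC composes base letter + combining mark into one code point.
        return Normalizer.normalize(combining, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String extracted = "u\u00A8ber"; // "u" followed by a spacing diaeresis
        System.out.println(composeUmlauts(extracted).equals("\u00FCber")); // prints "true"
    }
}
```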

        Andreas Lehmkühler added a comment -

        I can't confirm the umlaut issue. The latest snapshot works fine for me. Do you have the ICU jar on your classpath?

        The position of the German quote seems to be misinterpreted. Because it is placed very low on the line, the algorithm presumes it has to be on the next line. This was already an issue with 1.4.0.

        I guess the JIRA error occurred because of some maintenance (the infra guys just upgraded JIRA to 4.2.4).
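        A quick way to answer the classpath question is to try loading a class from the jar at runtime. The sketch below is only an illustration: the helper name is hypothetical, and probing `com.ibm.icu.text.Normalizer` (a class shipped in ICU4J) is an assumption about which ICU class is relevant, not a statement about what PDFBox itself loads.

```java
public class IcuCheck {
    /** Returns true if the named class can be loaded from the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class is missing, or present but not loadable in this environment.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the presence of this ICU4J class indicates the ICU jar is available.
        System.out.println("ICU4J available: " + isOnClasspath("com.ibm.icu.text.Normalizer"));
    }
}
```

        Running this with the same classpath as the text extraction shows whether the jar is actually visible to the JVM.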

        Thomas Fischer added a comment -

        I put icu-4.0.1.jar on my classpath, and that essentially resolved the umlaut issue: they are now represented as combined characters (I'm not quite sure what search engines do with those). However, pdfbox 1.4 didn't need the additional ICU jar; was this requirement introduced in a recent version change?
        Unfortunately, there are still some strange problems with the conversion, essentially missing characters. I am uploading a new test file and conversions using pdfbox 1.4 and 1.6 respectively; a comparison shows the errors (and some additional differences).


          People

          • Assignee:
            Unassigned
            Reporter:
            Thomas Fischer
          • Votes:
            0
            Watchers:
            1
