TeX-created ligatures and umlauts are not recognised

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.5.0
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
    • Environment:
      Mac OS X 10.6.6, Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)

      Description

      Ligatures in a TeX-created document are lost in the extracted text; v. 1.4 recognised them correctly, e.g.

      v. 1.4      v. 1.5
      official    ocial
      effort      e ort
      fields      elds
      first       rst

      In addition, German umlauts (ä, ö, ü) are represented as (a, o, u).
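For readers comparing extraction output, the ligatures in the table above correspond to Unicode compatibility characters such as U+FB01 (fi) and U+FB03 (ffi). The following is a minimal sketch, assuming the extractor emits those code points, of how they can be expanded into plain letters with the standard `java.text.Normalizer`. The class and method names are hypothetical, and this is not necessarily what PDFBox itself does.

```java
import java.text.Normalizer;

public class LigatureDemo {
    // NFKC applies compatibility decomposition, which turns ligature
    // code points such as U+FB01 into their component letters.
    static String expandLigatures(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Hypothetical extractor output containing ffi, ff and fi ligatures.
        String extracted = "o\uFB03cial e\uFB00ort \uFB01elds \uFB01rst";
        System.out.println(expandLigatures(extracted));
    }
}
```

Expanding ligatures this way keeps words such as "official" searchable as plain text.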

      1. A Python Library for Provenance Recording and Querying.txt
        28 kB
        Thomas Fischer
      2. A Python Library for Provenance Recording and Querying.txt
        28 kB
        Thomas Fischer
      3. Test.pdf
        58 kB
        Thomas Fischer
      4. Test.pdf
        58 kB
        Thomas Fischer
      5. Test2.1.4.txt
        25 kB
        Thomas Fischer
      6. Test2-1.6.txt
        28 kB
        Thomas Fischer
      7. Test2.pdf
        329 kB
        Thomas Fischer

        Activity

        Thomas Fischer created issue -
        Thomas Fischer added a comment -

        A PDF file and the respective text extractions with v. 1.4 and v. 1.5 from http://www.aero-grid.de/ergebnisse/publikationen/ipaw08-id43-bochner-gude-schreiber.pdf

        Thomas Fischer made changes -
        Field Original Value New Value
        Attachment A Python Library for Provenance Recording and Querying.txt [ 12472750 ]
        Attachment A Python Library for Provenance Recording and Querying.txt [ 12472751 ]
        Andreas Lehmkühler added a comment -

        I solved the issue in revision 1078518, but I can only confirm that it works for ligatures, as your example doesn't contain any German umlauts. Can you provide us with another example, or can you confirm that this solution also works for that kind of PDF?

        Thomas Fischer made changes -
        Attachment Test.pdf [ 12472825 ]
        Thomas Fischer added a comment -

        I downloaded and built revision 1078518 (pdfbox-1.6.0-SNAPSHOT.jar with fontbox and jempbox). While the ligatures seem to be OK, the umlauts are not: ü is represented as u¨ etc. (a spacing ¨, not a combining one). Furthermore, '„', the opening German quote, is represented as '\n”\n' (a line break before and after a closing German quote). I am trying to attach a test file, Test.pdf (I didn't succeed yesterday; where do I report JIRA errors?).

        Thomas Fischer made changes -
        Attachment Test.pdf [ 12472826 ]
        Andreas Lehmkühler added a comment -

        I can't confirm the umlaut issue. The latest snapshot works fine for me. Do you have the ICU jar on your classpath?

        The position of the German quote seems to be misinterpreted. Because it is placed very low on the line, the algorithm presumes it has to be on the next line. This was already an issue with 1.4.0.

        I guess the JIRA error occurred because of some maintenance (the infra guys just upgraded JIRA to 4.2.4).

        Thomas Fischer added a comment -

        I put icu-4.0.1.jar on my classpath and that essentially resolved the umlaut issue: they are now represented as combined characters (I'm not quite sure what search engines do with those). However, pdfbox 1.4 didn't need the additional ICU jar; was the requirement introduced in a recent version change?
        Unfortunately there are still some strange problems with the conversion, essentially missing characters. I am uploading a new test file and conversions made with pdfbox 1.4 and 1.6 respectively; a comparison shows the errors (and some additional differences).
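On the question of what downstream tools do with such characters: if "combined characters" here means a base letter followed by a combining mark (u + U+0308), the text can be composed into single code points before indexing. The sketch below uses the standard `java.text.Normalizer` and is an illustration only. The class and method names are hypothetical. It also shows that a spacing diaeresis (U+00A8), as described in the earlier comment, is not composed by NFC.

```java
import java.text.Normalizer;

public class UmlautDemo {
    // NFC composes a base letter and a following combining mark
    // into a single precomposed code point where one exists.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String combining = "fu\u0308r"; // u + COMBINING DIAERESIS (U+0308)
        String spacing = "fu\u00A8r";   // u + spacing DIAERESIS (U+00A8)

        // The combining sequence becomes the single code point U+00FC.
        System.out.println(compose(combining).equals("f\u00FCr"));
        // The spacing diaeresis is left untouched by NFC.
        System.out.println(compose(spacing).equals(spacing));
    }
}
```

Both comparisons print `true`, so a search index that normalises to NFC would treat the combining form like a precomposed ü, while the spacing form would still need separate handling.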

        Thomas Fischer made changes -
        Attachment Test2.1.4.txt [ 12472922 ]
        Attachment Test2-1.6.txt [ 12472923 ]
        Attachment Test2.pdf [ 12472924 ]
        John Hewson made changes -
        Component/s Text extraction [ 12312228 ]
        Component/s FontBox [ 12312221 ]
        John Hewson added a comment (edited) -

        I'm not getting combined characters for the umlaut with 2.0 trunk. Interestingly enough, Adobe Acrobat strips the umlaut and OSX Preview extracts it as "fu ̈r", so it's not clear that we really need to be trying to combine it.

        Update: Passing -encoding "UTF-8" to ExtractText gets me the combined characters as expected.
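The thread does not say why the encoding flag makes a difference. One plausible factor, offered here only as an assumption, is that an output charset which cannot represent a character silently replaces it. The sketch below uses only the standard library to show the effect for a combining diaeresis. The class and method names are hypothetical and unrelated to the ExtractText implementation.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Encode the text with the given charset and decode it again.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement byte ('?' for ISO-8859-1).
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String text = "fu\u0308r"; // u followed by COMBINING DIAERESIS

        // UTF-8 can represent every code point, so the text survives.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));
        // ISO-8859-1 has no combining diaeresis, so it becomes '?'.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1));
    }
}
```

Writing extracted text as UTF-8 avoids this kind of loss regardless of the platform default encoding.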

        John Hewson made changes -
        Status Open [ 1 ] Closed [ 6 ]
        Resolution Not a Problem [ 8 ]

          People

          • Assignee: Unassigned
          • Reporter: Thomas Fischer
          • Votes: 0
          • Watchers: 2