Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Problem
    • Affects Version/s: 1.3.1
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      tika-0.8

      Description

      German umlauts are not recognized in this document:
      http://www.computing.dcu.ie/~irehbein/SS08/uebung1/stts-guide.pdf

      For example, the title "Guidelines für das Tagging deutscher Textcorpora" is extracted as:

      Guidelines f

      ur das Tagging deutscher Textcorpora

      1. stts-guide.pdf
        386 kB
        Jukka Zitting

        Issue Links

          Activity

          Reinhard Schwab added a comment -

          I have now checked this with the current trunk.
          I have also seen spaces inside words.
          Example:

          das T agging deutsc her T extcorp ora

          Andreas Pieber added a comment -

          Have you done something like this?

          private static PDFont configureFont(PDFont font) {
              font.setEncoding(new WinAnsiEncoding());
              return font;
          }
          Reinhard Schwab added a comment -

          No, I have never done this. Would you recommend trying it?

          BTW, I don't get the spaces within words with an older version of PDFBox, which is installed in my app.
          This seems to be a regression.
          I can't say in which version it last worked; I sometimes test snapshots
          of PDFBox and rebuild Tika.

          This is my dependency there, in tika-parsers/pom.xml:
          <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
            <version>1.3.0-SNAPSHOT</version>
          </dependency>

          So it seems to happen in the current trunk.

          Andreas Lehmkühler added a comment -

          I guess the extra spaces were introduced with PDFBOX-828. I'll check the PDFTextStripper class later, and I hope to fix it soon.

          Reinhard Schwab added a comment -

          After reading the documentation and the API, I have no idea how to get an instance of a font or how to set a font
          during text extraction.
          The code I use is:

          PDDocument doc = PDDocument.load(new URL(urls[idx]));
          PDFTextStripper stripper = new PDFTextStripper();
          stripper.writeText(doc, new OutputStreamWriter(System.out));

          Can you be more specific and give more guidance?

          Andreas Pieber added a comment -

          It seems that this issue is different from the one I had (as Andreas pointed out).

          In any case, you are using a different approach than I did. I modified [1] with the configureFont method I posted above to handle problems with German umlauts, whereas you are extracting text with PDFTextStripper.

          [1] http://svn.apache.org/repos/asf/pdfbox/tags/1.2.1/pdfbox/src/main/java/org/apache/pdfbox/TextToPDF.java

          Jukka Zitting added a comment -

          FTR, I attached a copy of the referenced document.

          Jukka Zitting added a comment -

          Looking at the PDF stream it seems like the document implements the umlaut by explicitly positioning a special umlaut character on top of the following "u". Properly detecting such cases may require quite a bit of work, so I'm postponing this from the 1.3.0 release.
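
A minimal sketch of the kind of post-processing this would need, assuming (hypothetically) that the extractor emitted the standalone diaeresis as U+00A8 immediately before its base letter. In the actual output for this document the accent is lost entirely, so this illustrates the composition step only, not PDFBox behaviour; the class and method names are made up for the example.

```java
import java.text.Normalizer;

public class UmlautComposer {

    // Turns "diaeresis + letter" into the precomposed umlaut letter.
    // Assumption: the accent arrives as the spacing diaeresis U+00A8
    // directly before the letter it was positioned over.
    public static String compose(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean accentBeforeLetter = c == '\u00A8'
                    && i + 1 < text.length()
                    && Character.isLetter(text.charAt(i + 1));
            if (accentBeforeLetter) {
                // Emit base letter followed by the combining diaeresis U+0308.
                out.append(text.charAt(i + 1)).append('\u0308');
                i++; // skip the base letter, it has been consumed
            } else {
                out.append(c);
            }
        }
        // NFC merges letter + combining mark into one code point where possible.
        return Normalizer.normalize(out, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(compose("Guidelines f\u00A8ur das Tagging"));
    }
}
```

Detecting the overlay from glyph positions, which is the hard part described above, is not covered here.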

          Reinhard Schwab added a comment -

          Can you give me a hint where to look?
          Maybe I can contribute a patch.

          Andreas Lehmkühler added a comment -

          PDFTextStripper.processTextPosition should be a good place to start, but it won't be that easy. Have fun ...

          Andreas Lehmkühler added a comment -

          I fixed the additional space issue in revision 1023048. Unlike other fonts, Type 3 fonts provide the character width in glyph units and not in thousandths of a unit of text space.
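
The unit difference can be illustrated with a small sketch. This is not the code from revision 1023048; the class name, method names, and the sample FontMatrix are assumptions made for illustration. It relies on the PDF convention that a Type 3 font's FontMatrix maps glyph space to text space, while most other font types use a fixed scale of 1/1000.

```java
public class GlyphWidths {

    // Most font types: widths are stored in thousandths of a text-space unit.
    public static double simpleFontWidth(double storedWidth) {
        return storedWidth / 1000.0;
    }

    // Type 3 fonts: widths are in glyph space, so scale by the horizontal
    // component (first entry) of the font's own FontMatrix.
    public static double type3FontWidth(double storedWidth, double[] fontMatrix) {
        return storedWidth * fontMatrix[0];
    }

    public static void main(String[] args) {
        // Hypothetical Type 3 font whose glyph space is 100 units per text unit.
        double[] fontMatrix = {0.01, 0, 0, 0.01, 0, 0};
        double storedWidth = 50;

        double correct = type3FontWidth(storedWidth, fontMatrix);
        double mistaken = simpleFontWidth(storedWidth);

        System.out.println("Type 3 width in text space: " + correct);
        System.out.println("Width if treated as 1/1000 units: " + mistaken);
    }
}
```

With these sample numbers the mistaken width is ten times too small. A text stripper that underestimates glyph widths this way could see a gap after every glyph and insert spaces inside words.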

          Reinhard Schwab added a comment -

          Yes, I can confirm that this seems to be fixed. I no longer get the extra spaces in my test case.

          Regarding the umlauts:
          there are other special Unicode characters in the text as well,
          used not only to indicate umlauts
          but also to indicate list items.

          Example:
          je/KOUS schoner die Spatzen singen, desto/KON spater ist es.9
          je/KOUS spater der Abend, um/APPR so/ADV schoner die Gaste.
          je/KOUS spater der Abend, umso/KON schoner die Gaste.

          Does this need a special mapping?

          Some of them I don't understand yet:

          Der Begri\u000B Wortform

          in Zi\u000Bern, Satzzeichen
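
One possible reading of the \u000B characters: in TeX's OT1 font encoding, codes 0x0B to 0x0F hold the ligatures ff, fi, fl, ffi and ffl, which would turn the two samples into "Begriff" and "Ziffern". That this PDF's fonts use the OT1 layout is an assumption, not something confirmed in this issue. Under that assumption, a special mapping could look like the sketch below (the class and method names are hypothetical):

```java
public class LigatureExpander {

    // OT1 ligature slots, indexed by (code - 0x0B): 0x0B=ff ... 0x0F=ffl.
    private static final String[] LIGATURES = {"ff", "fi", "fl", "ffi", "ffl"};

    // Replaces raw OT1 ligature codes with the letters they stand for.
    public static String expand(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x0B && c <= 0x0F) {
                out.append(LIGATURES[c - 0x0B]);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "Der Begri" + (char) 0x0B + " Wortform";
        System.out.println(expand(raw));
    }
}
```

Codes 0x0C and 0x0D are also form feed and carriage return in ordinary text, so a mapping like this would only be safe for text known to come from a font with this layout.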

          Lars Torunski added a comment -

          After upgrading from PDFBox 0.8.0 to 1.2.1, while also updating the application server and changing file.encoding from iso8859 to utf-8, we are facing similar problems, probably with Unicode characters. We had assumed that updating the application server and changing file.encoding was causing this behaviour.

          I would add links to our PDFs if my comment could be restricted to jira-developers by changing the "Viewable By" option.

          John Hewson added a comment -

          Adobe Acrobat is not able to extract the umlauts from this PDF, nor are any of the other PDF readers I've tried. It looks like there's nothing we can do.


            People

            • Assignee:
              Unassigned
              Reporter:
              Reinhard Schwab
            • Votes:
              1
              Watchers:
              2
