PDFBox / PDFBOX-846

TextExtraction mixes case of text


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.1
    • Fix Version/s: 1.3.1
    • Component/s: Text extraction
    • Labels: None
    • Environment: Windows server, .NET

    Description

      When text extraction is run on a file such as http://www.organictogo.com/pdf/catering/Menu_WA_032509.pdf, the text "THAI VEGGIE WRAP" (in all caps) is extracted as
      "ThAI VeGGIe wRAP". However, examining the PDF shows that the text looks like this: "Thai V eggi e Wrap". The related text on the following lines, such as "Crisp red cabbage, cucumbers, carrots and lettuce with Thai", is extracted correctly.

      We are using this code to get the text in C#:

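      // Assumes the PDFBox Java classes are callable from .NET (for example via IKVM).
      byte[] pdfData = myWebClient.DownloadData(pdfUrl);
      string text = string.Empty;

      ByteArrayInputStream stream = new ByteArrayInputStream(pdfData);
      PDDocument doc = PDDocument.load(stream);
      try
      {
          PDFTextStripper stripper = new PDFTextStripper();
          text = stripper.getText(doc);
      }
      finally
      {
          doc.close(); // release the document even if extraction throws
      }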
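
The symptom described above (correct letters, wrong case) can be confirmed mechanically. The following is a minimal sketch in plain Java, not part of PDFBox: the class name `CaseMismatchCheck` and both method names are hypothetical, and the two strings are the ones quoted in the description. It only diagnoses the output and does not attempt to explain or fix the cause.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of the PDFBox API.
final class CaseMismatchCheck {

    // True when the strings are not identical but match once case is ignored.
    static boolean differsOnlyInCase(String expected, String actual) {
        return !expected.equals(actual) && expected.equalsIgnoreCase(actual);
    }

    // Zero-based indexes at which the characters differ,
    // compared up to the length of the shorter string.
    static List<Integer> mismatchedPositions(String expected, String actual) {
        List<Integer> positions = new ArrayList<>();
        int n = Math.min(expected.length(), actual.length());
        for (int i = 0; i < n; i++) {
            if (expected.charAt(i) != actual.charAt(i)) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String expected = "THAI VEGGIE WRAP";  // text as displayed in the PDF
        String actual = "ThAI VeGGIe wRAP";    // text as reported by the extractor
        System.out.println(differsOnlyInCase(expected, actual));   // prints "true"
        System.out.println(mismatchedPositions(expected, actual)); // prints "[1, 6, 10, 12]"
    }
}
```

The mismatched positions correspond to the letters H, E, E and W, which are the ones extracted in lower case.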

      Attachments

        1. PDFBOX846-Menu_WA_032509.pdf
          400 kB
          Andreas Lehmkühler
        2. PDFBOX846-Menu_WA_032509.txt
          20 kB
          Andreas Lehmkühler


          People

            Assignee: Andreas Lehmkühler (lehmi)
            Reporter: Mark Looi (marklooi)
            Votes: 0
            Watchers: 0

            Dates

              Created:
              Updated:
              Resolved:
