PDFBOX-349: Spaces between words ignored in scanned PDF files

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0-incubator
    • Component/s: Text extraction
    • Labels: None

    Description

      [Issue from SourceForge]
      http://sourceforge.net/tracker/index.php?func=detail&aid=1922502&group_id=78314&atid=552832

      I am using PDF-Box-0.7.3.dll with C# and have tested text extraction on two
      searchable PDFs that I scanned in from paper. Spaces between words are
      ignored in both files. I also tested another PDF file (which I
      downloaded from the internet), and it was parsed correctly. Unfortunately,
      the scanned file is 1.2MB and the upload was blocked. Please send me an email
      (gkobzeff@hotmail.com) and I will reply with the file.

      Thanks for looking into this.

      Greg

      [Comment on SourceForge]
      Date: 2008-03-23 21:24
      Sender: gkobzeff
      Logged In: YES
      user_id=2042611
      Originator: YES

      I have rescanned the document to produce a smaller file and have
      attached it.

      Thanks
      File Added: Advanced Pain Mgmt BW.pdf
      http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&file_id=271548&aid=1922502

      Attachments

        1. SpacingFix.zip
          8 kB
          Justin LeFebvre
        2. UpdatedSpacingRegressionFiles.zip
          2.52 MB
          Justin LeFebvre

            People

              Assignee: Unassigned
              Reporter: Jukka Zitting (jukkaz)