  1. PDFBox
  2. PDFBOX-800

Wrong text extract from vertical textboxes in pdf files

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.7.0
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      Windows 7, VS 2010 C#, Tika Library

      Description

      Vertical text boxes in PDF files are not extracted correctly (using the Tika library from C#).
      For example, if a PDF file contains a vertical text box "hello" (WITHOUT line breaks):
      H
      E
      L
      L
      O
      the parser returns 5 strings, each containing a single letter, even though there is NO line break after each letter.
      Is there an option to avoid this problem?
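
      Until the extractor handles this case, one possible workaround is to post-process the extracted text. The following is a minimal sketch only, assuming the extracted text arrives as one string with each letter of the vertical box on its own line. The class VerticalTextMerger and the method mergeSingleCharLines are hypothetical names for illustration and are not part of PDFBox or Tika. Note that it would also merge genuine runs of one-character lines, so it may not suit every document.

```java
import java.util.ArrayList;
import java.util.List;

public class VerticalTextMerger {

    // Hypothetical helper (not a PDFBox/Tika API): joins consecutive lines
    // that contain exactly one non-whitespace character into a single line.
    public static String mergeSingleCharLines(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (String line : text.split("\\r?\\n", -1)) {
            String trimmed = line.trim();
            if (trimmed.length() == 1) {
                // Part of a (possible) vertical text run: collect the letter.
                run.append(trimmed);
            } else {
                // Run ended: flush collected letters, then keep this line as-is.
                if (run.length() > 0) {
                    out.add(run.toString());
                    run.setLength(0);
                }
                out.add(line);
            }
        }
        if (run.length() > 0) {
            out.add(run.toString());
        }
        return String.join("\n", out);
    }

    public static void main(String[] args) {
        // Prints three lines: "Title", "HELLO", "Body text".
        System.out.println(mergeSingleCharLines("Title\nH\nE\nL\nL\nO\nBody text"));
    }
}
```

      The same logic should port directly to C# if the Tika output is consumed there.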

        Attachments

        1. problemdoc.doc
          26 kB
          Sandor Dj
        2. problemdoc.pdf
          128 kB
          Sandor Dj

              People

              • Assignee:
                Unassigned
              • Reporter:
                sandor1990 Sandor Dj
              • Votes:
                0
              • Watchers:
                1