  1. PDFBox
  2. PDFBOX-800

Wrong text extract from vertical textboxes in pdf files

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.7.0
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None
    • Environment: Windows 7, VS 2010 C#, Tika Library

    Description

      Vertical textboxes in PDF files are not extracted correctly (using the Tika library in C#).
      For example, if there is a vertical textbox "hello" in a PDF file (WITHOUT line breaks):
      H
      E
      L
      L
      O
      the parser returns 5 strings, each containing a single letter, even though there is NO line break after each letter.
      Is there an option to avoid this problem?
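
      Until such an option exists, one possible workaround is to post-process the extracted text. The sketch below is not a PDFBox or Tika feature; the class name VerticalTextMerger, the method mergeSingleCharLines, and the threshold MIN_RUN are all hypothetical names chosen for illustration. It assumes the extracted text arrives as a single string in which each letter of the vertical textbox sits on its own line, as described above, and it joins runs of such single-character lines back into one line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing helper; not part of PDFBox or Tika.
public class VerticalTextMerger {

    // Assumption: a run of at least this many consecutive single-character
    // lines is treated as one vertical word. Shorter runs are left untouched.
    private static final int MIN_RUN = 3;

    public static String mergeSingleCharLines(String text) {
        String[] lines = text.split("\\r?\\n", -1);
        List<String> out = new ArrayList<>();
        List<String> pending = new ArrayList<>(); // original lines of the current run
        StringBuilder run = new StringBuilder();  // the run's characters, joined

        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.length() == 1) {
                // Single character on its own line: extend the current run.
                pending.add(line);
                run.append(trimmed);
            } else {
                flush(pending, run, out);
                out.add(line);
            }
        }
        flush(pending, run, out);
        return String.join("\n", out);
    }

    // Emit the run as one merged line if it is long enough,
    // otherwise emit the original lines unchanged.
    private static void flush(List<String> pending, StringBuilder run, List<String> out) {
        if (pending.size() >= MIN_RUN) {
            out.add(run.toString());
        } else {
            out.addAll(pending);
        }
        pending.clear();
        run.setLength(0);
    }
}
```

      For the example above, mergeSingleCharLines("H\nE\nL\nL\nO") returns "HELLO". This is only a heuristic: it cannot distinguish a vertical textbox from text that really does consist of one character per line, and it treats characters outside the Basic Multilingual Plane as longer than one character.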

      Attachments

        1. problemdoc.doc
          26 kB
          Sandor Dj
        2. problemdoc.pdf
          128 kB
          Sandor Dj

            People

              Assignee: Unassigned
              Reporter: Sandor Dj (sandor1990)
              Votes: 0
              Watchers: 1

              Dates

                Created:
                Updated: