PDFBox / PDFBOX-3680

Extracted text in wrong order [header, footer, content]

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.0.1
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None

      Description

      When I extract the text from the attached PDF, the text comes out in the wrong order.

      Every page has a header ("Bundesrecht konsolidiert"), some content, and a footer ("www.ris.bka.gv.at Seite x von y"). The footer consists of a URL and the page number in German ("Seite x von y" means "page x of y").

      In my view, the extracted text should follow the order in which a reader sees it on the page: header, content, footer.
      When I open the file in Adobe Reader and copy the text from the page, the copied text also comes out in this order.

      The extracted text is:

      Bundesrecht konsolidiert
      www.ris.bka.gv.at Seite 1 von 35
      Gesamte Rechtsvorschrift [...] und Rechtsnachfolge

      Judging by the visible layout of the page, the extracted text should be:

      Bundesrecht konsolidiert
      Gesamte Rechtsvorschrift [...] und Rechtsnachfolge
      www.ris.bka.gv.at Seite 1 von 35

      The PDF itself and the extracted text of the first three pages are attached to this ticket.
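
      To illustrate the ordering I expect, here is a minimal, self-contained sketch that does not use PDFBox at all. It assumes the extractor emits text in the order the fragments appear in the page's content stream, and that sorting the fragments by their position on the page would give the visual order. The Fragment class, the method names, and the coordinates are hypothetical and made up for this example. As far as I know, PDFTextStripper has a setSortByPosition(true) option intended for this kind of position-based ordering, but I have not confirmed that it resolves this particular file.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ReadingOrderSketch {

    /** Hypothetical stand-in for a positioned piece of text on a page. */
    public static final class Fragment {
        final double top;   // distance from the top edge of the page, in points
        final double left;  // distance from the left edge of the page, in points
        final String text;

        public Fragment(double top, double left, String text) {
            this.top = top;
            this.left = left;
            this.text = text;
        }
    }

    /** Returns the fragment texts ordered top-to-bottom, then left-to-right. */
    public static List<String> inReadingOrder(List<Fragment> streamOrder) {
        List<Fragment> sorted = new ArrayList<>(streamOrder);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.top)
                .thenComparingDouble(f -> f.left));
        List<String> lines = new ArrayList<>();
        for (Fragment f : sorted) {
            lines.add(f.text);
        }
        return lines;
    }

    /** Fragments in the order reported above: header, footer, content. */
    public static List<Fragment> samplePage() {
        List<Fragment> page = new ArrayList<>();
        // Coordinates are invented; only their relative order matters here.
        page.add(new Fragment(40, 70, "Bundesrecht konsolidiert"));
        page.add(new Fragment(800, 70, "www.ris.bka.gv.at Seite 1 von 35"));
        page.add(new Fragment(100, 70, "Gesamte Rechtsvorschrift [...] und Rechtsnachfolge"));
        return page;
    }

    public static void main(String[] args) {
        // Prints header, content, footer (one per line).
        for (String line : inReadingOrder(samplePage())) {
            System.out.println(line);
        }
    }
}
```

      Sorting on the vertical position first and the horizontal position second is the simplest rule that reproduces the expected order for a single-column page like this one; multi-column layouts would need more than this.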

        Attachments

        1. 1_to_3_Text.txt
          8 kB
          Dominik Bauer
        2. DSG 2000, Fassung vom 27.01.2017.pdf
          370 kB
          Dominik Bauer

            People

            • Assignee: Unassigned
            • Reporter: Dominik Bauer (duffy356)
            • Votes: 0
            • Watchers: 3
