  1. PDFBox
  2. PDFBOX-1553

Offset of extracted coordinates

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 1.8.0
    • None
    • None
    • Environment: Linux Ubuntu 64 bit, Java

    Description

      Hello,

      Preamble: We are glad to use PDFBox, and I am personally grateful to all the developers who sustain this project. Good work, guys!

      We have one problem. For our application, we extract text from PDFs character by character, with the respective coordinates for each character (see the attached Parser).
      We then group the characters into words. We noticed that for some PDF documents the extracted rectangle coordinates have a strange "offset" (see the screenshots).
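
The grouping step described above can be sketched as follows. This is only an illustration, not the attached Parser.java: the `Glyph` and `Word` types, the `group` method, and the `maxGap` and `lineTolerance` thresholds are all hypothetical names chosen for the example, and the glyphs are assumed to arrive in reading order with boxes in PDF points.

```java
import java.util.ArrayList;
import java.util.List;

class WordGrouper {

    /** One extracted character and its bounding box (hypothetical type, not a PDFBox class). */
    static final class Glyph {
        final String text;
        final float x, y, width, height;

        Glyph(String text, float x, float y, float width, float height) {
            this.text = text; this.x = x; this.y = y; this.width = width; this.height = height;
        }
    }

    /** A word: the concatenated glyph text plus the union of the glyph boxes. */
    static final class Word {
        final String text;
        final float x, y, width, height;

        Word(String text, float x, float y, float width, float height) {
            this.text = text; this.x = x; this.y = y; this.width = width; this.height = height;
        }
    }

    /**
     * Groups glyphs into words. A word ends at a whitespace glyph, at a change of line
     * (y differs by more than lineTolerance), or at a horizontal gap wider than maxGap.
     */
    static List<Word> group(List<Glyph> glyphs, float maxGap, float lineTolerance) {
        List<Word> words = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        float minX = 0, minY = 0, maxX = 0, maxY = 0, lineY = 0;

        for (Glyph g : glyphs) {
            boolean blank = g.text.trim().isEmpty();
            boolean open = text.length() > 0;
            boolean breaks = open && (blank
                    || Math.abs(g.y - lineY) > lineTolerance
                    || g.x - maxX > maxGap);
            if (breaks) {
                // Close the current word before handling this glyph.
                words.add(new Word(text.toString(), minX, minY, maxX - minX, maxY - minY));
                text.setLength(0);
                open = false;
            }
            if (blank) {
                continue; // whitespace only separates words, it is never part of one
            }
            if (!open) {
                // First glyph of a new word: its box becomes the word box.
                minX = g.x; minY = g.y; maxX = g.x + g.width; maxY = g.y + g.height; lineY = g.y;
            } else {
                // Extend the word box to cover this glyph.
                minX = Math.min(minX, g.x);
                minY = Math.min(minY, g.y);
                maxX = Math.max(maxX, g.x + g.width);
                maxY = Math.max(maxY, g.y + g.height);
            }
            text.append(g.text);
        }
        if (text.length() > 0) {
            words.add(new Word(text.toString(), minX, minY, maxX - minX, maxY - minY));
        }
        return words;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("H", 10f, 100f, 6f, 10f));
        glyphs.add(new Glyph("i", 16f, 100f, 3f, 10f));
        glyphs.add(new Glyph(" ", 19f, 100f, 3f, 10f));
        glyphs.add(new Glyph("y", 22f, 100f, 6f, 10f));
        glyphs.add(new Glyph("o", 28f, 100f, 6f, 10f));
        for (Word w : group(glyphs, 2f, 2f)) {
            System.out.println(w.text + " @ x=" + w.x + ", y=" + w.y + ", w=" + w.width);
        }
    }
}
```

Because each word box is just the union of its glyph boxes, any offset in the per-character coordinates carries over unchanged to the word rectangles.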

      The offset seems to be incremental (not sure): at the top-left corner of the document the extracted coordinates are close to the real character positions, but at the bottom-right corner they are off by nearly 0.5 cm.
      If I make a selection in Adobe Reader, everything seems OK.
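
One way to tell a constant offset from one that grows across the page is to fit a straight line through pairs of reference and extracted coordinates along one axis. The sketch below is a generic least-squares fit, not a diagnosis of the attached files: the class `OffsetCheck`, the method `fitLine`, and the sample numbers in `main` are made up for illustration.

```java
class OffsetCheck {

    /**
     * Least-squares fit of extracted = scale * reference + shift along one axis.
     * Returns {scale, shift}. A scale near 1 with a non-zero shift means a constant
     * offset; a scale different from 1 means the offset grows with the coordinate.
     */
    static double[] fitLine(double[] reference, double[] extracted) {
        if (reference.length != extracted.length || reference.length < 2) {
            throw new IllegalArgumentException("need at least two coordinate pairs");
        }
        int n = reference.length;
        double sumX = 0, sumY = 0, sumXX = 0, sumXY = 0;
        for (int i = 0; i < n; i++) {
            sumX += reference[i];
            sumY += extracted[i];
            sumXX += reference[i] * reference[i];
            sumXY += reference[i] * extracted[i];
        }
        double denominator = n * sumXX - sumX * sumX;
        if (denominator == 0) {
            throw new IllegalArgumentException("reference coordinates must not all be equal");
        }
        double scale = (n * sumXY - sumX * sumY) / denominator;
        double shift = (sumY - scale * sumX) / n;
        return new double[] {scale, shift};
    }

    public static void main(String[] args) {
        // Illustrative values only: x positions measured by hand vs. x positions extracted.
        double[] reference = {0, 100, 200, 400};
        double[] extracted = {0, 102, 204, 408};
        double[] fit = fitLine(reference, extracted);
        System.out.println("scale=" + fit[0] + ", shift=" + fit[1]);
    }
}
```

With the illustrative values above the fitted scale is about 1.02 and the shift about 0, which is the pattern an offset that increases toward one corner would produce. Running the same fit separately on x and y coordinates shows whether both axes are affected.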

      I have attached two PDF files that show the offset.
      If you want to see the offset "in action", you can use our service at http://pdf2data.cloudforpeople.com/ (please do not consider this advertising).

      Could you please test these files and tell me whether this is really a bug?
      How can we resolve it?

      Thanks,
      Vitalie

      Attachments

        1. Selection in Adobe Reader.png
          155 kB
          Vitalie Bureanu
        2. Parser.java
          3 kB
          Vitalie Bureanu
        3. Extracted coordinates of rects.jpg
          75 kB
          Vitalie Bureanu
        4. EnSt11_offset.pdf
          25 kB
          Vitalie Bureanu
        5. EnSt10_offset.pdf
          74 kB
          Vitalie Bureanu

          People

            Assignee: Unassigned
            Reporter: Vitalie Bureanu (vitalie_bureanu)
            Votes: 0
            Watchers: 3

              Time Tracking

                Original Estimate: 24h
                Remaining Estimate: 24h
                Time Spent: Not Specified