PDFBOX-83

Processing horizontally first then vertically

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None

      Description

      [imported from SourceForge]
      http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1269072
      Originally submitted by tanvinguyen on 2005-08-24 13:11.

      I would like to see the implementation of coalescing
      where all words will be appended horizontally first then
      vertically. If this feature is implemented properly, all the
      fields of a table will be extracted and printed correctly
      as in the original PDF document.

      Sample: Page 2 of PDFBox References. All content of
      column Project Name will be extracted before column
      License.

      ===========
      Centric CRM
      (http://www.centriccrm.com)
      Free To Use But
      Restricted/Commercial
      The Most Advanced Open
      Source CRM Software.
      =============

      Thanks,

      -tan

      [attachment on SourceForge]
      http://sourceforge.net/tracker/download.php?group_id=78314&atid=552835&aid=1269072&file_id=146953
      HtmlOutputDev.h (text/plain), 8329 bytes
      This is the header file from PDFtoHTML

      [comment on SourceForge]
      Originally sent by tanvinguyen.
      Logged In: YES
      user_id=683822

      I uploaded an RTF file converted from a PDF file using my
      application developed in C++.

      [comment on SourceForge]
      Originally sent by tanvinguyen.
      Logged In: YES
      user_id=683822

      Ben,

      Thanks for the quick response. Generally speaking, I greatly
      appreciate your effort in developing such a wonderful
      open-source package.
      I am interested in developing a PDF to RTF converter. Its
      main features include keeping all text attributes such as
      strikethrough, underline, font attributes, and spacing. In the
      past, I successfully developed an application in C++ using
      the XPDF package and added code to do what I want.
      Now I would like to implement these features using PDFBox
      to deploy the application in a J2EE environment.

      Here's the basic algorithm they use in XPDF. First, they
      build a linked list of string nodes. These string nodes contain
      the x-y coordinates of text strings, like your TextPosition
      instance. However, their string nodes also contain the full
      bounding box: lower-left X,Y and upper-right X,Y, which they
      call xMin, yMin and xMax, yMax.
      They store all these string nodes in y-major order (sorted by
      y first, then by x).

      Then they coalesce and merge all string nodes with the
      same Y-coordinate first. With that approach I was able to
      extract the text, convert it into RTF, and maintain the same
      content and format as the PDF file.
      I am trying to figure out how to add extra information to your
      TextPosition class, so that later on I will be able to traverse
      along the major y-axis and build a list of these string nodes.
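
A minimal sketch of the coalescing step described above, not the XPDF or PDFBox implementation. `LineCoalescer`, `TextNode`, and the `yTolerance` parameter are hypothetical names introduced for illustration, and the sketch assumes a y axis that grows downward, so a smaller y is printed first. Nodes are ordered by y, grouped into lines when their yMin values fall within the tolerance, and each line is then joined left to right.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; these are not PDFBox or XPDF classes.
class LineCoalescer {

    /** A text fragment with its bounding box corners. */
    static final class TextNode {
        final String text;
        final double xMin, yMin, xMax, yMax;

        TextNode(String text, double xMin, double yMin, double xMax, double yMax) {
            this.text = text;
            this.xMin = xMin;
            this.yMin = yMin;
            this.xMax = xMax;
            this.yMax = yMax;
        }
    }

    /**
     * Groups nodes whose yMin lies within yTolerance of the first node of the
     * current line, then joins each group left to right.
     * Assumption: y grows downward, so smaller y values come first.
     */
    static List<String> coalesce(List<TextNode> nodes, double yTolerance) {
        List<TextNode> sorted = new ArrayList<>(nodes);
        sorted.sort(Comparator.comparingDouble((TextNode n) -> n.yMin));

        List<String> lines = new ArrayList<>();
        List<TextNode> current = new ArrayList<>();
        for (TextNode node : sorted) {
            boolean startsNewLine = !current.isEmpty()
                    && Math.abs(node.yMin - current.get(0).yMin) > yTolerance;
            if (startsNewLine) {
                lines.add(joinLine(current));
                current = new ArrayList<>();
            }
            current.add(node);
        }
        if (!current.isEmpty()) {
            lines.add(joinLine(current));
        }
        return lines;
    }

    /** Orders one line by xMin and joins the fragments with single spaces. */
    private static String joinLine(List<TextNode> line) {
        line.sort(Comparator.comparingDouble((TextNode n) -> n.xMin));
        StringBuilder sb = new StringBuilder();
        for (TextNode n : line) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(n.text);
        }
        return sb.toString();
    }
}
```

The tolerance is there because fragments on the same visual line rarely share an exactly equal y value; a suitable value would depend on the font size and is left to the caller in this sketch.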

      If you can provide me with the information needed to obtain the
      coordinates or position of a text string, I
      think I will be able to implement these features. I will
      contribute the code to your project.
      I uploaded a header file from XPDF, a sample PDF file which I
      tried to convert, and an RTF file.
      I am not trying to convert a "TABLE" from the PDF file. I
      understand that this concept does not exist in PDF.

      Thanks,

      Tan V. Nguyen

      [comment on SourceForge]
      Originally sent by benlitchfield.
      Logged In: YES
      user_id=601708

      Text in a PDF document is drawn at x/y locations, which
      means there is no relationship between pieces of text drawn
      in a column. If
      you can propose an algorithm to determine columns of text
      then I will implement it. As a side note, there is no such
      thing as a 'table' in a PDF document, only lines drawn between
      two points and text drawn at x/y locations. The only way
      a 'column' could be determined is by analyzing lines on the
      PDF document, which is not an easy thing to do.

      Ben Litchfield

        Activity

        Brian Carrier added a comment:

        Did you enable sorting when you extracted the text?

        I'm not sure what your proposed algorithm would give you that the current version does not already do when sorting is enabled (if you are proposing to only add spaces between table columns). As I read it, your algorithm would produce sets that have the same X coordinate, but then what? Do you print each line in the column independently? Merge them together? Can you outline your full algorithm in some form of pseudo-code?

        Preserving formatting is a large issue with PDFBox. The issue also exists with multi-column text and paragraph detection. In fact, it seems that the algorithm for tables would be the opposite of multi-column extraction. With both, you want to detect structured columns, but with tables you would want to process each column on one row, then the next row, etc. With multi-column text, you want to finish the first column before moving on to the second. Distinguishing a two-column table and its rows from two-column text with several paragraphs in each column would require a lot of processing that does not currently exist.
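
To make the contrast between the two reading orders concrete, here is a small sketch. It assumes the hard part, detecting which column and row each block of text belongs to, has already been done; `ReadingOrderSketch`, `Block`, and the integer `column` and `row` indices are hypothetical names, not anything PDFBox provides.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; column and row indices are assumed to be known.
class ReadingOrderSketch {

    /** A block of text with an already detected column and row index. */
    static final class Block {
        final String text;
        final int column;
        final int row;

        Block(String text, int column, int row) {
            this.text = text;
            this.column = column;
            this.row = row;
        }
    }

    /** Table order: every column of one row, then the next row. */
    static List<String> tableOrder(List<Block> blocks) {
        return ordered(blocks, Comparator.comparingInt((Block b) -> b.row)
                .thenComparingInt(b -> b.column));
    }

    /** Multi-column text order: finish one column, then move to the next. */
    static List<String> multiColumnOrder(List<Block> blocks) {
        return ordered(blocks, Comparator.comparingInt((Block b) -> b.column)
                .thenComparingInt(b -> b.row));
    }

    /** Copies the input, sorts the copy, and returns the texts in that order. */
    private static List<String> ordered(List<Block> blocks, Comparator<Block> order) {
        List<Block> sorted = new ArrayList<>(blocks);
        sorted.sort(order);
        List<String> texts = new ArrayList<>();
        for (Block b : sorted) {
            texts.add(b.text);
        }
        return texts;
    }
}
```

The two methods differ only in which index is the primary sort key, which is why the same page geometry cannot be ordered correctly without first deciding whether it is a table or multi-column text.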

        George Van Treeck added a comment (edited):

        I just tried the latest version and ran into the issue here (jumbled text in a PDF table). I think the following algorithm might work to fix the problem.

        First, sort all text items into sets having the same x coordinate, i.e., assume all vertically adjacent text items with the same x coordinate are part of one table cell. For each set, select a text item and locate a horizontally adjacent text item (same y coordinate). If the adjacent text item is part of another set of text items all sharing an x coordinate, then the adjacent item is part of a different table cell, which means you should concatenate all the text items in the first set and then concatenate all the text items in the adjacent set.
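
A rough sketch of the grouping idea above, under simplifying assumptions: it handles a single table row, treats items whose x coordinates round to the same whole number as one set, and assumes y grows downward. `XGroupingSketch`, `Item`, and `concatenateByX` are hypothetical names; the comment does not say how several rows would be separated, so that is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of grouping text items by x coordinate.
class XGroupingSketch {

    /** A text item with the x/y location where it is drawn. */
    static final class Item {
        final String text;
        final double x;
        final double y;

        Item(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Puts items with the same (rounded) x coordinate into one set, treats each
     * set as one table cell, and concatenates the items of each set from top to
     * bottom. Cells are returned left to right.
     * Assumptions: one table row only, and y grows downward.
     */
    static List<String> concatenateByX(List<Item> items) {
        // TreeMap keeps the sets ordered by x, i.e. left to right.
        Map<Long, List<Item>> setsByX = new TreeMap<>();
        for (Item item : items) {
            setsByX.computeIfAbsent(Math.round(item.x), k -> new ArrayList<>()).add(item);
        }

        List<String> cells = new ArrayList<>();
        for (List<Item> set : setsByX.values()) {
            set.sort(Comparator.comparingDouble((Item i) -> i.y));
            StringBuilder cell = new StringBuilder();
            for (Item i : set) {
                if (cell.length() > 0) {
                    cell.append(' ');
                }
                cell.append(i.text);
            }
            cells.add(cell.toString());
        }
        return cells;
    }
}
```

Fed the six fragments from the sample in the description, placed at three distinct x positions, this yields one string per cell. It would also merge unrelated rows that happen to share an x coordinate, which is the gap the earlier comment's "but then what?" question points at.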


          People

          • Assignee: Unassigned
          • Reporter: Anonymous
          • Votes: 1
          • Watchers: 0
