PDFBox / PDFBOX-83

Processing horizontally first then vertically

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
      None

      Description

      [imported from SourceForge]
      http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1269072
      Originally submitted by tanvinguyen on 2005-08-24 13:11.

      I would like to see coalescing implemented so that
      all words are appended horizontally first, then
      vertically. If this feature is implemented properly, all the
      fields of a table will be extracted and printed correctly,
      as in the original PDF document.

      Sample: page 2 of PDFBox References. All content of the
      Project Name column is extracted before the License
      column.

      ===========
      Centric CRM
      (http://www.centriccrm.com)
      Free To Use But
      Restricted/Commercial
      The Most Advanced Open
      Source CRM Software.
      =============

      Thanks,

      -tan

      [attachment on SourceForge]
      http://sourceforge.net/tracker/download.php?group_id=78314&atid=552835&aid=1269072&file_id=146953
      HtmlOutputDev.h (text/plain), 8329 bytes
      This is the header file from PDFtoHTML

      [comment on SourceForge]
      Originally sent by tanvinguyen.
      Logged In: YES
      user_id=683822

      I uploaded an RTF file converted from a PDF file using my
      application developed in C++.

      [comment on SourceForge]
      Originally sent by tanvinguyen.
      Logged In: YES
      user_id=683822

      Ben,

      Thanks for the quick response. Generally speaking, I highly
      appreciate your effort in developing such a wonderful
      open-source package.
      I am interested in developing a PDF to RTF converter. Its
      main features include keeping all text attributes such as
      strikethrough, underline, font attributes, and spacing. In the
      past, I successfully developed an application in C++ using
      the XPDF package and added code to do what I want.
      Now I would like to implement these features using PDFBox
      to deploy the application in a J2EE environment.

      Here's the basic algorithm they use in XPDF. First, they
      build a linked list of string nodes. These string nodes contain
      the x-y coordinates of text strings, like your TextPosition
      instance. However, their string nodes also contain all
      information about their coordinates, including the lower-left X,Y
      and upper-right X,Y. They call these xMin, yMin and xMax, yMax.
      They store all these string nodes in y-major, x-minor order.
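
The node layout and ordering described above can be sketched in Java as follows. This is only an illustration under stated assumptions, not XPDF's or PDFBox's actual code: the `StringNode` class, the `sortYMajor` helper, and the sample coordinates are all hypothetical, and y is assumed to grow downward from the top of the page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class StringNodeSort {

    /** Hypothetical node: a text run plus its bounding box (field names follow the comment). */
    static final class StringNode {
        final String text;
        final double xMin, yMin, xMax, yMax;

        StringNode(String text, double xMin, double yMin, double xMax, double yMax) {
            this.text = text;
            this.xMin = xMin;
            this.yMin = yMin;
            this.xMax = xMax;
            this.yMax = yMax;
        }
    }

    /** Returns a copy ordered y-major, x-minor (assumes smaller y is higher on the page). */
    static List<StringNode> sortYMajor(List<StringNode> nodes) {
        List<StringNode> sorted = new ArrayList<>(nodes);
        sorted.sort(Comparator.comparingDouble((StringNode n) -> n.yMin)
                .thenComparingDouble(n -> n.xMin));
        return sorted;
    }

    /** Joins the node texts with '|' so the resulting order is easy to inspect. */
    static String texts(List<StringNode> nodes) {
        StringBuilder sb = new StringBuilder();
        for (StringNode n : nodes) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(n.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates for the first table row of the sample, given in column order.
        List<StringNode> columnOrder = Arrays.asList(
                new StringNode("Centric CRM", 50, 100, 120, 110),
                new StringNode("(http://www.centriccrm.com)", 50, 112, 190, 122),
                new StringNode("Free To Use But", 200, 100, 280, 110),
                new StringNode("Restricted/Commercial", 200, 112, 310, 122));
        System.out.println(texts(sortYMajor(columnOrder)));
    }
}
```

With these sample values, the two nodes at y = 100 come out first (left to right), followed by the two nodes at y = 112.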

      Then they coalesce and merge all string nodes with the
      same Y-coordinate first. With that approach, I was able to
      extract and convert to RTF while maintaining the same content
      and format as the PDF file.
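
A minimal Java sketch of this "same Y first" coalescing step is shown below. It is an assumption-laden illustration rather than the XPDF implementation: the `StringNode` class, the `coalesce` method, the `yTolerance` parameter, and the coordinates are hypothetical, and nodes are treated as being on the same line when their y values differ by no more than the tolerance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class LineCoalescer {

    /** Hypothetical node: a text run and the position of its bounding box origin. */
    static final class StringNode {
        final String text;
        final double xMin;
        final double yMin;

        StringNode(String text, double xMin, double yMin) {
            this.text = text;
            this.xMin = xMin;
            this.yMin = yMin;
        }
    }

    /** Merges nodes whose y values lie within yTolerance into one line, ordered left to right. */
    static List<String> coalesce(List<StringNode> nodes, double yTolerance) {
        List<StringNode> sorted = new ArrayList<>(nodes);
        sorted.sort(Comparator.comparingDouble((StringNode n) -> n.yMin));

        List<String> lines = new ArrayList<>();
        List<StringNode> current = new ArrayList<>();
        for (StringNode node : sorted) {
            // Start a new line when this node is too far below the first node of the current line.
            if (!current.isEmpty()
                    && Math.abs(node.yMin - current.get(0).yMin) > yTolerance) {
                lines.add(joinLeftToRight(current));
                current.clear();
            }
            current.add(node);
        }
        if (!current.isEmpty()) {
            lines.add(joinLeftToRight(current));
        }
        return lines;
    }

    /** Sorts one line's nodes by x and joins their text with single spaces. */
    private static String joinLeftToRight(List<StringNode> line) {
        List<StringNode> byX = new ArrayList<>(line);
        byX.sort(Comparator.comparingDouble((StringNode n) -> n.xMin));
        StringBuilder sb = new StringBuilder();
        for (StringNode n : byX) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(n.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates for the sample row, supplied column by column.
        List<StringNode> nodes = Arrays.asList(
                new StringNode("Centric CRM", 50, 100),
                new StringNode("(http://www.centriccrm.com)", 50, 112),
                new StringNode("Free To Use But", 200, 100),
                new StringNode("Restricted/Commercial", 200, 112),
                new StringNode("The Most Advanced Open", 350, 100),
                new StringNode("Source CRM Software.", 350, 112));
        for (String line : coalesce(nodes, 1.0)) {
            System.out.println(line);
        }
    }
}
```

The tolerance is a design choice of this sketch: glyph positions on one visual line rarely share exactly the same y value, so an exact equality test would split lines that a reader sees as one.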
      I am trying to figure out how to add extra information to your
      TextPosition class so that, later on, I will be able to traverse
      along the major y-axis and build a list of these string nodes.

      If you can provide me with the information needed to obtain all
      the coordinates or the position of a text string, I
      think I will be able to implement these features. I will
      contribute this code to your project.
      I uploaded a header file from XPDF, a sample PDF file which I
      tried to convert, and an RTF file.
      I am not trying to convert a "TABLE" from a PDF file. I
      understand that this concept does not exist in PDF.

      Thanks,

      Tan V. Nguyen

      [comment on SourceForge]
      Originally sent by benlitchfield.
      Logged In: YES
      user_id=601708

      Text in a PDF document is drawn at x/y locations, which
      means there is no relationship between pieces of text drawn in
      a column. If you can propose an algorithm to determine columns
      of text, then I will implement it. As a side note, there is no such
      thing as a 'table' in a PDF document, only lines drawn between
      two points and text drawn at x/y locations. The only way
      a 'column' of text could be determined is by analyzing lines on
      the PDF document, which is not an easy thing to do.

      Ben Litchfield

        Activity

        Anonymous created issue -
        Jukka Zitting made changes -
        Field        Original Value   New Value
        Priority                      Major [ 3 ]
        John Hewson made changes -
        Status       Open [ 1 ]       Closed [ 6 ]
        Resolution                    Not a Problem [ 8 ]

          People

          • Assignee:
            Unassigned
            Reporter:
            Anonymous
          • Votes:
            1 Vote for this issue