  1. PDFBox
  2. PDFBOX-83

Processing horizontally first then vertically


    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:


      [imported from SourceForge]
      Originally submitted by tanvinguyen on 2005-08-24 13:11.

      I would like to see an implementation of coalescing
      in which all words are appended horizontally first, then
      vertically. If this feature is implemented properly, all the
      fields of a table will be extracted and printed correctly,
      as in the original PDF document.

      Sample: page 2 of PDFBox References. All content of the
      column Project Name will be extracted before Column

      Centric CRM
      Free To Use But
      The Most Advanced Open
      Source CRM Software.



      [attachment on SourceForge]
      HtmlOutputDev.h (text/plain), 8329 bytes
      This is the header file from PDFtoHTML

      [comment on SourceForge]
      Originally sent by tanvinguyen.
      Logged In: YES

      I uploaded an RTF file converted from a PDF file using my
      application developed in C++.

      [comment on SourceForge]
      Originally sent by tanvinguyen.
      Logged In: YES


      Thanks for the quick response. Generally speaking, I highly
      appreciate your effort in developing such a wonderful
      open-source package.
      I am interested in developing a PDF to RTF converter. Its
      main features include keeping all text attributes such as
      strikethrough, underline, font attributes, and spacing. In the
      past, I successfully developed an application in C++ using
      the XPDF package and added code to do what I want.
      Now I would like to implement these features using PDFBox
      so that I can deploy the application in a J2EE environment.

      Here's the basic algorithm they use in XPDF. First, they
      build a linked list of string nodes. These string nodes contain
      the x-y coordinates of text strings, like your TextPosition
      instances. However, their string nodes also contain all
      information about their coordinates, including lower-left X,Y
      and upper-right X,Y. They call these yMin, yMax and xMin, xMax.
      They store all these string nodes sorted by y first, then by x.
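
      The string-node list described above can be sketched roughly as
      follows. This is only an illustration, not XPDF or PDFBox code:
      the StringNodeSort and StringNode names are hypothetical, and
      the sketch assumes that a smaller yMin means a row nearer the
      top of the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; neither class is part of XPDF or PDFBox.
final class StringNodeSort {

    /** A text fragment together with its bounding box. */
    static final class StringNode {
        final String text;
        final double xMin, yMin, xMax, yMax;

        StringNode(String text, double xMin, double yMin, double xMax, double yMax) {
            this.text = text;
            this.xMin = xMin;
            this.yMin = yMin;
            this.xMax = xMax;
            this.yMax = yMax;
        }
    }

    /** Returns a copy ordered by yMin first, then by xMin (y-major order). */
    static List<StringNode> sortYMajor(List<StringNode> nodes) {
        List<StringNode> sorted = new ArrayList<>(nodes);
        sorted.sort(Comparator.comparingDouble((StringNode n) -> n.yMin)
                .thenComparingDouble(n -> n.xMin));
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up coordinates: two rows (y = 100, 120) and two columns (x = 10, 200).
        List<StringNode> nodes = new ArrayList<>();
        nodes.add(new StringNode("A2", 10, 120, 40, 130));
        nodes.add(new StringNode("B1", 200, 100, 230, 110));
        nodes.add(new StringNode("A1", 10, 100, 40, 110));

        for (StringNode n : sortYMajor(nodes)) {
            System.out.println(n.text); // prints A1, B1, A2 on separate lines
        }
    }
}
```

      Sorting on exact yMin values only works when fragments on one
      row share the same baseline value; real documents would likely
      need a tolerance.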

      Then they coalesce and merge all string nodes with the
      same Y-coordinate first. That is how I was able to extract the
      text, convert it into RTF, and maintain the same content and
      format as the PDF file.
      I am trying to figure out how to add extra information to your
      TextPosition class so that, later on, I will be able to traverse
      along the major y-axis and build a list of these string nodes.
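
      A minimal sketch of the "same Y-coordinate first" coalescing
      step, under stated assumptions: RowCoalescer, StringNode and the
      yTolerance parameter are hypothetical names introduced for
      illustration, fragments are treated as one row when their yMin
      values differ by at most the tolerance, and fragments are joined
      with a single space. It is not the XPDF implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; not taken from XPDF or PDFBox.
final class RowCoalescer {

    /** A text fragment with the lower-left corner of its bounding box. */
    static final class StringNode {
        final String text;
        final double xMin, yMin;

        StringNode(String text, double xMin, double yMin) {
            this.text = text;
            this.xMin = xMin;
            this.yMin = yMin;
        }
    }

    /** Groups nodes into rows by yMin, then joins each row left to right. */
    static List<String> coalesceRows(List<StringNode> nodes, double yTolerance) {
        List<StringNode> byY = new ArrayList<>(nodes);
        byY.sort(Comparator.comparingDouble((StringNode n) -> n.yMin));

        List<String> lines = new ArrayList<>();
        List<StringNode> row = new ArrayList<>();
        for (StringNode node : byY) {
            // Start a new row when this node is too far from the row's first node.
            if (!row.isEmpty() && Math.abs(node.yMin - row.get(0).yMin) > yTolerance) {
                lines.add(joinRow(row));
                row = new ArrayList<>();
            }
            row.add(node);
        }
        if (!row.isEmpty()) {
            lines.add(joinRow(row));
        }
        return lines;
    }

    /** Orders one row by xMin and concatenates the texts with spaces. */
    private static String joinRow(List<StringNode> row) {
        row.sort(Comparator.comparingDouble((StringNode n) -> n.xMin));
        StringBuilder sb = new StringBuilder();
        for (StringNode n : row) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(n.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up input in column-first order: A1, A2, then B1, B2.
        List<StringNode> nodes = new ArrayList<>();
        nodes.add(new StringNode("A1", 10, 100));
        nodes.add(new StringNode("A2", 10, 120));
        nodes.add(new StringNode("B1", 200, 100.4));
        nodes.add(new StringNode("B2", 200, 120));

        System.out.println(coalesceRows(nodes, 1.0)); // prints [A1 B1, A2 B2]
    }
}
```

      The input arrives column by column, but the output reads row by
      row, which is the horizontal-first ordering requested in this
      issue.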

      If you can provide me with the information needed to obtain
      the coordinates or position of a text string, I
      think I will be able to implement these features. I will
      contribute this code to your project.
      I uploaded a header file from XPDF, a sample PDF file which I
      tried to convert, and an RTF file.
      I am not trying to convert a "TABLE" from the PDF file. I
      understand that this concept does not exist in PDF.


      Tan V. Nguyen

      [comment on SourceForge]
      Originally sent by benlitchfield.
      Logged In: YES

      Text in a PDF document is drawn at x/y locations, which
      means there is no relationship between pieces of text drawn in
      a column. If you can propose an algorithm to determine columns
      of text, then I will implement it. As a side note, there is no
      such thing as a 'table' in a PDF document, only lines drawn
      between two points and text drawn at x/y locations. The only way
      a 'column' of text could be determined is by analyzing lines on
      the PDF document, which is not an easy thing to do.

      Ben Litchfield



