[PDFBOX-83] Processing horizontally first then vertically - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Not A Problem
Affects Version/s: None
Fix Version/s: None
Component/s: Text extraction
Labels: None

Description

[imported from SourceForge]
http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1269072
Originally submitted by tanvinguyen on 2005-08-24 13:11.

I would like to see an implementation of coalescing in
which all words are appended horizontally first, then
vertically. If this feature is implemented properly, all the
fields of a table will be extracted and printed in the same
order as in the original PDF document.

Sample: page 2 of the PDFBox References document. All content
of the Project Name column is extracted before the License
column:

===========
Centric CRM
(http://www.centriccrm.com)
Free To Use But
Restricted/Commercial
The Most Advanced Open
Source CRM Software.
=============

Thanks,

-tan

[attachment on SourceForge]
http://sourceforge.net/tracker/download.php?group_id=78314&atid=552835&aid=1269072&file_id=146953
HtmlOutputDev.h (text/plain), 8329 bytes
This is the header file from PDFtoHTML

[comment on SourceForge]
Originally sent by tanvinguyen.
Logged In: YES
user_id=683822

I uploaded an RTF file converted from a PDF file using my
application developed in C++.

[comment on SourceForge]
Originally sent by tanvinguyen.
Logged In: YES
user_id=683822

Ben,

Thanks for the quick response. Generally speaking, I highly
appreciate your effort in developing such a wonderful
open-source package.
I am interested in developing a PDF to RTF converter. Its
main features include keeping all text attributes such as
strikethrough, underline, font attributes, and spacing. In the
past, I successfully developed an application in C++ using
the XPDF package and added code to do what I wanted.
Now I would like to implement these features using PDFBox
so that I can deploy the application in a J2EE environment.

Here's the basic algorithm used in XPDF. First, it builds
a linked list of string nodes. These string nodes contain the
x-y coordinates of text strings, like your TextPosition
instances; however, the string nodes also contain the full
bounding box, that is, the lower-left X,Y and the upper-right
X,Y, which they call xMin, yMin and xMax, yMax.
All these string nodes are stored in y-major, x-minor order.

Then they coalesce and merge all string nodes with the
same Y coordinate first. That is how I was able to extract
the text and convert it into RTF while maintaining the content
and format of the PDF file.
I am trying to figure out how to add extra information to your
TextPosition class so that, later on, I can traverse the nodes
in y-major order and build a list of these string nodes.
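
To make the row-first coalescing described above concrete, here is
a minimal sketch in plain Java. It is not XPDF or PDFBox code:
TextNode, RowCoalescer, coalesce and the yTolerance parameter are
hypothetical names introduced only for illustration, and the sketch
assumes each node already carries its bounding box and that nodes
whose yMin values differ by no more than yTolerance belong to the
same row.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RowCoalescer {

    /** Hypothetical string node: text plus its bounding box (not a PDFBox or XPDF type). */
    static final class TextNode {
        final String text;
        final double xMin, yMin, xMax, yMax;

        TextNode(String text, double xMin, double yMin, double xMax, double yMax) {
            this.text = text;
            this.xMin = xMin;
            this.yMin = yMin;
            this.xMax = xMax;
            this.yMax = yMax;
        }
    }

    /**
     * Groups nodes into rows by yMin (within yTolerance of the first node
     * in the row), then joins each row left to right with single spaces.
     */
    static List<String> coalesce(List<TextNode> nodes, double yTolerance) {
        // Work on a copy so the caller's list is not reordered.
        List<TextNode> sorted = new ArrayList<>(nodes);
        sorted.sort(Comparator.comparingDouble((TextNode n) -> n.yMin));

        List<String> lines = new ArrayList<>();
        List<TextNode> row = new ArrayList<>();
        double rowY = 0;
        for (TextNode n : sorted) {
            // A node too far below the current row's first node starts a new row.
            if (!row.isEmpty() && Math.abs(n.yMin - rowY) > yTolerance) {
                lines.add(joinRow(row));
                row.clear();
            }
            if (row.isEmpty()) {
                rowY = n.yMin;
            }
            row.add(n);
        }
        if (!row.isEmpty()) {
            lines.add(joinRow(row));
        }
        return lines;
    }

    /** Orders one row by xMin and concatenates its texts. */
    private static String joinRow(List<TextNode> row) {
        row.sort(Comparator.comparingDouble((TextNode n) -> n.xMin));
        StringBuilder sb = new StringBuilder();
        for (TextNode n : row) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(n.text);
        }
        return sb.toString();
    }
}
```

Feeding it the six strings from the sample in the description, with
invented coordinates that place "Centric CRM", "Free To Use But" and
"The Most Advanced Open" on one row and the remaining three strings
on the next, yields two lines that read across the columns instead
of down them. A real converter would also need to pick the tolerance
from the font size and decide how much horizontal gap warrants a
space or a tab.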

If you can provide the information needed to obtain the full
coordinates or position of a text string, I think I will be
able to implement these features. I will contribute the code
to your project.
I uploaded a header file from XPDF, a sample PDF file that I
tried to convert, and an RTF file.
I am not trying to convert a "TABLE" from a PDF file. I
understand that this concept does not exist in PDF.

Thanks,

Tan V. Nguyen

[comment on SourceForge]
Originally sent by benlitchfield.
Logged In: YES
user_id=601708

Text in a PDF document is drawn at x/y locations, which
means there is no relationship between pieces of text drawn
in a column. If you can propose an algorithm to determine
columns of text, then I will implement it. As a side note,
there is no such thing as a 'table' in a PDF document, only
lines drawn between two points and text drawn at x/y locations.
The only way a 'column' of text could be determined is by
analyzing the lines on the PDF document, which is not an easy
thing to do.

Ben Litchfield

People

Assignee: Unassigned

Reporter: Anonymous

Votes: 1

Watchers: 1

Dates

Created: 24/Aug/05 20:11

Updated: 17/Jun/14 20:05

Resolved: 17/Jun/14 20:05