[PDFBOX-7] extract information from tagged PDF - ASF JIRA

Details

Type: New Feature
Status: Closed
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.1.0
Component/s: PDModel
Labels: None

Description

[imported from SourceForge]
http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=805623
Originally submitted by benlitchfield on 2003-09-13 07:38.

Add the ability to extract information from a tagged PDF
document. See taggedPDF.pdf for an example.

[comment on SourceForge]
Originally sent by qumar.
Logged In: YES
user_id=1468838

Hi,
We have to parse the PDF object structure tree; all
structural elements are inside the object tree (see e.g.
PDF Reference 1.4, chapter 9.6, "Logical Structure").

We also have to parse the PDF page streams to extract the
drawing and text operations; these contain the actual content
of the structural elements. This content is surrounded by
BMC/EMC tags, which carry the information about which element
object the enclosed content belongs to. This is what I got
from the PDF Reference.

Regards,
Qumar.
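
The page-stream step described above can be sketched as follows. This is a minimal illustration, not PDFBox code: the class name MarkedContentSketch and the method textByMcid are hypothetical, and the input is a heavily simplified content stream. It assumes that each marked-content sequence is opened by BDC with a property list holding a marked-content identifier (/MCID) and closed by EMC, that sequences are not nested, and that text is shown only with unescaped literal strings followed by Tj.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox.
class MarkedContentSketch {

    // Matches "/Tag <</MCID n>> BDC ... EMC".
    // Assumption: no nesting and no "EMC" inside string literals.
    private static final Pattern SEQUENCE = Pattern.compile(
            "/\\w+\\s*<<\\s*/MCID\\s+(\\d+)\\s*>>\\s*BDC(.*?)EMC",
            Pattern.DOTALL);

    // Matches a literal string shown with Tj, e.g. "(Header1) Tj".
    // Assumption: no escaped or nested parentheses in the string.
    private static final Pattern SHOW_TEXT =
            Pattern.compile("\\(([^()]*)\\)\\s*Tj");

    // Groups the text shown inside each marked-content sequence by its MCID.
    static Map<Integer, String> textByMcid(String contentStream) {
        Map<Integer, String> result = new LinkedHashMap<>();
        Matcher seq = SEQUENCE.matcher(contentStream);
        while (seq.find()) {
            int mcid = Integer.parseInt(seq.group(1));
            StringBuilder text = new StringBuilder();
            Matcher show = SHOW_TEXT.matcher(seq.group(2));
            while (show.find()) {
                text.append(show.group(1));
            }
            // If the same MCID occurs more than once, append in stream order.
            result.merge(mcid, text.toString(), String::concat);
        }
        return result;
    }

    public static void main(String[] args) {
        String stream =
                "/H1 <</MCID 0>> BDC BT (Header1) Tj ET EMC\n"
              + "/P <</MCID 1>> BDC BT (This is a para 1) Tj ET EMC\n";
        textByMcid(stream).forEach(
                (id, text) -> System.out.println(id + " -> " + text));
    }
}
```

The resulting map would then be joined with the structure tree, whose elements refer to their content by the same identifiers. A real implementation would need a proper content-stream parser and font decoding rather than regular expressions.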

[comment on SourceForge]
Originally sent by benlitchfield.
Logged In: YES
user_id=601708

http://www.irs.gov/pub/irs-access/f1040ez_accessible.pdf
would be a good form to start with.

If you look, you will notice they are putting labels on the
form fields. These labels contain metadata critical to
building tax software in rapid fashion. Without this
metadata, the name of the form field is meaningless. It would
be nice to extract this information so I can combine it with
other data about the field (name, type, location, etc.). I
already know PDFBox can extract the other information about
the fields. I haven't done it with PDFBox, but I did it
with iText.

[comment on SourceForge]
Originally sent by benlitchfield.
Logged In: YES
user_id=601708

More comments from users

Tagged PDF will be a big thing in government because
federal government procurement of Acrobat publishing
technology falls under Section 508. States will likely
follow.

see:
www.section508.gov

http://www.irs.gov/pub/irs-access/
or
ftp://ftp.irs.gov/pub/irs-access/

[comment on SourceForge]
Originally sent by qumar.
Logged In: YES
user_id=1468838

Hi,

I was reading the PDF specification and learned that the
structure information of a PDF is exposed through the PDSEdit
layer. The PDSEdit layer gives access to the structure tree
within a PDF, and its methods and objects are prefixed with
PDS. So how can we get access to the PDSEdit layer of a PDF?

[comment on SourceForge]
Originally sent by qumar.
Logged In: YES
user_id=1468838

It would be nice if PDFBox could provide the ability to
extract information from tagged PDF. As Adobe Acrobat Reader
shows the tags of a PDF, PDFBox should also be able to read
them.

For example, suppose we have a PDF file with para1 under
header1, para2 under header2, and a table with rows and
columns, something like:

Header1
This is a para 1 ,it describes about a disease.
Header2
This is a para2,describes remedies of disease.
Table
A B
C D

Now the tagged PDF looks like this in Adobe Acrobat Reader:

<Heading 1>
Header1
<Normal>
This is a para 1 ,it describes about a disease.
<Heading 1>
Header2
<Normal>
This is a para2,describes remedies of disease.
<Heading 1>
Table
<Table>
<TBody>
<TR>
<TD>
<Normal>
A
<TD>
<Normal>
B
<TR>
<TD>
<Normal>
C
<TD>
<Normal>
D

How can we extract Heading 1, Heading 2, and the tabular data
using PDFBox?

This is a good feature that should be added to the PDFBox
armory.

Please provide this feature.
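
To make the request concrete, here is a rough sketch of what extracting the headings and the table from such a tag tree could look like. It does not use PDFBox: TagTreeSketch, TagNode, headings, tableRows, and sample are all hypothetical names, and the tree is built by hand from the example above using the tag names Acrobat displays ("Heading 1", "Normal", "Table", "TBody", "TR", "TD").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a tag tree for illustration; not a PDFBox API.
class TagTreeSketch {

    static class TagNode {
        final String type;        // tag name, e.g. "Heading 1", "TR", "TD"
        final String text;        // this element's own text, or null
        final List<TagNode> kids; // child elements in document order

        TagNode(String type, String text, TagNode... kids) {
            this.type = type;
            this.text = text;
            this.kids = Arrays.asList(kids);
        }

        // Text of this node followed by the text of all descendants.
        String allText() {
            StringBuilder sb = new StringBuilder();
            if (text != null) sb.append(text);
            for (TagNode kid : kids) sb.append(kid.allText());
            return sb.toString();
        }
    }

    // Collects the text of every element whose tag starts with "Heading".
    static List<String> headings(TagNode root) {
        List<String> out = new ArrayList<>();
        collectHeadings(root, out);
        return out;
    }

    private static void collectHeadings(TagNode node, List<String> out) {
        if (node.type.startsWith("Heading")) out.add(node.allText());
        for (TagNode kid : node.kids) collectHeadings(kid, out);
    }

    // Returns one list of cell texts per TR element found in the tree.
    static List<List<String>> tableRows(TagNode root) {
        List<List<String>> out = new ArrayList<>();
        collectRows(root, out);
        return out;
    }

    private static void collectRows(TagNode node, List<List<String>> out) {
        if (node.type.equals("TR")) {
            List<String> row = new ArrayList<>();
            for (TagNode cell : node.kids) {
                if (cell.type.equals("TD")) row.add(cell.allText());
            }
            out.add(row);
            return; // nested tables inside cells are ignored in this sketch
        }
        for (TagNode kid : node.kids) collectRows(kid, out);
    }

    // Hand-built tree mirroring the example document.
    static TagNode sample() {
        return new TagNode("Document", null,
            new TagNode("Heading 1", "Header1"),
            new TagNode("Normal", "This is a para 1 ,it describes about a disease."),
            new TagNode("Heading 1", "Header2"),
            new TagNode("Normal", "This is a para2,describes remedies of disease."),
            new TagNode("Heading 1", "Table"),
            new TagNode("Table", null,
                new TagNode("TBody", null,
                    new TagNode("TR", null,
                        new TagNode("TD", null, new TagNode("Normal", "A")),
                        new TagNode("TD", null, new TagNode("Normal", "B"))),
                    new TagNode("TR", null,
                        new TagNode("TD", null, new TagNode("Normal", "C")),
                        new TagNode("TD", null, new TagNode("Normal", "D"))))));
    }

    public static void main(String[] args) {
        TagNode root = sample();
        System.out.println(headings(root));
        System.out.println(tableRows(root));
    }
}
```

In a real tagged PDF the tree would come from the document's structure tree root, and each element's text would be resolved from the marked content on the pages rather than stored on the node.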

Attachments

PDFBOX-7_patch_00.txt                 21/Dec/09 13:15  22 kB  Johannes Koch
PDFBOX-7_patch_01.txt                 13/Jan/10 14:57  49 kB  Johannes Koch
PDFBOX-7_patch_02.txt                 02/Mar/10 14:47  69 kB  Johannes Koch
PDFBOX-7_patch_03.txt                 02/Mar/10 15:12  66 kB  Johannes Koch
PDFBOX-7_patch_04.txt                 08/Mar/10 09:07  70 kB  Johannes Koch
PDFMarkedContentExtractor.properties  02/Mar/10 15:15   3 kB  Johannes Koch

Issue Links

relates to

PDFBOX-48 Create a tagged PDF (Closed)
PDFBOX-67 Implement StructTreeRoot/StructTree classes in the PDModel (Closed)

People

Assignee: Unassigned
Reporter: Anonymous
Votes: 0
Watchers: 0

Dates

Created: 13/Sep/03 14:38
Updated: 30/Mar/10 08:23
Resolved: 11/Mar/10 08:38