  1. PDFBox
  2. PDFBOX-7

extract information from tagged PDF

    Details

    • Type: New Feature
    • Status: Closed
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: PDModel
    • Labels:
      None

      Description

      [imported from SourceForge]
      http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=805623
      Originally submitted by benlitchfield on 2003-09-13 07:38.

      Add the ability to extract information from a tagged PDF
      document. See taggedPDF.pdf for an example.

      [comment on SourceForge]
      Originally sent by qumar.
      Logged In: YES
      user_id=1468838

      Hi,
      as far as I can tell from the PDF Reference, two things are
      needed:

      • parse the PDF structure tree; all structural elements are
        inside this object tree (see e.g. PDF Reference 1.4,
        chapter 9.6 "Logical Structure").
      • parse the PDF page streams to extract drawing and text
        operations; these contain the actual content of the
        structural elements. This content is surrounded by
        marked-content operators (BMC or BDC ... EMC), which
        indicate the element object to which the enclosed content
        belongs.
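
The second step above can be sketched in plain Java. This is only an illustration and not PDFBox code: the class `MarkedContentSketch` and the method `textByMcid` are hypothetical names, the content stream is assumed to be already decoded into a string, and a regular expression stands in for a real content-stream tokenizer (it ignores escaped parentheses, `TJ` arrays, and named property dictionaries). It also assumes the usual tagged-PDF convention that the link to a structure element is an `/MCID` entry in the property dictionary of a `BDC` operator.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: groups text shown by Tj operators by the MCID
// of the innermost enclosing marked-content sequence that has one.
class MarkedContentSketch {

    // Alternatives: BDC with inline dictionary (group 1 = dictionary body),
    // BMC (group 2), a literal string followed by Tj (group 3), EMC (group 4).
    private static final Pattern TOKEN = Pattern.compile(
          "/\\w+\\s*<<([^>]*)>>\\s*BDC"
        + "|(/\\w+\\s+BMC)"
        + "|\\(([^)]*)\\)\\s*Tj"
        + "|(\\bEMC\\b)");

    private static final Pattern MCID = Pattern.compile("/MCID\\s+(\\d+)");

    static Map<Integer, String> textByMcid(String contentStream) {
        Map<Integer, String> result = new LinkedHashMap<>();
        Deque<Integer> open = new ArrayDeque<>(); // -1 marks a sequence without MCID
        Matcher m = TOKEN.matcher(contentStream);
        while (m.find()) {
            if (m.group(1) != null) {            // BDC: look for /MCID in the dictionary
                Matcher id = MCID.matcher(m.group(1));
                open.push(id.find() ? Integer.parseInt(id.group(1)) : -1);
            } else if (m.group(2) != null) {     // BMC: no property dictionary
                open.push(-1);
            } else if (m.group(3) != null) {     // Tj: attribute text to nearest MCID
                for (int id : open) {            // iterates innermost sequence first
                    if (id >= 0) {
                        result.merge(id, m.group(3), String::concat);
                        break;
                    }
                }
            } else if (!open.isEmpty()) {        // EMC: close the innermost sequence
                open.pop();
            }
        }
        return result;
    }
}
```

The resulting map from MCID to text is what would then be joined with the structure tree from the first step, since structure elements refer to their content by the same MCID.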

      Regards,
      Qumar.

      [comment on SourceForge]
      Originally sent by benlitchfield.
      Logged In: YES
      user_id=601708

      http://www.irs.gov/pub/irs-access/f1040ez_accessible.pdf
      would be a good form to start with.

      If you notice, they are putting labels on the form fields.
      These labels contain metadata critical to building tax
      software rapidly. Without this metadata, the
      name of the form field is meaningless. It would be nice to
      extract this information so I can combine it with other
      data about the field (name, type, location, etc.). I
      already know PDFBox can extract the other information about
      the fields. I haven't done it with PDFBox, but I did it
      with iText.

      [comment on SourceForge]
      Originally sent by benlitchfield.
      Logged In: YES
      user_id=601708

      More comments from users

      Tagged PDF will be a big thing in government because
      federal government procurement of Acrobat publishing
      technology falls under Section 508. States will likely
      follow.

      see:
      www.section508.gov

      http://www.irs.gov/pub/irs-access/
      or
      ftp://ftp.irs.gov/pub/irs-access/

      [comment on SourceForge]
      Originally sent by qumar.
      Logged In: YES
      user_id=1468838

      Hi,

      I was reading the PDF specification and learned that the
      structure information of a PDF is exposed through the
      PDSEdit layer. The PDSEdit layer gives access to the
      structure tree within a PDF, and its methods and objects
      are prefixed with PDS. So how can we get access to the
      PDSEdit layer of a PDF?

      [comment on SourceForge]
      Originally sent by qumar.
      Logged In: YES
      user_id=1468838

      It would be nice if PDFBox could provide the ability to
      extract information from tagged PDF. Since Adobe Acrobat
      Reader shows the tags of a PDF, PDFBox should also be able
      to read tagged PDFs.

      For example, suppose we have a PDF file with para 1 under
      Header1, para 2 under Header2, and a table with rows and
      columns, something like:

      Header1
      This is a para 1 ,it describes about a disease.
      Header2
      This is a para2,describes remedies of disease.
      Table
      A B
      C D

      The tagged PDF then looks like this in Adobe Acrobat Reader:

      <Heading 1>
      Header1
      <Normal>
      This is a para 1 ,it describes about a disease.
      <Heading 1>
      Header2
      <Normal>
      This is a para2,describes remedies of disease.
      <Heading 1>
      Table
      <Table>
        <TBody>
          <TR>
            <TD>
              <Normal>
              A
            <TD>
              <Normal>
              B
          <TR>
            <TD>
              <Normal>
              C
            <TD>
              <Normal>
              D

      How can we extract the Heading 1, Heading 2, and tabular
      data using PDFBox?

      This is a good feature that should be added to PDFBox's
      arsenal.

      Please provide this feature.
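
To make the request concrete, here is a minimal Java sketch of the kind of extraction being asked for, run against an in-memory copy of the Header1/Header2/Table example. It does not use the PDFBox API: `TagTreeSketch`, `Node`, `headings`, `tableRows`, and `sample` are hypothetical names, and the `Document` root element is an assumption added so the tree has a single root. The tag names (`Heading 1`, `Normal`, `Table`, `TBody`, `TR`, `TD`) are taken from the listing in this comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory model of a tag tree; not a PDFBox class.
class TagTreeSketch {

    static final class Node {
        final String tag;       // structure type, e.g. "Heading 1" or "TD"
        final String text;      // text attached directly to this element
        final List<Node> kids;  // child elements in reading order

        Node(String tag, String text, Node... kids) {
            this.tag = tag;
            this.text = text;
            this.kids = Arrays.asList(kids);
        }
    }

    // Concatenates the text of a node and all of its descendants.
    static String textOf(Node n) {
        StringBuilder sb = new StringBuilder(n.text);
        for (Node k : n.kids) sb.append(textOf(k));
        return sb.toString();
    }

    // Collects the text of every element whose tag starts with "Heading".
    static List<String> headings(Node root) {
        List<String> out = new ArrayList<>();
        collectHeadings(root, out);
        return out;
    }

    private static void collectHeadings(Node n, List<String> out) {
        if (n.tag.startsWith("Heading")) out.add(textOf(n));
        for (Node k : n.kids) collectHeadings(k, out);
    }

    // Returns one list of cell texts per TR element found under root.
    static List<List<String>> tableRows(Node root) {
        List<List<String>> rows = new ArrayList<>();
        collectRows(root, rows);
        return rows;
    }

    private static void collectRows(Node n, List<List<String>> rows) {
        if (n.tag.equals("TR")) {
            List<String> cells = new ArrayList<>();
            for (Node k : n.kids) {
                if (k.tag.equals("TD")) cells.add(textOf(k));
            }
            rows.add(cells);
            return;
        }
        for (Node k : n.kids) collectRows(k, rows);
    }

    private static Node td(String s) {
        return new Node("TD", "", new Node("Normal", s));
    }

    // The example document from this comment, under an assumed root.
    static Node sample() {
        return new Node("Document", "",
            new Node("Heading 1", "Header1"),
            new Node("Normal", "This is a para 1 ,it describes about a disease."),
            new Node("Heading 1", "Header2"),
            new Node("Normal", "This is a para2,describes remedies of disease."),
            new Node("Heading 1", "Table"),
            new Node("Table", "",
                new Node("TBody", "",
                    new Node("TR", "", td("A"), td("B")),
                    new Node("TR", "", td("C"), td("D")))));
    }
}
```

With this model, `headings(sample())` yields the three heading texts and `tableRows(sample())` yields the two rows of cells; a real implementation would build the nodes from the PDF structure tree instead of by hand.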

        Attachments

        1. PDFBOX-7_patch_04.txt
          70 kB
          Johannes Koch
        2. PDFMarkedContentExtractor.properties
          3 kB
          Johannes Koch
        3. PDFBOX-7_patch_03.txt
          66 kB
          Johannes Koch
        4. PDFBOX-7_patch_02.txt
          69 kB
          Johannes Koch
        5. PDFBOX-7_patch_01.txt
          49 kB
          Johannes Koch
        6. PDFBOX-7_patch_00.txt
          22 kB
          Johannes Koch

              People

              • Assignee:
                Unassigned
                Reporter:
                Anonymous