  1. Tika
  2. TIKA-861

Parse links in PDF

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: 1.2
    • Component/s: parser

    Description

      Currently the XHTML output for PDF files doesn't contain links, although PDFBox parses them. I'm new to Tika and haven't written Java for 6 years, but someone more experienced could probably add this in a few hours.

      The PDF2XHTML class already loops through each page's annotations.

      See line 136 of PDF2XHTML:

      for (Object o : page.getAnnotations()) {
      

      I found some code for dealing with links in annotations:
      http://stackoverflow.com/questions/7174709/pdfbox-not-recognizing-a-link

      It involves checking the annotation's class with instanceof and casting it to PDAnnotationLink.

      if (annotation instanceof PDAnnotationLink) {
          PDAnnotationLink link = (PDAnnotationLink) annotation;
          // ... read the link's target here and write it to the XHTML output
      }
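
      To make the idea concrete, here is a minimal, self-contained sketch of the same pattern: loop over a page's annotations, keep only the link annotations, and emit an XHTML anchor for each. It is not the attached patch. The Annotation and LinkAnnotation classes and the linksToXhtml and escapeAttribute helpers are hypothetical stand-ins invented for this example, so that it runs without PDFBox on the classpath. In real code the objects would come from page.getAnnotations() and the check would be against PDAnnotationLink; as far as I know the target URI is then usually read from the link's action when that action is a PDActionURI, but treat that as an assumption to verify against the PDFBox version in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LinkAnnotationSketch {

    // Hypothetical stand-in for PDFBox's PDAnnotation (any page annotation).
    static class Annotation { }

    // Hypothetical stand-in for PDFBox's PDAnnotationLink. The uri field is an
    // assumption made for this sketch; it is not the PDFBox API.
    static class LinkAnnotation extends Annotation {
        final String uri;
        LinkAnnotation(String uri) { this.uri = uri; }
    }

    // Escape the characters that are unsafe inside an XHTML attribute value.
    static String escapeAttribute(String value) {
        return value.replace("&", "&amp;")
                    .replace("\"", "&quot;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
    }

    // Loop over the annotations, keep only links, and render each as an anchor.
    static List<String> linksToXhtml(List<?> annotations) {
        List<String> anchors = new ArrayList<String>();
        for (Object o : annotations) {
            if (o instanceof LinkAnnotation) {
                LinkAnnotation link = (LinkAnnotation) o;
                if (link.uri != null) {
                    String href = escapeAttribute(link.uri);
                    anchors.add("<a href=\"" + href + "\">" + href + "</a>");
                }
            }
        }
        return anchors;
    }

    public static void main(String[] args) {
        // One non-link annotation and one link; only the link produces output.
        List<Object> annotations = Arrays.<Object>asList(
                new Annotation(),
                new LinkAnnotation("http://tika.apache.org/"));
        for (String anchor : linksToXhtml(annotations)) {
            System.out.println(anchor);
        }
    }
}
```

      The sketch uses the URI itself as the anchor text because it has nothing better to show; an actual implementation inside PDF2XHTML would more likely write the element through the XHTML content handler than build strings by hand.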
      

      I hope this helps someone.

      Attachments

        1. TIKA-861-test.patch
          0.8 kB
          Ryan Quam
        2. TIKA-861.patch
          2 kB
          Ryan Quam

          People

            Assignee: Unassigned
            Reporter: Sasha Goodman (zoomby)
            Votes: 2
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated: 4h
                Remaining: 4h
                Logged: Not Specified