Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: 1.2
    • Component/s: parser
    • Labels:

      Description

      Currently the XHTML output doesn't contain links, although PDFBox parses them. I'm new to Tika and haven't written Java for 6 years, but someone more experienced could probably do this in a few hours.

      The PDF2XHTML class loops through the annotations of each page.

      See:

      136: for(Object o : page.getAnnotations()) {
      

      I found some code for dealing with links in annotations:
      http://stackoverflow.com/questions/7174709/pdfbox-not-recognizing-a-link

      It involves checking the class.

      if (annotation instanceof PDAnnotationLink) {
          PDAnnotationLink link = (PDAnnotationLink) annotation;
          // ... read the link's action or destination here
      }
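
The loop-and-check idea above can be sketched without PDFBox on the classpath. This is only an illustration: `Annotation` and `LinkAnnotation` are hypothetical stand-ins for the PDFBox annotation types, and the `uri` field is an assumption standing in for whatever the real link annotation exposes through its action.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the PDFBox annotation types, so the sketch runs on its own.
class AnnotationLinkSketch {
    static class Annotation { }

    static class LinkAnnotation extends Annotation {
        final String uri; // assumed: the target the link points to, may be null
        LinkAnnotation(String uri) { this.uri = uri; }
    }

    // Mirrors the loop from the issue: walk the untyped annotation list
    // and keep only the entries that are links with a usable target.
    static List<String> linkTargets(List<?> annotations) {
        List<String> targets = new ArrayList<>();
        for (Object o : annotations) {
            if (o instanceof LinkAnnotation) {
                LinkAnnotation link = (LinkAnnotation) o;
                if (link.uri != null) {
                    targets.add(link.uri);
                }
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        List<Object> annotations = new ArrayList<>();
        annotations.add(new Annotation());
        annotations.add(new LinkAnnotation("http://tika.apache.org/"));
        System.out.println(linkTargets(annotations)); // prints [http://tika.apache.org/]
    }
}
```

The null check is there because not every link annotation necessarily carries a URI; some may point to a destination inside the document instead.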
      

      I hope this helps someone.

      1. TIKA-861.patch
        2 kB
        Ryan Quam
      2. TIKA-861-test.patch
        0.8 kB
        Ryan Quam

        Activity

        Nick Burch made changes -
        Status: Open [ 1 ] -> Resolved [ 5 ]
        Resolution: Fixed [ 1 ]
        Nick Burch added a comment -

        Thanks, patches committed in r1331434.

        One thing to note is that, for now, links are extracted at the end of the page. Further work may be wanted in future to match them to the text they apply to.
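
To make "at the end of the page" concrete, here is a rough sketch of the output ordering: the page text is written first, and the link targets follow it. The class name, the `renderPage` helper, and the exact markup shape are assumptions for illustration, not the markup the committed patch produces.

```java
import java.util.Arrays;
import java.util.List;

class EndOfPageLinksSketch {
    // Escape the characters that are unsafe in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Assumed markup shape: one div per page, text first, then one <a> per link.
    static String renderPage(String pageText, List<String> linkTargets) {
        StringBuilder out = new StringBuilder();
        out.append("<div class=\"page\">");
        out.append("<p>").append(escape(pageText)).append("</p>");
        for (String target : linkTargets) {
            // Links are appended after the text, not wrapped around the words they cover.
            out.append("<a href=\"").append(escape(target)).append("\"/>");
        }
        out.append("</div>");
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(renderPage("See the Tika site.",
                Arrays.asList("http://tika.apache.org/")));
    }
}
```

Matching each link to the text it applies to would need the annotation's rectangle and the text positions, which this sketch leaves out.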

        Ryan Quam made changes -
        Attachment TIKA-861-test.patch [ 12523975 ]
        Ryan Quam added a comment -

        Here is a simple unit test for the PDF link parsing.

        Nick Burch added a comment -

        testPDFVarious.pdf in /tika-parsers/src/test/resources/test-documents/ contains a hyperlink on page one, so it would be a good file to use for a unit test.

        Is anyone able to work up a unit test for link parsing to go with this patch? (PDFParserTest already has some XHTML-based tests, which could be used as a pattern.)

        Ryan Quam made changes -
        Attachment TIKA-861.patch [ 12523878 ]
        Ryan Quam added a comment -

        Patch that adds PDF links to the DOM.

        Chris A. Mattmann made changes -
        Fix Version/s: 1.1 [ 12318849 ] -> 1.2 [ 12320169 ]
        Chris A. Mattmann added a comment -
        • push out to 1.2
        Sasha Goodman created issue -

          People

          • Assignee: Unassigned
          • Reporter: Sasha Goodman
          • Votes: 2
          • Watchers: 2

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated: 4h
              Remaining: 4h
              Logged: Not Specified
