Tika / TIKA-1937

LinkContentHandler skips script tags

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.12
    • Fix Version/s: 2.0, 1.13
    • Component/s: core
    • Labels: None
    • Flags: Patch, Important

      Description

      As with the tags in TIKA-1835, <script> tags are not collected by LinkContentHandler. Unlike the other tags, a <script> tag that does not contain a "src" attribute is not a link.
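
The rule described above can be illustrated with a small standalone SAX handler. This is only a sketch under stated assumptions, not the code from the attached patch: it uses the JDK's SAX parser rather than Tika, the names ScriptLinkSketch, ScriptSrcHandler and collectScriptLinks are hypothetical, and treating an empty "src" value as "not a link" is an assumption made here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ScriptLinkSketch {

    /** Hypothetical handler: records a <script> element only when it carries a src attribute. */
    static class ScriptSrcHandler extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("script".equalsIgnoreCase(qName)) {
                String src = atts.getValue("src");
                // Assumption: a missing or empty src means the script is embedded, not a link.
                if (src != null && !src.isEmpty()) {
                    links.add(src);
                }
            }
        }
    }

    /** Parses well-formed XHTML and returns the src values of external scripts. */
    static List<String> collectScriptLinks(String xhtml) throws Exception {
        ScriptSrcHandler handler = new ScriptSrcHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.links;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head>"
                + "<script src=\"app.js\"></script>"
                + "<script>var x = 1;</script>"
                + "</head></html>";
        System.out.println(collectScriptLinks(xhtml)); // prints [app.js]
    }
}
```

Only the first script element is reported, because the second one is embedded and has no "src" attribute to point anywhere.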

      1. TIKA-1937.patch
        5 kB
        Joseph Naegele

          Activity

          naegelejd Joseph Naegele added a comment -

          Patch for master branch. Adds support for script link extraction, tests included.

          Note the changes needed to verify that the script tag contains a "src=..." attribute; otherwise the script is embedded and is not a link.

          naegelejd Joseph Naegele added a comment -

          Is it still possible to address this for the 1.13 release? Also, should I submit a PR on GitHub, or is attaching a patch here sufficient? Either way works for me.

          gagravarr Nick Burch added a comment -

          Both patches attached to JIRA issues and GitHub pull requests (referencing the JIRA ID) work fine for us; please do whatever is easiest for you!

          tallison@mitre.org Tim Allison added a comment -

          Thank you, Joseph Naegele.

          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-2.x #83 (See https://builds.apache.org/job/tika-2.x/83/)
          TIKA-1937 LinkContentHandler wasn't extracting links from script tags (tallison: rev 3f7e3412a58e897178536f5595632cd2ad877f9c)

          • tika-core/src/main/java/org/apache/tika/sax/LinkBuilder.java
          • tika-core/src/main/java/org/apache/tika/sax/LinkContentHandler.java
          • tika-core/src/test/java/org/apache/tika/sax/LinkContentHandlerTest.java
          • tika-core/src/main/java/org/apache/tika/sax/Link.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #955 (See https://builds.apache.org/job/tika-trunk-jdk1.7/955/)
          TIKA-1937 LinkContentHandler wasn't extracting links from script tags (tallison: rev 1c5e96cf617b231ce8d902fc86eca84edb9cafe7)

          • tika-core/src/test/java/org/apache/tika/sax/LinkContentHandlerTest.java
          • tika-core/src/main/java/org/apache/tika/sax/Link.java
          • tika-core/src/main/java/org/apache/tika/sax/LinkContentHandler.java
          • CHANGES.txt
          • tika-core/src/main/java/org/apache/tika/sax/LinkBuilder.java

            People

            • Assignee: Unassigned
            • Reporter: naegelejd Joseph Naegele
            • Votes: 0
            • Watchers: 4
