TIKA-1835

LinkContentHandler skips iframe and rel tags

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.12
    • Component/s: core
    • Labels:
      None
    • Flags:
      Patch, Important

      Description

      As simple as it gets: link and iframe elements were never handled in LinkContentHandler, and NUTCH-1233 more or less requires that support.

      1. TIKA-1835.patch
        5 kB
        Markus Jelsma

          Activity

          markus17 Markus Jelsma added a comment -

          Patch for trunk. Adds support for iframe and link element link extraction. Tests included.

          kkrugler Ken Krugler added a comment - edited

          Git commit 489ab93..fe841bc. Thanks Markus Jelsma!

          markus17 Markus Jelsma added a comment -

          Thanks Ken!

          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #909 (See https://builds.apache.org/job/tika-trunk-jdk1.7/909/)
          Record change for TIKA-1835. (mattmann: rev 542bebc69711f6cf25ad338affe87f163a1eda61)

          • CHANGES.txt
          kkrugler Ken Krugler added a comment -

          I’d rolled in Markus’s patch directly to support these other link types, but I wish I’d remembered the old TIKA-503 discussion. It would have been better to make that support conditional on using a different constructor, since it’s usually not a good idea to surprise consumers of parse output with new types of data (links).

          Markus Jelsma - would it be OK to make the above "extra links" support conditional on a new constructor? Or do you think it doesn't matter? And what about adding the script element to that set?
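
The constructor-gated approach proposed above can be sketched roughly as follows. This is only an illustration of the idea and not Tika's actual LinkContentHandler: the class name `LinkCollectorSketch`, the `includeExtraLinks` flag, and the `extract` helper are all hypothetical, and the sketch uses the JDK's SAX parser on well-formed XHTML rather than Tika's HTML parser. The default constructor reports only `a` and `img` links, while the opt-in constructor also reports `iframe`, `link`, and (as floated in the question above) `script`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration only; NOT Tika's LinkContentHandler.
class LinkCollectorSketch extends DefaultHandler {
    // When false, only <a href> and <img src> are reported (assumed "old" behaviour).
    private final boolean includeExtraLinks;
    private final List<String> links = new ArrayList<>();

    // Default constructor: existing consumers see no new link types.
    LinkCollectorSketch() {
        this(false);
    }

    // Opt-in constructor: also report iframe, link and script elements.
    LinkCollectorSketch(boolean includeExtraLinks) {
        this.includeExtraLinks = includeExtraLinks;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Pick the attribute that carries the link target for this element, if any.
        String attr;
        switch (qName) {
            case "a":
                attr = "href";
                break;
            case "img":
                attr = "src";
                break;
            case "link":
                attr = includeExtraLinks ? "href" : null;
                break;
            case "iframe":
            case "script":
                attr = includeExtraLinks ? "src" : null;
                break;
            default:
                attr = null;
        }
        if (attr != null) {
            String target = atts.getValue(attr);
            if (target != null) {
                // Record "element:target" so callers can tell link types apart.
                links.add(qName + ":" + target);
            }
        }
    }

    List<String> getLinks() {
        return links;
    }

    // Convenience helper: parse a well-formed XHTML string and return the collected links.
    static List<String> extract(String xhtml, boolean includeExtraLinks) throws Exception {
        LinkCollectorSketch handler = new LinkCollectorSketch(includeExtraLinks);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getLinks();
    }
}
```

With this shape, a caller that never asked for the extra link types keeps receiving exactly the set it received before, and a crawler such as Nutch could pass `true` to get the additional elements. Whether that gating is worth the extra constructor is the open question in this thread.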

          markus17 Markus Jelsma added a comment -

          Hello Ken - I agree, script src is indeed missing and that's a mistake. We should open a new issue and add script src to it. An additional constructor should not be necessary.

          naegelejd Joseph Naegele added a comment -

          I opened TIKA-1937 and attached a patch if either of you get a chance to take a look (also check out TIKA-1938 if you have time).


            People

            • Assignee:
              kkrugler Ken Krugler
              Reporter:
              markus17 Markus Jelsma
            • Votes:
              1
              Watchers:
              4
