Tika / TIKA-2760

LinkContentHandler does not report hyperlinks

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.19
    • Fix Version/s: 1.20
    • Component/s: None
    • Labels: None

    Description

      Nutch uses LinkContentHandler to collect hyperlinks, but it does not report any hyperlinks for http://www.ronaldmcdonaldhouse.co.uk/, a copy of which I'll also attach to this ticket.

      Debugging LinkContentHandler by printing element names in startElement reveals that only very few HTML elements are reported, which I think is incorrect.

      Our own parser in Nutch uses a custom ContentHandler and does report many elements, including hyperlinks.
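
      For illustration, the kind of startElement debugging described above can be sketched with a plain SAX handler. This is neither Tika's LinkContentHandler nor Nutch's parser: the class name ElementLogger, its parse helper, and the sample markup are hypothetical, and the sketch assumes a well-formed XHTML snippet fed through the JDK's built-in SAX parser rather than Tika's HTML parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical debugging handler: records every element name passed to
// startElement and the href of every <a> element that has one.
class ElementLogger extends DefaultHandler {
    final List<String> elements = new ArrayList<>();
    final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements.add(localName);
        if ("a".equals(localName)) {
            // Look up the no-namespace "href" attribute by URI and local name.
            String href = atts.getValue("", "href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    // Assumption: the input is well-formed XHTML. Real-world HTML usually is not,
    // and would need an HTML-aware parser in front of the handler.
    static ElementLogger parse(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so that localName is populated
            SAXParser parser = factory.newSAXParser();
            ElementLogger handler = new ElementLogger();
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample markup, not the attached page.
        String sample = "<html><head><title>t</title></head>"
                + "<body><p><a href=\"http://example.com/\">link</a></p></body></html>";
        ElementLogger handler = parse(sample);
        System.out.println("elements: " + handler.elements);
        System.out.println("links:    " + handler.links);
    }
}
```

      Comparing the recorded element list against the markup shows quickly whether elements are being dropped before they reach the handler or whether the handler itself ignores them.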

      Attachments

        1. TIKA-2760 - Test for Outlinks.diff
          28 kB
          Dave Meikle
        2. TIKA-2760.patch
          28 kB
          Markus Jelsma
        3. ronaldmcdonald-nolinks.html
          26 kB
          Markus Jelsma

          People

            Assignee: Unassigned
            Reporter: Markus Jelsma (markus17)