Nutch / NUTCH-1233

Rely on Tika for outlink extraction

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.12
    • Component/s: parser
    • Labels: None
    • Patch Info: Patch Available

    Description

      Tika provides outlink extraction features that Nutch does not currently use. To use them in Nutch, Tika needs to return the rel attribute value of each link, which it currently does not. There is a patch for Tika 1.1. Once that patch is included in Tika and Nutch is upgraded to that version, this issue can be worked on. The attached preliminary code performs both Tika-based and the current outlink extraction. It also includes parts of the Boilerpipe code.
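
The following is a minimal sketch of why the rel value matters for outlink extraction, presumably so that links marked nofollow can be skipped. It is not the attached patch and it does not call Tika or Nutch. The ExtractedLink class, the OutlinkFilter class, and the toOutlinks method are hypothetical names introduced only for illustration, standing in for whatever a parser reports per link (target URI, anchor text, rel value).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class OutlinkFilter {

    /** Hypothetical stand-in for one link reported by a parser. */
    static final class ExtractedLink {
        final String uri;   // link target
        final String text;  // anchor text
        final String rel;   // raw rel attribute value, may be null

        ExtractedLink(String uri, String text, String rel) {
            this.uri = uri;
            this.text = text;
            this.rel = rel;
        }
    }

    /** True if the space-separated rel value contains the token "nofollow". */
    static boolean isNoFollow(String rel) {
        if (rel == null) {
            return false;
        }
        for (String token : rel.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.equals("nofollow")) {
                return true;
            }
        }
        return false;
    }

    /** Keeps the URIs of links that have a target and are not marked nofollow. */
    static List<String> toOutlinks(List<ExtractedLink> links) {
        List<String> outlinks = new ArrayList<>();
        for (ExtractedLink link : links) {
            if (link.uri == null || link.uri.isEmpty()) {
                continue; // nothing to follow
            }
            if (isNoFollow(link.rel)) {
                continue; // only possible when the parser exposes rel
            }
            outlinks.add(link.uri);
        }
        return outlinks;
    }

    public static void main(String[] args) {
        List<ExtractedLink> links = Arrays.asList(
            new ExtractedLink("http://example.org/a", "A", null),
            new ExtractedLink("http://example.org/b", "B", "external NoFollow"),
            new ExtractedLink("http://example.org/c", "C", "author"));
        System.out.println(toOutlinks(links));
    }
}
```

Without the rel value, the filter has no way to tell the second link apart from the others, which is the gap the description refers to.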

      Attachments

        1. pre-1233-2.txt
          19 kB
          Markus Jelsma
        2. pre-1233.txt
          10 kB
          Markus Jelsma
        3. post-1233-2.txt
          18 kB
          Markus Jelsma
        4. post-1233.txt
          10 kB
          Markus Jelsma
        5. NUTCH-1233-1.6-2.patch
          6 kB
          Markus Jelsma
        6. NUTCH-1233-1.6-1.patch
          6 kB
          Markus Jelsma
        7. NUTCH-1233-1.5-wip.patch
          7 kB
          Markus Jelsma
        8. NUTCH-1233.patch
          6 kB
          Markus Jelsma
        9. NUTCH-1233.patch
          6 kB
          Markus Jelsma

          People

            Assignee: Markus Jelsma (markus17)
            Reporter: Markus Jelsma (markus17)
            Votes: 1
            Watchers: 8