NUTCH-1495

-normalize and -filter for updatedb command in nutch 2.x


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Auto Closed
    • Affects Version/s: 2.2
    • Fix Version/s: 2.5
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

    Description

      AFAICS, in Nutch 1.x you could change your URL filters and normalizers during the crawl and then update the db using crawldb -normalize -filter. There does not seem to be a way to achieve the same in Nutch 2.x?

      Anyway, I went ahead and tried to implement -normalize and -filter for the Nutch 2.x updatedb command. I have no experience with any of the technologies involved, including Java, so please check the attached code carefully before using it. I'm very interested to hear whether this is the right approach, and in any other comments.
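
      To make the intended behaviour concrete, here is a rough, self-contained sketch of what -normalize and -filter would do to each URL during an update: normalize first, then filter, and drop any URL for which either step returns null. This is not the attached patch and it does not use the Nutch API. The class UpdateDbSketch, the methods normalize, filter and normalizeAndFilter, and the rules themselves (lower-casing scheme and host, dropping fragments, rejecting non-HTTP and image URLs) are all assumptions made up for illustration. The only thing borrowed from Nutch is the convention that a filter rejects a URL by returning null.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class UpdateDbSketch {

    // Hypothetical normalizer: lower-cases scheme and host and drops the
    // fragment. User-info is ignored. Returns null for unparsable URLs.
    static String normalize(String url) {
        try {
            URI u = new URI(url.trim());
            if (u.getScheme() == null || u.getHost() == null) {
                return null;
            }
            StringBuilder sb = new StringBuilder();
            sb.append(u.getScheme().toLowerCase(Locale.ROOT)).append("://");
            sb.append(u.getHost().toLowerCase(Locale.ROOT));
            if (u.getPort() != -1) {
                sb.append(':').append(u.getPort());
            }
            sb.append(u.getRawPath());
            if (u.getRawQuery() != null) {
                sb.append('?').append(u.getRawQuery());
            }
            return sb.toString();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Hypothetical filter: keeps only http(s) URLs that are not images.
    // Returns the URL when accepted and null when rejected.
    static String filter(String url) {
        if (url == null) {
            return null;
        }
        boolean web = url.startsWith("http://") || url.startsWith("https://");
        boolean image = url.endsWith(".jpg") || url.endsWith(".png");
        return (web && !image) ? url : null;
    }

    // Applies the optional steps in order and keeps only surviving URLs.
    static List<String> normalizeAndFilter(List<String> urls,
                                           boolean doNormalize,
                                           boolean doFilter) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String u = url;
            if (doNormalize) {
                u = normalize(u);
            }
            if (u != null && doFilter) {
                u = filter(u);
            }
            if (u != null) {
                kept.add(u);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "HTTP://Example.COM/Page#top",
                "http://example.com/logo.png",
                "ftp://example.com/file");
        // prints [http://example.com/Page]
        System.out.println(normalizeAndFilter(urls, true, true));
    }
}
```

      With both flags off, the list passes through unchanged, which would correspond to the current 2.x updatedb behaviour as I understand it.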

          People

            Assignee: Unassigned
            Reporter: Nathan Gass (xabbu42)
            Votes: 1
            Watchers: 4
