  1. Nutch
  2. NUTCH-100

New plugin urlfilter-db


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Won't Fix
    • Affects Version/s: 0.8
    • Fix Version/s: None
    • Component/s: fetcher
    • Labels: None
    • Environment: All Nutch versions

    Description

      Hi,

      I have written a new plugin, urlfilter-db, based on the URLFilter interface.

      Its purpose is to filter by domain: I would like to crawl the whole web but fetch only from certain domains.

      The plugin uses a cache (SwarmCache, which is easier to deploy than JCS) backed by a database.

      for each URL:
          call filter(url)

      filter(url):
          extract the domain name from the URL
          look the domain up in the cache
          if it is cached, return the URL
          otherwise query the database
          if the domain is in the database, cache it and return the URL
          otherwise return null
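
      The lookup order above can be sketched in plain Java. This is only an illustration, not the code from the attached patch: the class name DomainAllowListFilter is hypothetical, an access-ordered LinkedHashMap stands in for SwarmCache, a Predicate stands in for the database query, and the URL's host is assumed to be the "domain name".

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch: a small LRU cache in front of a slower domain lookup. */
class DomainAllowListFilter {
    private final Map<String, Boolean> cache;
    private final Predicate<String> backingStore; // stands in for the database query

    DomainAllowListFilter(final int cacheSize, Predicate<String> backingStore) {
        this.backingStore = backingStore;
        // Access-ordered map that evicts the least recently used entry when full.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** Returns the URL unchanged if its host is allowed, otherwise null. */
    synchronized String filter(String url) {
        String host = hostOf(url);
        if (host == null) {
            return null;                      // unparsable URL or no host: reject
        }
        if (cache.get(host) != null) {
            return url;                       // cache hit
        }
        if (backingStore.test(host)) {        // cache miss: ask the backing store
            cache.put(host, Boolean.TRUE);
            return url;
        }
        return null;                          // unknown domain: reject
    }

    /** Assumption: the lower-cased host part of the URL is the domain key. */
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

      Note that, as in the pseudocode, only allowed domains are cached, so every URL from a rejected domain goes to the backing store again.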

      The plugin reads the cache size, JDBC driver, connection string, table name, and domain field from nutch-site.xml.
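
      A rough sketch of how those five settings might be gathered and turned into a lookup query follows. Everything named here is an assumption for illustration: the property keys (urlfilter.db.*), the default cache size, the DbFilterSettings class, and the use of java.util.Properties in place of Nutch's own configuration object. The actual keys are defined in the attached patch.

```java
import java.util.Properties;

/** Hypothetical settings holder; the property keys are illustrative only. */
class DbFilterSettings {
    final int cacheSize;
    final String jdbcDriver;
    final String connectionString;
    final String table;
    final String domainField;

    DbFilterSettings(Properties p) {
        // Assumed default of 1000 entries when no cache size is configured.
        cacheSize = Integer.parseInt(p.getProperty("urlfilter.db.cache.size", "1000"));
        jdbcDriver = required(p, "urlfilter.db.jdbc.driver");
        connectionString = required(p, "urlfilter.db.connection");
        table = identifier(required(p, "urlfilter.db.table"));
        domainField = identifier(required(p, "urlfilter.db.domain.field"));
    }

    /** Parameterized query: the domain is bound at run time, never concatenated. */
    String lookupSql() {
        return "SELECT 1 FROM " + table + " WHERE " + domainField + " = ?";
    }

    private static String required(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing setting: " + key);
        }
        return value.trim();
    }

    /** Table and column names cannot be bound as parameters, so restrict them. */
    private static String identifier(String name) {
        if (!name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Not a plain SQL identifier: " + name);
        }
        return name;
    }
}
```

      The string from lookupSql() is meant for a java.sql.PreparedStatement, with the domain bound as the single parameter.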

      Attachments

        1. AddedDbURLFilter.patch
          14 kB
          Gal Nitzan
        2. urlfilter-db.tar.gz
          1.35 MB
          Gal Nitzan
        3. urlfilter-db.tar.gz
          1.73 MB
          Gal Nitzan

          People

            Assignee: Unassigned
            Reporter: Gal Nitzan (gnitzan)
            Votes: 2
            Watchers: 0
