Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: 1.8
    • Component/s: fetcher
    • Labels:
      None
    • Environment:

      All

    • Patch Info:
      Patch Available

      Description

      Adds a Nutch extension point for a fetch filter. The fetch filter allows filtering content and parse data/text after it is fetched but before it is written to segments. The filter can return true if content is to be written or false if it is not.

      Some use cases for this filter would be topical search engines that only want to fetch/index certain types of content, for example a news-only or sports-only search engine. In these situations the only way to determine whether content belongs to a particular set is to fetch the page and then analyze the content. If the content passes, meaning it belongs to the set of, say, sports pages, then we want to include it. If it doesn't, then we want to ignore it, never fetch that same page in the future, and ignore any urls on that page. If content is rejected by a fetch filter then its status is written to the CrawlDb as gone, and its content is ignored and not written to segments. This effectively stops crawling along the crawl path of that page and the urls from that page. An example filter, fetch-safe, is provided that allows fetching only content that does not contain words from a configured list of bad words.
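
      As a rough illustration of the extension point described above (this is a sketch only; the interface name, package, and method signature in the attached patches may differ):

      package org.apache.nutch.fetcher;

      import org.apache.hadoop.conf.Configurable;
      import org.apache.nutch.parse.Parse;
      import org.apache.nutch.plugin.Pluggable;
      import org.apache.nutch.protocol.Content;

      /** Hypothetical fetch filter extension point, modeled on the other Nutch filter interfaces. */
      public interface FetchFilter extends Pluggable, Configurable {

        /** Extension point ID, following the convention of the other Nutch filters. */
        String X_POINT_ID = FetchFilter.class.getName();

        /**
         * Inspects fetched content (and its parse, when available) before it is
         * written to segments.
         *
         * @return true if the page should be written, false if it should be
         *         rejected (marked gone in CrawlDb, content and outlinks dropped)
         */
        boolean accept(String url, Content content, Parse parse);
      }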

      1. NUTCH-828-1-20100608.patch
        23 kB
        Dennis Kubes
      2. NUTCH-828-2-20100608.patch
        23 kB
        Dennis Kubes
      3. NUTCH-828v3.patch
        79 kB
        Lewis John McGibbney

        Activity

        Julien Nioche added a comment -

        A better approach is to operate within the parsing step, as explained by Andrzej. You can already remove the outlinks from a page in an HTMLParseFilter and change the status of a page. Moreover, there has been little interest in this issue over the last few years.
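
        As a rough sketch of that existing route (the class name and topic rule below are illustrative; only the HtmlParseFilter extension point and parse classes come from Nutch), a plugin could drop the outlinks of pages that fail a content check:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.nutch.parse.HTMLMetaTags;
        import org.apache.nutch.parse.HtmlParseFilter;
        import org.apache.nutch.parse.Outlink;
        import org.apache.nutch.parse.Parse;
        import org.apache.nutch.parse.ParseData;
        import org.apache.nutch.parse.ParseImpl;
        import org.apache.nutch.parse.ParseResult;
        import org.apache.nutch.protocol.Content;
        import org.w3c.dom.DocumentFragment;

        /** Illustrative HtmlParseFilter that strips outlinks from off-topic pages. */
        public class TopicGateParseFilter implements HtmlParseFilter {

          private Configuration conf;

          public ParseResult filter(Content content, ParseResult parseResult,
              HTMLMetaTags metaTags, DocumentFragment doc) {
            Parse parse = parseResult.get(content.getUrl());
            if (parse == null || isOnTopic(parse.getText())) {
              return parseResult;                         // keep the page as-is
            }
            ParseData old = parse.getData();
            // Re-emit the page with an empty outlink array so nothing beyond it is queued.
            ParseData trimmed = new ParseData(old.getStatus(), old.getTitle(),
                new Outlink[0], old.getContentMeta(), old.getParseMeta());
            return ParseResult.createParseResult(content.getUrl(),
                new ParseImpl(parse.getText(), trimmed));
          }

          private boolean isOnTopic(String text) {
            // Illustrative stand-in; a real filter would use a configurable classifier.
            return text != null && text.toLowerCase().contains("sports");
          }

          public void setConf(Configuration conf) { this.conf = conf; }

          public Configuration getConf() { return conf; }
        }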

        Lewis John McGibbney added a comment -

        Merely updates Dennis' original patch for 1.8-SNAPSHOT.
        As there is some interest on the mailing list, maybe some others can try this out and we can make a decision on what happens with the issue?
        Thanks

        Markus Jelsma added a comment -

        20120304-push-1.6

        Zain Us Sami Ahmed added a comment -

        I have implemented this filter to filter out pages not containing the Urdu language, but this filter kills the seed and hence I am not able to crawl the whole web.

        Chris A. Mattmann added a comment -

        bumpity to 1.2 since 1.1 is out the door

        Dennis Kubes added a comment -

        Nice. I didn't realize the signature update would do that. I am assuming that since ParseUtil doesn't interact with the CrawlDatum, we are going to have to call the FetchFilters (I am OK with renaming this, btw) twice, once in the Fetcher and once in ParseSegment, each dealing with its respective CrawlDatum needs?

        Andrzej Bialecki added a comment -

        First, as you point out, we cannot ignore the page because the problem will repeat itself as we keep re-discovering it, so we have to "poison" it with GONE - and I think it's ok to add another status here to express that we never ever want to collect this page, because GONE gets reset periodically.

        If we run Fetcher in parsing mode then we can change this status immediately, so no problem here. If we run ParseSegment then we can also update this status in a similar way as we implement the signature update, i.e. in ParseOutputFormat emit a <pageUrl,CrawlDatum> that will switch the status of this page when collected later on in CrawlDbReducer.
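
        A minimal sketch of that second route (the helper below is illustrative, not part of ParseOutputFormat; only CrawlDatum and its status constant come from Nutch, and a new dedicated status could replace STATUS_FETCH_GONE):

        import java.io.IOException;

        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.nutch.crawl.CrawlDatum;

        /** Illustrative helper: emit a "poisoned" CrawlDatum for a rejected page into
         *  crawl_parse so that CrawlDbReducer marks the URL gone on the next updatedb. */
        public final class RejectedPageEmitter {

          private RejectedPageEmitter() {}

          public static void emitGone(SequenceFile.Writer crawlOut, String pageUrl)
              throws IOException {
            CrawlDatum poison = new CrawlDatum();
            // A new, dedicated status could be used here instead, as suggested above.
            poison.setStatus(CrawlDatum.STATUS_FETCH_GONE);
            crawlOut.append(new Text(pageUrl), poison);
          }
        }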

        Dennis Kubes added a comment -

        I agree about wanting the decision not just in the Fetcher while parsing but also in ParseSegment. Here is the problem as I see it with returning null content.

        Say we want to create a topical search engine about sports. We fetch pages and run them through a fetch filter for a yes/no decision, based on content, on whether each page belongs to the sports set. If we null out Content, and from that ParseText and ParseData, we still have the CrawlDatum to deal with. If we leave it as is, the CrawlDatum will get updated into CrawlDb as successfully fetched. Content and Parse won't get collected because they are null. We won't have the problem of Outlinks on that page getting queued in CrawlDb, but the original URL will still be there and will be queued after an interval for repeated crawling. Over time what we have is a large number of URLs that we know to be filtered being repeatedly crawled.

        The decision point isn't just whether to keep the content. It is whether we should keep the URL and its content/parse and continue crawling down the path of the URL's outlinks, or whether we should ignore this URL and not crawl anything it points to, breaking the crawl graph at this point. Hence FetchFilter. My solution to this was to null out content/parse and add a different CrawlDatum that essentially said the page was gone. Ideally we should have a separate status, but gone worked as a first pass. This gets updated back into CrawlDb and won't get recrawled at a later date. This was only possible in the Fetcher, though.
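
        An illustrative sketch of that effect (not the patch itself; only the CrawlDatum status constant comes from Nutch, the class and method names are made up):

        import org.apache.nutch.crawl.CrawlDatum;
        import org.apache.nutch.parse.ParseResult;
        import org.apache.nutch.protocol.Content;

        /** Illustrative outcome of a fetch-filter decision for a single page. */
        final class FilteredFetchOutput {

          final CrawlDatum datum;     // always written back to CrawlDb
          final Content content;      // null when the page is rejected
          final ParseResult parse;    // null when the page is rejected

          private FilteredFetchOutput(CrawlDatum d, Content c, ParseResult p) {
            this.datum = d;
            this.content = c;
            this.parse = p;
          }

          static FilteredFetchOutput decide(boolean accepted, CrawlDatum datum,
              Content content, ParseResult parse) {
            if (accepted) {
              return new FilteredFetchOutput(datum, content, parse);   // normal path
            }
            // Rejected: keep the URL's datum but mark it gone so updatedb records it
            // and the URL is not recrawled; drop content/parse so no outlinks are queued.
            datum.setStatus(CrawlDatum.STATUS_FETCH_GONE);
            return new FilteredFetchOutput(datum, null, null);
          }
        }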

        Thoughts on how we might approach this?

        Andrzej Bialecki added a comment -

        I generally like the idea of a decision point, but I think the place where this decision is taken in this patch (the Fetcher) is not right. Since you rely on the presence of ParseResult (understandably so), it seems to me that a much better place to run the filters would be inside ParseUtil.parse(content), and you could return null (or a special ParseResult) to indicate that the content is to be discarded.

        This way you can both run this filtering as a part of a Fetcher in parsing mode, and as a part of ParseSegment, without duplicating the same logic. Consequently, I propose to change the name from FetchFilter to ParseFilter.
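
        A minimal sketch of that idea, as a wrapper around the existing ParseUtil.parse(Content) call (the ContentAcceptor interface and class names below are hypothetical):

        import org.apache.nutch.parse.ParseException;
        import org.apache.nutch.parse.ParseResult;
        import org.apache.nutch.parse.ParseUtil;
        import org.apache.nutch.protocol.Content;

        /** Hypothetical wrapper illustrating "run the filters inside the parsing step". */
        public class FilteringParseUtil {

          /** Hypothetical stand-in for the proposed parse-time filter chain. */
          public interface ContentAcceptor {
            boolean accept(Content content, ParseResult result);
          }

          private final ParseUtil parseUtil;
          private final ContentAcceptor acceptor;

          public FilteringParseUtil(ParseUtil parseUtil, ContentAcceptor acceptor) {
            this.parseUtil = parseUtil;
            this.acceptor = acceptor;
          }

          /** Returns null when the filters reject the page; callers would treat null
           *  (or a special ParseResult) as "discard and mark the page gone". */
          public ParseResult parse(Content content) throws ParseException {
            ParseResult result = parseUtil.parse(content);
            if (result == null || !acceptor.accept(content, result)) {
              return null;
            }
            return result;
          }
        }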

        Dennis Kubes added a comment -

        Forgot to add the nutch-default.xml changes to the old patch. Here is a new one.

        Dennis Kubes added a comment -

        Yeah, I am not proposing to get this into 1.1. Oh wait, I did with the affects selection. No, this can/should wait until after the 1.1 release. Anybody who wants it before then can apply the patch.

        Julien Nioche added a comment -

        Shall we postpone this until after the release of 1.1? This is new functionality, and at this stage we probably just want to iron out bugs in what we currently have. Makes sense?


          People

          • Assignee:
            Dennis Kubes
            Reporter:
            Dennis Kubes
          • Votes:
            1
            Watchers:
            3
