Nutch / NUTCH-628

Host database to keep track of host-level information

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: fetcher, generator
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Nutch would benefit from having a DB with per-host/domain/TLD information. For instance, Nutch could detect hosts that are timing out, store information about that in this DB. Segment/fetchlist Generator could then skip such hosts, so they don't slow down the fetch job. Another good use for such a DB is keeping track of various host scores, e.g. spam score.

      From the recent thread on nutch-user@lucene:

      Otis asked:
      > While we are at it, how would one go about implementing this DB, as far as its structures go?

      Andrzej said:
      The easiest I can imagine is to use something like <Text, MapWritable>.
      This way you could store arbitrary information under arbitrary keys.
      I.e. a single database then could keep track of aggregate statistics at
      different levels, e.g. TLD, domain, host, ip range, etc. The basic set
      of statistics could consist of a few predefined gauges, totals and averages.
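      A minimal standalone sketch of the <Text, MapWritable> idea above, using a plain Java Map in place of Hadoop's Text/MapWritable so it runs without Nutch or Hadoop on the classpath; the class name and counter keys here are hypothetical, not part of Nutch:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: arbitrary named counters stored under arbitrary
// string keys, so one database can track host, domain and TLD statistics.
class HostStatsSketch {

    // key (host, domain, or TLD) -> map of named counters
    private final Map<String, Map<String, Long>> db = new HashMap<>();

    /** Increment a named counter (e.g. "fetch.timeout") under an arbitrary key. */
    public void increment(String key, String counter, long delta) {
        db.computeIfAbsent(key, k -> new HashMap<>())
          .merge(counter, delta, Long::sum);
    }

    public long get(String key, String counter) {
        return db.getOrDefault(key, Map.of()).getOrDefault(counter, 0L);
    }

    /** Record one fetch event at host, domain and TLD granularity at once. */
    public void recordFetch(String host, String counter) {
        increment(host, counter, 1);
        String[] parts = host.split("\\.");
        int n = parts.length;
        if (n >= 2) increment(parts[n - 2] + "." + parts[n - 1], counter, 1);
        increment(parts[n - 1], counter, 1);
    }

    public static void main(String[] args) {
        HostStatsSketch stats = new HostStatsSketch();
        stats.recordFetch("www.example.org", "fetch.timeout");
        stats.recordFetch("mail.example.org", "fetch.timeout");
        System.out.println(stats.get("example.org", "fetch.timeout")); // 2
        System.out.println(stats.get("org", "fetch.timeout"));         // 2
    }
}
```

      Because the value side is an open-ended map of counters, new gauges or totals can be added later without changing the key schema.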

      1. NUTCH-628-DomainStatistics.patch
        7 kB
        Otis Gospodnetic
      2. NUTCH-628-HostDb.patch
        17 kB
        Otis Gospodnetic
      3. domain_statistics_v2.patch
        7 kB
        Doğacan Güney


          Activity

          Ferdy Galema added a comment -

          This one should be closed as it is already implemented by various related issues. Please re-open if you do not agree.

          Lewis John McGibbney added a comment -

          Hi Markus, can you confirm if this has been completely integrated and that we have all the functionality from this issue? Thanks

          Markus Jelsma added a comment -

          Yes, I think this one can be resolved. A command should be added to bin/nutch:
          https://issues.apache.org/jira/browse/NUTCH-1049

          Lewis John McGibbney added a comment - edited

          From previous discussion on this ticket I think there is evidence that this class has some useful credentials. The problem is that the issue is still open and that there is no entry for this in the current Nutch 1.3 /bin/nutch script. Is it worthwhile providing a patch for this?

          Alex McLintock added a comment -

          As part of this fix, can someone check that the documentation is up to date too?

          I've added a page to our wiki based upon the example above (and yes - it seems to work with a very recent 1.1RC)

          http://wiki.apache.org/nutch/DomainStatistics

          I'm not sure I understand the parameters though, e.g.

          nutch@reynolds:/nutch/search$ bin/nutch org.apache.nutch.util.domain.DomainStatistics crawl/crawldb ds-host host 2
          Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://rio46:9000/user/nutch/crawl/crawldb/current/data
          at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
          at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:51)
          at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
          at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
          at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
          at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
          at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
          at org.apache.nutch.util.domain.DomainStatistics.run(DomainStatistics.java:113)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
          at org.apache.nutch.util.domain.DomainStatistics.main(DomainStatistics.java:204)
          nutch@reynolds:/nutch/search$

          But this worked.

          nutch@reynolds:/nutch/search$ bin/nutch org.apache.nutch.util.domain.DomainStatistics hdfs://rio46:9000/user/nutch/crawl/crawldb/current/ ds-host host 2

          Chris A. Mattmann added a comment -

          pushing this out per http://bit.ly/c7tBv9

          Doğacan Güney added a comment -

          This tool can read crawl_fetch and other directories as well, and that is the problem. If you are reading crawl_fetch, the MapFile parts are right under it, but for crawldb the MapFile parts are under crawldb/current. I guess we can add a special case for any path that ends in "crawldb", but this is not a complete fix either, as someone else may rename his crawl database something else.
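          The special case described here could look something like the following standalone sketch; the helper name is hypothetical, and, as the comment notes, matching on the directory name breaks if someone renames their crawl database:

```java
// Hypothetical helper: if the user passes the crawldb root, append the
// "current" subdirectory where the MapFile parts actually live; segment
// directories such as crawl_fetch are left untouched.
class CrawlDbPath {

    public static String resolve(String path) {
        String trimmed = path.endsWith("/")
            ? path.substring(0, path.length() - 1)
            : path;
        if (trimmed.endsWith("crawldb")) {
            return trimmed + "/current";
        }
        return trimmed;
    }

    public static void main(String[] args) {
        System.out.println(resolve("crawl/crawldb"));       // crawl/crawldb/current
        System.out.println(resolve("segment/crawl_fetch")); // segment/crawl_fetch
    }
}
```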

          Andrzej Bialecki added a comment -

          I agree that the crawldb/current/ subdir is an implementation detail that should be hidden from users. All other tools take the name of the parent directory (crawldb/), so I see no reason why this tool should do it differently.

          Doğacan Güney added a comment -

          When someone thinks of crawldb, he would probably think of "crawldb" directory and not crawldb/current since
          current is pretty much an implementation detail (so that jobs that change crawldb can write their results to a temp directory under crawldb first then this dir can move to crawldb/current).

          So, it is not exactly bad to refer to current, it is just that it may be counter-intuitive for people, who may try to pass crawldb directory to DomainStatistics. Maybe we can add some documentation to command line?

          What do you think?

          Otis Gospodnetic added a comment -

          Thanks for the update. Sorry, I don't recall the details around crawldb/current... is referring to "current" bad?

          Hudson added a comment -

          Integrated in Nutch-trunk #707 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/707/)

          • DomainStatistics tool
          Doğacan Güney added a comment -

          DomainStatistics is committed as of rev. 738175 .

          I am leaving this issue open. We can deal with it after 1.0.

          Doğacan Güney added a comment -

          Here is an update to DomainStatistics patch so that it compiles with latest trunk.

          Otis, it seems the tool works when given crawldb/current and not just crawldb. Is this intended, or did I mess something up?

          Otis Gospodnetic added a comment -

          Could you take it if you have time, please?

          Doğacan Güney added a comment -

          I don't know much about the patch here. Otis, do you have time to update and commit Domain Stats? If not, I will take a look.

          Otis Gospodnetic added a comment -

          I'm +1 on getting Domain Stats into 1.0. The patch will need a small update, I think.

          Doğacan Güney added a comment -

          I don't know if this issue should be closed or not, but I am moving it to 1.1.

          (Should the Domain Statistics tool be in 1.0?)

          Otis Gospodnetic added a comment -

          After seeing NUTCH-650 I have a feeling this issue should be closed with "Won't Fix". Thoughts?

          Does it make sense to save and commit the Domain Statistics patch, though?
          (to be ported to Hbase approach later, once Hbase stuff from NUTCH-650 is in)

          Andrzej Bialecki added a comment -

          Not everything looks like a String. MapWritable is useful in situations where you need to (de)serialize non-String types. And most of the information in HostDb is numeric, so if we decided to use simple Metadata it would cause constant pointless conversion from/to Strings.

          Having said that, I'm for a specialized class (which can contain MapWritable as a placeholder for anything else than the specific built-in types of info).

          Doğacan Güney added a comment -

          +1 for extracting hostdb from crawldb...

          (also, do we really want to make hostdb just a map file of <Text,MapWritable>? IMHO, it would be better to design a proper HostDatum class with some statistics built-in, and then maybe a Metadata element [I guess it's just me but I hate MapWritable, I prefer Metadata:D])

          Andrzej Bialecki added a comment -

          IMHO a better option would be to put this data into CrawlDb, and then maintain HostDb data using CrawlDb as the source. The reason is that segments may contain duplicate urls; they may be missing, may be unparsed, etc. - in short, they are transient and not unique. Whereas a CrawlDb is a persistent store of our knowledge about all known urls, and contains only unique urls.

          So, I think that Fetcher-s should put this information in crawl_fetch, the updatedb should stick this information into CrawlDb-s CrawlDatum (this should happen automatically), and the HostDb would simply perform an aggregation of this info from CrawlDb, using hostname / domain name / tld as the keys.
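          A non-MapReduce illustration of the aggregation step described here: per-URL entries are grouped under a host key and their counts summed, the same shape a reduce over CrawlDb keyed by hostname would have (the class and method names are hypothetical):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: count urls per host, mimicking the group-by-key
// aggregation a HostDb-building job over CrawlDb would perform.
class HostAggregation {

    public static Map<String, Long> urlsPerHost(List<String> urls) {
        Map<String, Long> counts = new HashMap<>();
        for (String url : urls) {
            String host = java.net.URI.create(url).getHost();
            counts.merge(host, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = urlsPerHost(List.of(
            "http://nutch.apache.org/a",
            "http://nutch.apache.org/b",
            "http://lucene.apache.org/"));
        System.out.println(counts.get("nutch.apache.org")); // 2
    }
}
```

          The same grouping could use a domain or TLD extracted from the host as the key, giving the different aggregation levels mentioned earlier in the thread.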

          Otis Gospodnetic added a comment - edited

          HostDatum.java

          • really just holds a MapWritable

          HostDb.java

          • can read an existing HostDb (MapReduce job)
          • can merge host info from segments into the main HostDb (MapReduce job)

          The above classes are in the patch. Their descriptions are what the plan is and where the patch is headed. While I have not run/tested this code yet, I would very much appreciate if others could have a look and comment on the approach, and have a look at the 2 inner Mapper and 2 inner Reducer classes.

          As for where the host data will come from, I intend to modify Fetcher2 to dump host stats (number of requests, successes, failures, exceptions, timeouts, etc.) to, say, fetch_hosts file in the current segment. At this point I don't know what the best file format would be for that, so please .... show me the way.

          Otis Gospodnetic added a comment - edited

          Enis' DomainStatistics tool from NUTCH-439.
          (not a solution to this issue, just something that may go well with it)

          Here is example usage, for anyone who wants to try DomainStatistics (works nicely):

          $ bin/nutch org.apache.nutch.util.domain.DomainStatistics
          hdfs://nn:9000/user/otis/crawl/crawldb/current
          hdfs://nn:9000/user/otis/ds-host host 8

          You can then -cat ds-host file from DFS and pipe it to sort -nrk1 for sorting by count, higher count first.


            People

            • Assignee: Unassigned
            • Reporter: Otis Gospodnetic
            • Votes: 2
            • Watchers: 6
