Details

    • Type: New Feature
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.9
    • Component/s: None
    • Labels: None

      Description

      A HostDB for Nutch and associated tools to create and read a database containing information on hosts.

      1. NUTCH-1325-trunk-v4.patch
        45 kB
        Tejas Patil
      2. NUTCH-1325-trunk-v3.patch
        44 kB
        Tejas Patil
      3. NUTCH-1325-removed-from-1.8.patch
        44 kB
        Markus Jelsma
      4. NUTCH-1325-1.6-1.patch
        43 kB
        Markus Jelsma
      5. NUTCH-1325.trunk.v2.path
        44 kB
        Tejas Patil

        Activity

        Markus Jelsma added a comment -

        Committed revision 1575282.
        It is removed!

        Markus Jelsma added a comment -

        Patch to remove HostDB from 1.8 trunk

        Lewis John McGibbney added a comment -

        +1. Better to be safe than sorry. We can come back to this in due course.

        Markus Jelsma added a comment -

        That is no problem Tejas. Shall we take the HostDB out of 1.8 again and fix it for 1.9?

        Lewis John McGibbney added a comment -

        Boooooo

        Tejas Patil added a comment -

        It would take me a few weeks before I can work on this one. The reason being: I have recently left school and started working at a company. There is some legal paperwork that I have to finish off before working on open source projects (even if it's during my free time).

        Markus Jelsma added a comment -

        It seems it does work somehow, but when dumping the hostdb I see DNS failures for all hosts (which is incorrect because DNS works and the hosts are valid) and all statistics are zero. Homepages are also missing.

        http://www.example.org/     Version: 1
        Homepage url: 
        Score: 0.0
        Last check: 2014-03-04 13:05:50
        Total records: 0
          Unfetched: 0
          Fetched: 0
          Gone: 0
          Perm redirect: 0
          Temp redirect: 0
          Not modified: 0
        Total failures: 1
          DNS failures: 1
          Connection failures: 0
        
        Markus Jelsma added a comment - edited

        forget this comment

        Markus Jelsma added a comment -

        Hi Tejas, can you check this out before 1.8? I cannot seem to get it to work properly.

        markus@midas:~/projects/apache/nutch/trunk/runtime/local$ bin/nutch hostdb -Dplugin.includes="urlfilter-(domain)" crawl/hostdb -crawldb crawl/crawldb/  -checkAll
        HostDb: crawldb: crawl/crawldb
        HostDb: checking all hosts
        HostDb: starting at 2014-03-04 14:02:45
        http://.../: existing_unknown_host Version: 1
        Homepage url: 
        Score: 0.0
        Last check: 2014-03-04 14:02:47
        Total records: 0
          Unfetched: 0
          Fetched: 0
          Gone: 0
          Perm redirect: 0
          Temp redirect: 0
          Not modified: 0
        Total failures: 1
          DNS failures: 1
          Connection failures: 0
        
        java.lang.NullPointerException
                at org.apache.hadoop.io.SequenceFile$Writer.checkAndWriteSync(SequenceFile.java:1030)
                at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1072)
                at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:74)
                at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:586)
                at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
                at org.apache.nutch.util.hostdb.HostDb$HostDbReducer$ResolverThread.run(HostDb.java:469)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                at java.lang.Thread.run(Thread.java:744)
        
        Tejas Patil added a comment -

        Hi Markus Jelsma,
        Thanks for the correction. This feature would not have been possible without you in the first place. Apart from being a good addition to Nutch, HostDb has also helped in arriving at a simple design for the Sitemap feature (NUTCH-1465).

        Cheers !!!

        Markus Jelsma added a comment -

        conf/log4j.properties has two dots in the classpath for hostdb rules.
        Committed revision 1560327.

        Lewis John McGibbney added a comment -

        yeah Tejas this is a belter

        Markus Jelsma added a comment -

        Thanks a lot Tejas for spending your time fixing the loose ends I left in the original work. I'll migrate your code back into our own Nutch.
        Great work!

        Hudson added a comment -

        SUCCESS: Integrated in Nutch-trunk #2501 (See https://builds.apache.org/job/Nutch-trunk/2501/)
        NUTCH-1325 HostDB for Nutch (tejasp: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1560316)

        • /nutch/trunk/CHANGES.txt
        • /nutch/trunk/conf/log4j.properties
        • /nutch/trunk/src/bin/nutch
        • /nutch/trunk/src/java/org/apache/nutch/crawl/NutchWritable.java
        • /nutch/trunk/src/java/org/apache/nutch/util/hostdb
        • /nutch/trunk/src/java/org/apache/nutch/util/hostdb/DumpHostDb.java
        • /nutch/trunk/src/java/org/apache/nutch/util/hostdb/HostDatum.java
        • /nutch/trunk/src/java/org/apache/nutch/util/hostdb/HostDb.java
        Tejas Patil added a comment -

        Thanks Markus Jelsma for the heads up. I have committed the patch to trunk (rev 1560316).

        Markus Jelsma added a comment -

        Hi Tejas - I am fine with the changes you uploaded yesterday. The filtering is still as ugly as it was, but it works for https now.

        Tejas Patil added a comment -

        Attaching NUTCH-1325-trunk-v4.patch with the following changes:

        • Fixed filterNormalize() to prevent incorrectly prepending "http://" to normal URLs.
        • Migrated HostDb to the new map-reduce API.
        Lewis John McGibbney added a comment -

        Hey Tejas Patil, great work on this one. I'll patch up and give this a spin tomorrow.

        Tejas Patil added a comment -

        Could anyone please look at the patch and let us know if there are any flaws or improvements that should be addressed?

        Markus Jelsma added a comment -

        Hi Tejas - I think most of it seems fine now. I like the changes you've made so far, and I cannot come up with a better solution right now for the https:// scheme filtering issue.

        Are there any other issues we didn't think about? Anyone else?

        Tejas Patil added a comment -

        A final patch (NUTCH-1325-trunk-v3.patch) to complete this feature.
        Uploaded the patch to Review Board too: https://reviews.apache.org/r/16555/

        Comments are welcome !!!

        Markus Jelsma added a comment -

        Hi Tejas,

        (1):
        Current mapper is:

                if (datum.numFailures() >= failureThreshold) {
        
                  // TODO: also write to external storage, i.e. memcache
                  context.write(key, emptyText);
                }
        

        If we change this to:

          context.write(key, new IntWritable(datum.numFailures()));
        

        Then in the reducer we can check whether all hosts have failed and, if so, emit the domain name. If one host hasn't failed, we have to emit all the failed host names.

        (2):
        Perhaps we can retry with https:// and other schemes if the first attempt fails with http://. It is ugly but should work.

        Cheers,
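        The rule proposed in (1), emitting the domain only when every one of its hosts has failed, can be illustrated as plain Java outside Hadoop. This is a hypothetical sketch (the class and method names are invented and are not part of any attached patch):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the reducer rule discussed above: if ALL hosts of
// a domain reached the failure threshold, emit only the domain name;
// otherwise emit just the individual failed hosts, so a healthy host never
// gets its whole domain blacklisted.
public class FailedDomainRule {
  public static List<String> emit(String domain,
                                  Map<String, Integer> failuresByHost,
                                  int failureThreshold) {
    List<String> failedHosts = new ArrayList<>();
    boolean allFailed = !failuresByHost.isEmpty();
    for (Map.Entry<String, Integer> e : failuresByHost.entrySet()) {
      if (e.getValue() >= failureThreshold) {
        failedHosts.add(e.getKey());   // this host exceeded the threshold
      } else {
        allFailed = false;             // at least one host is still healthy
      }
    }
    // Collapse to the bare domain only when no healthy host remains.
    return allFailed ? Collections.singletonList(domain) : failedHosts;
  }
}
```

        In a real Hadoop reducer the failure counts would arrive as the values grouped under the domain key, but the decision logic would be the same.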

        Tejas Patil added a comment -

        Hi Markus Jelsma,
        I stopped by this Jira (after a long time!!!) with the intention of getting it to a stage where we could have it in trunk.
        You had replied to my two concerns.

        For (1):

        host_a.example.org, host_b.example.org ==> example.org

        This might NOT be a good idea.
        (a) The websites for, say, "cs.uci.edu" and "bio.uci.edu" might be hosted independently. It can be argued that they should be considered different hosts.
        (b) I am not sure about the standards, but if something like "uci.cs.edu" is valid (the subdomain is a suffix of the domain) then there would be a problem when we resolve "uci.cs.edu" and "ucla.cs.edu" to "cs.edu".

        For (2): "I use the HTTP:// scheme but not all hosts may allow that scheme. We have a modified domain filter that optionally takes a scheme so we can force HTTPS for specific domains. Those domains are filtered out because HTTP is not allowed."
        Do you have any suggestion to work this out ?

        Lewis John McGibbney added a comment -

        Hi Otis Gospodnetic there is already a host table implementation in 2.x

        Otis Gospodnetic added a comment -

        This seems very useful to me. Any reason not to commit this now? (would be nice to have it in 2.x, too)

        Tejas Patil added a comment -

        Hi Markus Jelsma,

        > think i've got a slightly newer version of the tools but don't know what actually changed in the past year. I'll try to diff and upload it.

        Could you kindly upload the newer version?

        Markus Jelsma added a comment -

        Hi Tejas - you're right for (1), it should indeed be host_a.example.org, host_b.example.org ==> example.org but not x.xyz.org, a.abc.org ==> unknown. The reducer should take the domain + suffix as key and then emit the domain if ALL hosts are unknown. If you emit a domain if most but not all hosts are unknown, the DomainBlacklistURLFilter will remove the entire domain from the CrawlDB and WebgraphDB.

        The example for (2) does not include cross-domain redirects but the problem is similar. I think it works fine for now because multi-redirects are not very common on the entire internet.

        A larger problem is the filterNormalize() method. It actually receives a hostname, not a URL. And to pass URL filters we must prepend the URL scheme to make it look like a URL. I use the HTTP:// scheme but not all hosts may allow that scheme. We have a modified domain filter that optionally takes a scheme so we can force HTTPS for specific domains. Those domains are filtered out because HTTP is not allowed.
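        The scheme-prepending workaround described above can be sketched as a small helper. This is hypothetical illustration code (toFilterableUrl is an invented name, not the actual filterNormalize() from the patch):

```java
// Hypothetical sketch of the workaround described above: HostDB keys are
// bare hostnames, but URL filters expect full URLs, so a scheme is
// prepended first. Hosts that only allow https:// would then be filtered
// out, which is exactly the limitation pointed out in this comment.
public class SchemePrepend {
  public static String toFilterableUrl(String host) {
    // leave values that already carry a scheme untouched
    if (host.matches("(?i)^[a-z][a-z0-9+.-]*://.*")) {
      return host;
    }
    return "http://" + host + "/";
  }
}
```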

        I think i've got a slightly newer version of the tools but don't know what actually changed in the past year. I'll try to diff and upload it.

        Tejas Patil added a comment -

        Hi Markus Jelsma,
        The initial patch is good. This feature would be a good addition to Nutch.
        I made some minor changes to it (NUTCH-1325.trunk.v2.path), mainly to make it work with the current trunk.

        Sorry for bringing this up (after one entire year). Would it be ok if I take this work forward?

        If "yes", then kindly provide me more details about the items in the "TODO" comments:
        (1) The DumpHostDb class doesn't have a reducer, and there was this comment there:

        reduce unknown hosts to single unknown domain if possible. Enable via configuration
        host_a.example.org,host_a.example.org ==> example.org

        In the example, both hosts were the same. Are these ok:

        • host_a.example.org, host_b.example.org ==> example.org
        • x.xyz.org, a.abc.org ==> unknown

        (2) In the UpdateHostDb class, map() method:

        TODO: fix multi redirects: host_a => host_b/page => host_c/page/whatever
        http://www.ferienwohnung-armbruster.de/
        http://www.ferienwohnung-armbruster.de/website/
        http://www.ferienwohnung-armbruster.de/website/willkommen.php
        
        We cannot reresolve redirects for host objects as CrawlDatum metadata is
        not available. We also cannot reliably use the reducer in all cases since
        redirects may be across hosts or even domains. The example above has
        redirects that will end up in the same reducer. During that phase,
        however, we do not know which URL redirects to the next URL.

        The example does not show the case where the redirections are across different hosts.

        Markus Jelsma added a comment -

        Initial patch. This introduces a HostDB that keeps track of host information such as its homepage, CrawlDB statistics, and DNS status, and allows metadata to be added. The dump tool can produce output suitable for the DomainBlacklistURLFilter. With it, you can automatically get rid of unknown hosts that pollute your CrawlDB.

        Comments are appreciated as usual!


          People

          • Assignee:
            Tejas Patil
            Reporter:
            Markus Jelsma
          • Votes:
            2
            Watchers:
            6
