NUTCH-1087

Deprecate crawl command and replace with example script

    Details

    • Type: Task
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.6, 2.2
    • Component/s: None
    • Labels: None

      Description

      • remove the crawl command
      • add basic crawl shell script

      See thread:
      http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html
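      The replacement script chains the individual Nutch 1.x commands. A minimal sketch of such a crawl cycle follows; the `crawl_cycle` helper name, the NUTCH override, and the path layout are illustrative assumptions, not the committed script:

```shell
#!/bin/sh
# Sketch of the crawl cycle the example script performs: inject seeds,
# then repeat generate/fetch/parse/updatedb, and finally invert links.
# The function name and NUTCH override are illustrative assumptions.
crawl_cycle() {
  seeddir="$1"        # directory containing seed URL lists
  crawl_path="$2"     # crawldb/linkdb/segments live under here
  rounds="${3:-2}"    # number of generate/fetch/parse/update rounds
  nutch="${NUTCH:-bin/nutch}"

  "$nutch" inject "$crawl_path/crawldb" "$seeddir"
  i=1
  while [ "$i" -le "$rounds" ]; do
    "$nutch" generate "$crawl_path/crawldb" "$crawl_path/segments"
    # the newest segment has the numerically largest timestamp name
    segment=$(ls "$crawl_path/segments/" | sort -n | tail -n 1)
    "$nutch" fetch "$crawl_path/segments/$segment"
    "$nutch" parse "$crawl_path/segments/$segment"
    "$nutch" updatedb "$crawl_path/crawldb" "$crawl_path/segments/$segment"
    i=$((i + 1))
  done
  "$nutch" invertlinks "$crawl_path/linkdb" -dir "$crawl_path/segments"
}

# usage (illustrative): crawl_cycle urls/ crawl/ 2
```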

      1. NUTCH-1087-2.1-2.patch
        0.9 kB
        Tristan Buckner
      2. NUTCH-1087-2.1.patch
        5 kB
        Julien Nioche
      3. NUTCH-1087-1.6-3.patch
        5 kB
        Julien Nioche
      4. NUTCH-1087-1.6-2.patch
        5 kB
        Markus Jelsma
      5. NUTCH-1087.patch
        5 kB
        Julien Nioche


          Activity

          Andrzej Bialecki added a comment -

          IIRC we had this discussion in the past... It's true that we already rely on Bash to do anything useful, no matter whether it's on Windows or on a *nix-like OS. And it's true that the crawl command has been a constant source of confusion over the years. The crawl application also suffered from some subtle bugs, especially when running in local mode (e.g. the PluginRepository leaks).

          But the argument about maintenance costs is IMHO moot - you have to maintain a shell script, too, so it's no different from maintaining a Java class. Where it differs, I think, is that moving the crawl cycle logic to a shell script now raises the bar for Java developers who are not familiar with Bash scripting - a robust crawl script is not easy to follow, as it needs to handle error conditions and manage input/output resources on HDFS. On the other hand, it's easier for system admins to tweak a script than to tweak Java code... so I guess it's also a question of who's the audience for this functionality.

          I'm +0 for removing Crawl and replacing it with a script, IMHO it doesn't change the picture in any significant way.

          Markus Jelsma added a comment -

          20120304-push-1.6

          Julien Nioche added a comment -

          WORK IN PROGRESS
          Need to add more comments + include the injection, linkdb and SOLR steps
          The rest of the script should be fine and should provide a good basis.

          Julien Nioche added a comment -

          First version of the nutch crawl script. Please test and review

          Markus Jelsma added a comment -

          Works nicely but it cannot be run from the runtime/local directory. The wiki usually describes commands to be run from there.

          $ bin/crawl urls/ crawl/crawldb http://localhost:8983/solr 2
          bin/crawl: line 89: ./nutch: No such file or directory

          All goes well until invertlinks:

          LinkDb: starting at 2012-07-10 15:09:12
          LinkDb: linkdb: ../crawl/crawldb/linkdb
          LinkDb: URL normalize: true
          LinkDb: URL filter: true
          LinkDb: internal links will be ignored.
          LinkDb: adding segment: 20120710150834
          LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/markus/trunk/runtime/local/bin/20120710150834/parse_data
                  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
                  at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
                  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
                  at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
                  at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
                  at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
                  at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
                  at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
                  at java.security.AccessController.doPrivileged(Native Method)
                  at javax.security.auth.Subject.doAs(Subject.java:396)
                  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
                  at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
                  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
                  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
                  at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:180)
                  at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:295)
                  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
                  at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:260)

          I also think 2GB heap space for child tasks is far too much for common installations.

          Markus Jelsma added a comment -

          Here's a new patch fixing the invertlinks command, reducing the heap size to 1000m, and fixing two log lines.

          Julien Nioche added a comment -

          Good catch Markus. Ideally we'd need to add something to the script so that it determines where the nutch command is located. I'll have a look at that.

          Julien Nioche added a comment -

          The script now determines where the nutch script is located and works when called from the bin dir or outside of it.
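          The usual shell idiom for this, shown here as a sketch of the approach rather than the exact committed code:

```shell
# Resolve the directory this script lives in, so the sibling 'nutch'
# launcher can be invoked no matter where the script is called from.
SCRIPT_DIR=$(cd "$(dirname "$0")" && pwd)
echo "nutch launcher expected at: $SCRIPT_DIR/nutch"
```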

          Julien Nioche added a comment -

          Similar patch for 2.x - NOT TESTED YET

          Julien Nioche added a comment -

          Trunk: committed revision 1359720.
          2.x: still needs testing

          Hudson added a comment -

          Integrated in Nutch-trunk #1893 (See https://builds.apache.org/job/Nutch-trunk/1893/)
          NUTCH-1087 Deprecate crawl command and replace with example script (Revision 1359720)

          Result = SUCCESS
          jnioche : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1359720
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/src/bin/crawl
          • /nutch/trunk/src/bin/nutch
          Julien Nioche added a comment -

          Nutch 2.x: committed revision 1400390.

          Can open a new issue if there are any problems with the script. Should be a good starting point.

          Hudson added a comment -

          Integrated in Nutch-nutchgora #385 (See https://builds.apache.org/job/Nutch-nutchgora/385/)
          NUTCH-1087 crawl script (Revision 1400390)

          Result = SUCCESS
          jnioche :
          Files :

          • /nutch/branches/2.x/CHANGES.txt
          • /nutch/branches/2.x/src/bin/crawl
          Tristan Buckner added a comment -

          Solr indexing step needed to have the $SEGMENT path fixed as well. Also in local mode sed, on Mac OS at least, doesn't successfully replace spaces with newlines. Changed to awk.
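          The portability issue: GNU sed accepts \n in the replacement text, but the BSD sed shipped with Mac OS does not. A small illustration of the awk replacement (the segment names are made up):

```shell
# Split a space-separated segment list into one name per line.
# sed 's/ /\n/g' works with GNU sed but not with BSD/macOS sed;
# awk (or tr ' ' '\n') behaves the same everywhere.
SEGMENTS="20120710150834 20120710151201"
echo "$SEGMENTS" | awk '{ for (i = 1; i <= NF; i++) print $i }'
```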

          Sebastian Nagel added a comment -

          Hi Tristan,
          thanks for the patch! The segment path of solrindex was already reported in NUTCH-1500.
          Can you open a new issue for the Mac OS problem? It's clearer to keep the problems separate than to reopen resolved issues. Thanks. Btw., maybe a simple solution

          SEGMENT=`ls $CRAWL_PATH/segments/ | sort -n | tail -n 1`
          

          without sed or awk is preferable. Does it work on Mac OS?

          Julien Nioche added a comment -

          Hi Sebastian

          SEGMENT=`ls $CRAWL_PATH/segments/ | sort -n | tail -n 1`

          is not a good option, as it works only in local mode and not in deploy mode, whereas 'hadoop fs -ls' works in both cases.

          Julien

          Sebastian Nagel added a comment -

          Yes, of course, but there is already an if-else to separate local from distributed mode. Let's move the discussion to a new issue.
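          A sketch of such an if-else; the function and variable names are assumptions, and the parsing of the hadoop listing is illustrative rather than the committed code:

```shell
# Pick the newest segment: plain 'ls' in local mode, 'hadoop fs -ls'
# when the segments live on HDFS in deploy mode. Names here are
# illustrative assumptions, not the script's actual variables.
latest_segment() {
  mode="$1" segments_dir="$2"
  if [ "$mode" = "local" ]; then
    ls "$segments_dir" | sort -n | tail -n 1
  else
    # hadoop fs -ls prints a "Found N items" header and then one full
    # path per line; skip the header and keep only the last component
    hadoop fs -ls "$segments_dir" | awk 'NR > 1 { print $NF }' \
      | sed 's|.*/||' | sort -n | tail -n 1
  fi
}
```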

          Julien Nioche added a comment -

          Apologies Seb, I should (a) not read emails late in the evening after a long day (b) check the code before commenting


            People

            • Assignee: Julien Nioche
            • Reporter: Markus Jelsma
            • Votes: 0
            • Watchers: 5