Nutch / NUTCH-159

Specify temp/working directory for crawl

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.8
    • Fix Version/s: 0.8
    • Component/s: fetcher, indexer
    • Labels:
      None
    • Environment:

      Linux/Debian

      Description

      I ran a crawl of 100k web pages and got:

      org.apache.nutch.fs.FSError: java.io.IOException: No space left on device
      at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:149)
      at org.apache.nutch.fs.FileUtil.copyContents(FileUtil.java:65)
      at org.apache.nutch.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:178)
      at org.apache.nutch.fs.NutchFileSystem.rename(NutchFileSystem.java:224)
      at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80)
      Caused by: java.io.IOException: No space left on device
      at java.io.FileOutputStream.writeBytes(Native Method)
      at java.io.FileOutputStream.write(FileOutputStream.java:260)
      at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:147)
      ... 4 more
      Exception in thread "main" java.io.IOException: Job failed!
      at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
      at org.apache.nutch.crawl.Fetcher.fetch(Fetcher.java:335)
      at org.apache.nutch.crawl.Crawl.main(Crawl.java:107)
      byron@db02:/data/nutch$ df -k

      It appears the crawl created a /tmp/nutch directory that filled up, even though I specified a db directory.

      Nutch needs either a command-line parameter or a globally configurable temp (work area) directory for the instance, so that crawls won't fail.

        Activity

        Andrzej Bialecki added a comment -

        No longer applicable (moved to Hadoop)

        Paul Baclace added a comment -

        mapred.temp.dir and mapred.local.dir are used for different purposes.

        I think this is a sysadmin usability bug that really means:

        1. Defaults for these settings should be documented (of course).
        2. It should be clear whether a path is abstract (applies to NDFS or local FS depending on fs.default.name), local FS only, or NDFS only (if any). Config attribute names should consistently indicate this.
        3. There should be some clues as to how much space might be needed (some of this is in transition, however).
        4. When the space is exhausted, the error message should indicate the path(s) in question and the config param that is used to specify it.

        Separately, I am preparing a patch that will do (4) for mapred.local.dir.
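
A minimal sketch of the last point in the list above: an I/O failure that names both the path and the config parameter that selects it. This is not the patch mentioned in the comment and not Nutch code. The class `LocalDirWrite` and its methods `write` and `describe` are hypothetical names, and the config key is passed in as a plain string for illustration only.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical helper (not part of Nutch): writes a file under a directory
// that was chosen by a config parameter, and on failure reports both the
// path and the name of that parameter.
final class LocalDirWrite {

    // dir:       directory resolved from the config parameter
    // configKey: name of the parameter that selected dir, e.g. "mapred.local.dir"
    static void write(File dir, String configKey, String name, byte[] data)
            throws IOException {
        File target = new File(dir, name);
        try (OutputStream out = new FileOutputStream(target)) {
            out.write(data);
        } catch (IOException e) {
            // Keep the original exception as the cause so the low-level
            // reason (for example "No space left on device") is not lost.
            throw new IOException(describe(target, configKey, e), e);
        }
    }

    // Builds the message: failing path, config parameter, underlying reason.
    static String describe(File target, String configKey, IOException cause) {
        return "Write to " + target.getAbsolutePath()
                + " failed (directory set by config parameter '" + configKey
                + "'): " + cause.getMessage();
    }
}
```

With a message of this shape, a full /tmp would point the operator at the setting to change rather than only at the failing stream.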

        byron miller added a comment -

        While it's from the mapred trunk, it is a local-only (non-NDFS) instance. mapred.temp.dir was left at its default (which didn't exist):

        <property>
        <name>mapred.temp.dir</name>
        <value>/tmp/nutch/mapred/temp</value>
        <description>A shared directory for temporary files.
        </description>
        </property>

        I'm going to modify this and re-run my fetch and let you know how that works.

        Doug Cutting added a comment -

        mapred.local.dir is the thing to set. If that fails, then there is a bug. What did you have this set to?


          People

          • Assignee: Unassigned
          • Reporter: byron miller
          • Votes: 1
          • Watchers: 0
