  1. Nutch
  2. NUTCH-159

Specify temp/working directory for crawl

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.8
    • Fix Version/s: 0.8
    • Component/s: fetcher, indexer
    • Labels: None
    • Environment: Linux/Debian

    Description

      I ran a crawl of 100k web pages and got:

      org.apache.nutch.fs.FSError: java.io.IOException: No space left on device
      at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:149)
      at org.apache.nutch.fs.FileUtil.copyContents(FileUtil.java:65)
      at org.apache.nutch.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:178)
      at org.apache.nutch.fs.NutchFileSystem.rename(NutchFileSystem.java:224)
      at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80)
      Caused by: java.io.IOException: No space left on device
      at java.io.FileOutputStream.writeBytes(Native Method)
      at java.io.FileOutputStream.write(FileOutputStream.java:260)
      at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:147)
      ... 4 more
      Exception in thread "main" java.io.IOException: Job failed!
      at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
      at org.apache.nutch.crawl.Fetcher.fetch(Fetcher.java:335)
      at org.apache.nutch.crawl.Crawl.main(Crawl.java:107)
      byron@db02:/data/nutch$ df -k

      It appears the crawl created a /tmp/nutch directory that filled up, even though I specified a db directory.

      Nutch needs either a command-line parameter or a globally configurable temp (work area) directory for the Nutch instance, so that crawls won't fail when /tmp runs out of space.
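
      The request above can be illustrated with a minimal sketch in plain Java. It is not Nutch code and does not use any Nutch API: the property name crawl.work.dir, the class name WorkDirCheck, and the 1 GiB threshold are all hypothetical, chosen only to show the idea of resolving a configurable work directory (falling back to the JVM temp directory) and checking its free space before a long job starts.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WorkDirCheck {
    // Hypothetical property name for illustration; not an actual Nutch setting.
    static final String WORK_DIR_PROPERTY = "crawl.work.dir";

    // Use the configured work directory if set, otherwise the JVM temp directory.
    static Path resolveWorkDir() {
        String configured = System.getProperty(WORK_DIR_PROPERTY);
        String base = (configured != null && !configured.isEmpty())
                ? configured
                : System.getProperty("java.io.tmpdir");
        return Paths.get(base).toAbsolutePath().normalize();
    }

    // True only if dir exists and its file store reports at least requiredBytes usable.
    static boolean hasFreeSpace(Path dir, long requiredBytes) throws IOException {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        return Files.getFileStore(dir).getUsableSpace() >= requiredBytes;
    }

    public static void main(String[] args) throws IOException {
        Path workDir = resolveWorkDir();
        long required = 1L << 30; // 1 GiB, an arbitrary threshold for this sketch
        if (!hasFreeSpace(workDir, required)) {
            System.err.println("Not enough space in work directory: " + workDir);
            System.exit(1);
        }
        System.out.println("Work directory OK: " + workDir);
    }
}
```

      Under these assumptions it could be run as java -Dcrawl.work.dir=/data/nutch/tmp WorkDirCheck, so that the check fails up front instead of partway through a fetch.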

          People

            Assignee: Unassigned
            Reporter: byron miller (byronm)