Nutch / NUTCH-117

Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.6, 0.7, 0.7.1
    • Fix Version/s: 0.7.2
    • Component/s: None
    • Labels:
      None
    • Environment:

      Windows 2000 P4 1.70GHz 512MB RAM
      Java 1.5.0_05

      Description

      I started a crawl from the command line using nutch 0.7.1.

      nutch-daemon.sh start crawl urls.txt -dir oct18 -threads 4 -depth 20

      After crawling for over 15 hours the crawl crashed with the following exception:

      051019 050543 status: segment 20051019050438, 30 pages, 0 errors, 1589818 bytes, 48020 ms
      051019 050543 status: 0.6247397 pages/s, 258.65167 kb/s, 52993.934 bytes/page
      051019 050544 Updating C:\nutch\crawl.intranet\oct18\db
      051019 050544 Updating for C:\nutch\crawl.intranet\oct18\segments\20051019050438
      051019 050544 Processing document 0
      051019 050544 Finishing update
      051019 050544 Processing pagesByURL: Sorted 47 instructions in 0.02 seconds.
      051019 050544 Processing pagesByURL: Sorted 2350.0 instructions/second
      Exception in thread "main" java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
      at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
      at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
      at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
      at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
      at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
      at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)

      This was on the 14th segment of the requested depth of 20. A quick Google search on the exception brings up a few previous posts with the same error but no definitive answer. It seems to have been occurring since nutch 0.6.

        Activity

        lokkju Nick Jacobsen added a comment -

        I had a similar issue, and it seems (guessing here) to be related to some sort of race condition on file handles. I was running the nutch crawler while doing some heavy processing (compiling java on OpenBSD in a virtual machine), and 19 out of 20 times nutch would crash with that or a similar error, always related to some sort of file not found, and sometimes access denied. As soon as I stopped the heavy processing, my nutch errors went down to 1 out of every 20 runs.

        Based on this, I have come to the conclusion above that it is some sort of file handle race condition. Also, for those of you wondering, it did not matter whether I was running 1 or 30 threads; I had the same problems.

        Hope this helps a little.

        scross Stephen Cross added a comment -

        I think this is the same problem as a couple of other Nutch issues already in Jira:

        NUTCH-94: MapFile.Writer throwing 'File exists error'
        http://issues.apache.org/jira/browse/NUTCH-94

        NUTCH-96: MapFile.Writer throws directory exists exception if run multiple times in the same JVM or server JVM
        http://issues.apache.org/jira/browse/NUTCH-96

        hk200 Spike Wang added a comment -

        I have the same problem when running the crawl functionality of Nutch 7.0 in WAS 5.1 using IBM JDK 1.4, but it runs very well on Tomcat 5.0.28 using Sun JDK 1.4.

        Exception in thread "main" java.io.IOException: already exists: %CRAWL_RESULT_HOME%\db\webdb.new\pagesByURL
        at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
        at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
        at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
        at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
        at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)

        I checked and debugged the code and found that the relevant files have not been released at the point where they are deleted.
        I may add a tested patch to solve this problem.

        malulin Mike Alulin added a comment -

        I have the same issue on my new production system, although the same code works on dev and on the old production system without any problems.

        The solution for this bug is to uncomment "pageDb.close();" in the WebDBWriter.java file. Otherwise the reader keeps the webdb.new\pagesByURL\data file locked, and sometimes it cannot be deleted.
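
        The general pattern behind this fix can be sketched in plain Java. This is only an illustration, not the Nutch code: the class CloseBeforeReplace, its methods, and the temporary "data" file are all hypothetical. It assumes the usual Windows behaviour where File.delete() tends to fail while a FileInputStream on the same file is still open, so the reader is closed before the old file is deleted and recreated.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical illustration of "close the reader before replacing its file".
// None of these names come from Nutch.
class CloseBeforeReplace {

    // Deletes the old file and writes new content in its place.
    // The caller is expected to have closed every reader on the file first.
    static void replace(File data, String content) throws IOException {
        if (data.exists() && !data.delete()) {
            // With a reader still open on Windows, this is the likely failure path.
            throw new IOException("could not delete (still open?): " + data);
        }
        try (OutputStream out = new FileOutputStream(data)) {
            out.write(content.getBytes(StandardCharsets.UTF_8));
        }
    }

    // Writes a file, reads from it, closes the reader, then replaces the file.
    static String demo() {
        try {
            File dir = Files.createTempDirectory("webdb-sketch").toFile();
            File data = new File(dir, "data");
            try (OutputStream out = new FileOutputStream(data)) {
                out.write("old".getBytes(StandardCharsets.UTF_8));
            }

            // try-with-resources closes the reader, releasing the file handle.
            try (InputStream reader = new FileInputStream(data)) {
                reader.read();
            }

            replace(data, "new");
            String result = new String(Files.readAllBytes(data.toPath()),
                                       StandardCharsets.UTF_8);

            // Clean up the temporary files.
            data.delete();
            dir.delete();
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

        If the reader were left open instead, replace() would be expected to throw on Windows, while on most Unix-like systems the delete would still succeed. That difference may explain why the same code behaves differently from one machine to another.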

        pkosiorowski Piotr Kosiorowski added a comment -

        Applied the fix by Mike. Also reported off-list by Michal Karwanski.


          People

          • Assignee:
            pkosiorowski Piotr Kosiorowski
            Reporter:
            scross Stephen Cross