I think this is the same problem as a couple of other Nutch issues already in Jira.
I have the same problem when running the crawling functionality of Nutch 7.0 in WAS 5.1 using IBM JDK 1.4. But it runs very well on Tomcat 5.0.28 using Sun JDK 1.4.
Exception in thread "main" java.io.IOException: already exists: %CRAWL_RESULT_HOME%\db\webdb.new\pagesByURL
I checked and debugged the code and found that the related files had not been released at the point where they were deleted.
I have the same issue on my new production system, although the same code works on dev and the old production system without any problems.
The solution for this bug is to uncomment "pageDb.close();" in the WebDBWriter.java file. Otherwise the reader keeps a lock on the webdb.new\pagesByURL\data file and it sometimes cannot be deleted. Applied the fix by Mike. Also reported offlist by Michal Karwanski.
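For anyone wondering why the close matters, here is a minimal sketch of the general pattern, using plain java.io rather than Nutch's actual WebDBWriter code. The class name CloseBeforeDelete and the helper closeThenDelete are hypothetical names made up for this illustration. The idea is simply that the read handle is released before the delete is attempted. On Windows a still-open handle will typically make File.delete() return false, while on Linux the delete usually succeeds anyway, which may be why the same code behaves differently across environments.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Hypothetical illustration, not Nutch code.
class CloseBeforeDelete {

    // Reads from the file, releases the handle, and only then deletes it.
    // Returns true if the file was deleted.
    static boolean closeThenDelete(File file) throws IOException {
        FileInputStream reader = new FileInputStream(file);
        try {
            reader.read(); // stand-in for whatever the real reader does
        } finally {
            // Plays the role of the pageDb.close() call mentioned above:
            // without it the handle stays open until garbage collection.
            reader.close();
        }
        return file.delete();
    }

    public static void main(String[] args) throws IOException {
        // Temporary file standing in for webdb.new\pagesByURL\data.
        File data = File.createTempFile("pagesByURL", ".data");
        System.out.println("deleted after close: " + closeThenDelete(data));
    }
}
```

The try/finally form is used so that the handle is released even if the read throws, and so that the sketch also compiles on older JDKs.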
Based on this, I have come to the above conclusion that it is some sort of file handle race condition. Also, for those of you wondering, it did not matter whether I was running 1 or 30 threads; I had the same problems.
Hope this helps a little.