Details
- Bug
- Status: Open
- Major
- Resolution: Unresolved
- 2.3.1, 1.16
- None
- Ubuntu 16.04 64-bit
  Oracle Java 8 64-bit
  Nutch 2.3.1 (standalone deployment)
  MongoDB 3.4
Description
My application performs continuous crawling through the Nutch REST services. It injects a seed URL and then repeats the GENERATE/FETCH/PARSE/UPDATEDB sequence a requested number of times. Each step in the sequence starts only after the previous step has completed successfully, and then the whole sequence is repeated. Here is a brief description of the job:
- Number of GENERATE/FETCH/PARSE/UPDATEDB cycles per run: 50
- 'topN' parameter value of GENERATE step in each cycle: 10
- Seed URL: http://www.cnn.com
- Regex URL filters for all jobs:
  - "-^.{1000,}$" - exclude very long URLs
  - "+." - include the rest
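For readers who want to reproduce the workload, the control flow described above can be sketched roughly as follows. This is an illustrative outline only, not the reporter's application: `submit_job` is a hypothetical callback standing in for whatever code submits a job to the Nutch REST server and waits for it to finish, and the argument keys `seed` and `topN` are placeholder names.

```python
from typing import Callable, Dict, List

# Step order taken from the report; INJECT runs once before the first cycle.
CYCLE_STEPS = ["GENERATE", "FETCH", "PARSE", "UPDATEDB"]


def run_crawl(submit_job: Callable[[str, Dict[str, str]], bool],
              seed_url: str,
              cycles: int = 50,
              top_n: int = 10) -> List[str]:
    """Run INJECT once, then `cycles` rounds of the four-step sequence.

    `submit_job(step, args)` is a hypothetical callback: it is assumed to
    submit one job, block until the job ends, and return True on success.
    Returns the names of the steps that completed, in order.
    """
    completed: List[str] = []

    def run_step(step: str, args: Dict[str, str]) -> None:
        # A step must succeed before the next one is allowed to start.
        if not submit_job(step, args):
            raise RuntimeError(
                f"{step} failed after {len(completed)} completed steps")
        completed.append(step)

    run_step("INJECT", {"seed": seed_url})
    for _ in range(cycles):
        for step in CYCLE_STEPS:
            # Only GENERATE receives topN in this sketch.
            args = {"topN": str(top_n)} if step == "GENERATE" else {}
            run_step(step, args)
    return completed
```

With the defaults from the report (50 cycles, topN of 10), one run issues one INJECT job plus 200 cycle jobs against the server.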
To monitor the Nutch server I use Java VisualVM, which ships with the JDK. After each run (50 cycles of GENERATE/FETCH/PARSE/UPDATEDB) I trigger a garbage collection from VisualVM and check memory usage. My observation is that the Nutch server leaks ~25 MB per run.
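The per-run figure comes from comparing heap usage after a forced garbage collection at the end of consecutive runs. A minimal sketch of that arithmetic, assuming the post-GC readings (in MB) have been noted down by hand from the monitoring tool; the function names are illustrative and not part of Nutch or VisualVM:

```python
from typing import List, Sequence


def growth_per_run(post_gc_heap_mb: Sequence[float]) -> List[float]:
    """Return the change in retained heap between consecutive runs.

    Each reading is the used heap (MB) measured right after a forced GC
    at the end of a run, so short-lived garbage is excluded.
    """
    if len(post_gc_heap_mb) < 2:
        raise ValueError("need at least two post-GC readings")
    return [b - a for a, b in zip(post_gc_heap_mb, post_gc_heap_mb[1:])]


def mean_growth(post_gc_heap_mb: Sequence[float]) -> float:
    """Average retained-heap growth per run, in MB."""
    deltas = growth_per_run(post_gc_heap_mb)
    return sum(deltas) / len(deltas)
```

Growth that stays positive and roughly constant across many runs points to a leak, whereas growth that levels off after a few runs is more likely cache warm-up.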
NOTE: I added custom HTTP DELETE services that clear the job history in NutchServerPoolExecutor and remove all custom configurations from RAMConfManager after each run. The observed ~25 MB leak is therefore what remains after the job history and configurations have been cleaned up.
Attachments
Issue Links
- relates to NUTCH-1746 OutOfMemoryError in Mappers (Open)