Details
Description
I am trying to crawl a few hundred URLs incrementally. Performance degrades suddenly once the index reaches approximately 25000 URLs.
At first each batch of inject, (generate, fetch, parse, updatedb) * 3, invertlinks, solrindex, solrdedup took approximately half an hour with topN 500, but elapsed times now increase to 00h45m, 01h15m, 01h30m with every batch. As I'm uncertain which of the phases takes so much time, I decided to add start and finish times to all classes that implement Tool, so that I at least get a feeling for where the time goes and can review the timings in a log file.
I am using pretty old hardware, but I am planning to recrawl these URLs on a regular basis, and if every iteration is going to take more and more time, index updates will be few and far between.
I added timing information to all Tool classes for consistency, although only 10 or so of the Tools are really interesting.
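A minimal sketch of the kind of start/finish logging described above, not the attached patch itself. The class name TimedPhase and the helpers timed and formatElapsed are hypothetical, and a plain Callable stands in for a Tool's run method so the example runs without Hadoop on the classpath. In Nutch the wrapped call would presumably be something like ToolRunner.run(conf, tool, args), and the output would go to the class's logger rather than System.out.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.Callable;

public class TimedPhase {

    /** Formats a duration in milliseconds as HH:mm:ss. */
    public static String formatElapsed(long millis) {
        long seconds = millis / 1000;
        return String.format("%02d:%02d:%02d",
                seconds / 3600, (seconds % 3600) / 60, seconds % 60);
    }

    /**
     * Runs one phase and prints its start time, finish time and elapsed time.
     * The Callable is a stand-in (assumption) for a Tool's run(String[]) call.
     */
    public static int timed(String name, Callable<Integer> phase) throws Exception {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        long start = System.currentTimeMillis();
        System.out.println(name + ": starting at " + sdf.format(new Date(start)));
        try {
            return phase.call();
        } finally {
            // Logged even if the phase throws, so failed runs still show up.
            long end = System.currentTimeMillis();
            System.out.println(name + ": finished at " + sdf.format(new Date(end))
                    + ", elapsed: " + formatElapsed(end - start));
        }
    }

    public static void main(String[] args) throws Exception {
        // Dummy phase that sleeps briefly and returns exit code 0.
        int exit = timed("Generator", () -> {
            Thread.sleep(50);
            return 0;
        });
        System.out.println("exit code: " + exit);
    }
}
```

With one such pair of log lines per phase, the elapsed times of generate, fetch, parse, updatedb, invertlinks, solrindex and solrdedup can be compared across batches to see which phase is growing.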
Attachments
Issue Links
- is depended upon by NUTCH-697 Generate log output for solr indexer and dedup (Closed)