  1. Nutch
  2. NUTCH-838

Add timing information to all Tool classes

Details

    • Patch Available

    Description

      I am happily trying to crawl a few hundred URLs incrementally, but performance degrades suddenly once the index reaches approximately 25000 URLs.

      At first, each inject (generate, fetch, parse, updatedb) * 3, invertlinks, solrindex, solrdedup batch took approximately half an hour with topN 500, but elapsed times now increase to 00h45m, 01h15m, 01h30m with every batch. As I'm uncertain which of the phases takes so much time, I decided to add start and finish times to all classes that implement Tool, so that I at least get a feel for where the time goes and can review it in a log file.

      I am using pretty old hardware, but I am planning to recrawl these URLs on a regular basis, and if every iteration is going to take more and more time, index updates will be few and far between.

      I added timing information to all Tool classes for consistency, although only 10 or so Tools are really interesting.
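
      The attached patch is not reproduced here. As a rough illustration of the idea only, the sketch below logs a start time, a finish time and the elapsed time around a tool's run method. The class ToolTiming, the SimpleTool interface (a stand-in with the same shape as the run(String[]) method of Hadoop's Tool, so that the example compiles without Hadoop) and the helpers runTimed and elapsed are hypothetical names chosen for this example, and the log format is an assumption rather than what the patch actually prints.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class ToolTiming {

    /** Hypothetical stand-in shaped like Hadoop's Tool#run(String[]). */
    interface SimpleTool {
        int run(String[] args) throws Exception;
    }

    /** Formats the time between two millisecond timestamps as HH:mm:ss. */
    static String elapsed(long start, long end) {
        long seconds = (end - start) / 1000;
        return String.format("%02d:%02d:%02d",
                seconds / 3600, (seconds % 3600) / 60, seconds % 60);
    }

    /** Runs the tool and prints start, finish and elapsed time around it. */
    static int runTimed(String name, SimpleTool tool, String[] args) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        long start = System.currentTimeMillis();
        System.out.println(name + ": starting at " + fmt.format(new Date(start)));
        try {
            return tool.run(args);
        } finally {
            // The finally block ensures the finish line is logged even if the tool throws.
            long end = System.currentTimeMillis();
            System.out.println(name + ": finished at " + fmt.format(new Date(end))
                    + ", elapsed: " + elapsed(start, end));
        }
    }

    public static void main(String[] args) throws Exception {
        // Dummy tool that sleeps briefly in place of a real crawl phase.
        int exit = runTimed("Injector", a -> { Thread.sleep(50); return 0; }, args);
        System.out.println("exit code " + exit);
    }
}
```

      With one such pair of lines per phase in the log, the phase whose elapsed time grows from batch to batch can be picked out directly.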

      Attachments

        1. timings.patch
          63 kB
          Jeroen van Vianen

            People

              chrismattmann Chris A. Mattmann
              jeroenv Jeroen van Vianen