Nutch / NUTCH-2463

Enable sampling CrawlDB


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 1.14
    • Component/s: crawldb
    • Labels: None

    Description

      The CrawlDb can grow to contain billions of records. When that happens, readdb -dump is pretty useless, and readdb -topN can run for ages (and does not provide a statistically correct sample).
      We should add a -sample parameter to readdb -dump, followed by a number between 0 and 1, so that only that fraction of the CrawlDb records is processed.
      The sample should be statistically random, and all the other filters should be applied to the sampled records. A hedged sketch of what this could look like follows below.
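
      An illustrative CLI invocation under that proposal (the output path and the exact option syntax here are assumptions, not the committed behaviour) could look like:

          bin/nutch readdb crawl/crawldb -dump crawldb_sample -sample 0.01

      A minimal sketch of how the sampling step could be wired into the dump job, assuming a Hadoop mapper over CrawlDb entries; the class name SamplingDumpMapper and the configuration key db.reader.dump.sample are hypothetical and not the actual Nutch implementation:

          import java.io.IOException;
          import java.util.Random;

          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.Mapper;
          import org.apache.nutch.crawl.CrawlDatum;

          /**
           * Illustrative mapper: passes each CrawlDb record through with
           * probability "sample" (0 < sample <= 1), yielding a statistically
           * random subset before the usual dump filters are applied.
           */
          public class SamplingDumpMapper
              extends Mapper<Text, CrawlDatum, Text, CrawlDatum> {

            private float sample;
            private Random random;

            @Override
            protected void setup(Context context) {
              // Hypothetical configuration key carrying the value of -sample.
              sample = context.getConfiguration()
                  .getFloat("db.reader.dump.sample", 1.0f);
              random = new Random();
            }

            @Override
            protected void map(Text url, CrawlDatum datum, Context context)
                throws IOException, InterruptedException {
              // Keep roughly "sample" fraction of the records, drawn
              // independently per record, so the result is a uniform
              // random sample of the CrawlDb.
              if (random.nextFloat() <= sample) {
                context.write(url, datum);
              }
            }
          }

      Sampling each record independently keeps the map-side logic stateless and scales to billions of records; the trade-off is that the resulting sample size is only approximately sample * total, not exact.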

      Attachments

        Activity


          People

            Assignee: Unassigned
            Reporter: Yossi Tamari
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved:
