Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 1.14
    • Component/s: crawldb
    • Labels: None

Description

CrawlDB can grow to contain billions of records. When that happens, readdb -dump is practically unusable, and readdb -topN can run for ages (and does not produce a statistically valid sample).
We should add a -sample parameter to readdb -dump, followed by a number between 0 and 1, so that only that fraction of the records in the CrawlDB is processed.
The sampling should be statistically random, and all the other filters should be applied to the sampled records.
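
For illustration, here is a minimal sketch of how such a sampling step could look as a Hadoop mapper. The class name SampledDumpMapper and the config key db.reader.dump.sample are hypothetical stand-ins, not the actual CrawlDbReader code; the point is simply to keep each record independently with the requested probability before the existing dump filters run.

{code:java}
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.nutch.crawl.CrawlDatum;

/**
 * Hypothetical sketch of the proposed -sample option: each CrawlDB
 * record is emitted with probability "sample", yielding a statistically
 * random subset. The other -dump filters (status, regex, min/max score)
 * would be applied to the records that pass this check.
 */
public class SampledDumpMapper
    extends Mapper<Text, CrawlDatum, Text, CrawlDatum> {

  private float sample;                // fraction in (0, 1]
  private final Random rnd = new Random();

  @Override
  protected void setup(Context context) {
    // "db.reader.dump.sample" is an assumed config key for illustration.
    sample = context.getConfiguration()
                    .getFloat("db.reader.dump.sample", 1.0f);
  }

  @Override
  protected void map(Text url, CrawlDatum datum, Context context)
      throws IOException, InterruptedException {
    // Skip roughly (1 - sample) of the records uniformly at random.
    if (rnd.nextFloat() > sample) {
      return;
    }
    // Existing -dump filtering logic would run here, on the sampled record.
    context.write(url, datum);
  }
}
{code}

Note that with per-record coin flips the output size is binomial around sample * N rather than exact, which is consistent with the goal of a statistically random sample. The proposed invocation would look along the lines of: bin/nutch readdb crawl/crawldb -dump dump_dir -sample 0.001.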

People

    • Assignee: Unassigned
    • Reporter: Yossi Tamari
    • Votes: 0
    • Watchers: 4
