Accumulo / ACCUMULO-375

Wikipedia Ingest needs more parallelism

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem

    Description

      The Wikipedia ingest map job uses a derivative of FileInputFormat that launches one mapper per file. Given the partitioning strategy and workload distribution, it makes sense to launch multiple mappers per file instead. Each mapper can then take a chunk of the articles in the file, chosen with the same partitioning strategy that assigns row IDs.
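
The idea of several mappers sharing one file can be sketched roughly as follows. This is only an illustration, not the ingest code: the class `ArticleChunker`, its method names, and the hash-modulo partition function are all hypothetical, and the sketch assumes that the row-ID partition of an article can be computed from its ID alone. Each of the `mappersPerFile` mappers scans the same file and keeps only the articles whose partition maps to its own index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ArticleChunker {

    // Hypothetical partition function (an assumption, not the real ingest logic):
    // maps an article ID to a partition number in [0, numPartitions).
    public static int partition(String articleId, int numPartitions) {
        return Math.floorMod(articleId.hashCode(), numPartitions);
    }

    // True if the mapper with the given 0-based index owns this article.
    // Ownership is derived from the article's partition, so every article
    // is owned by exactly one of the mappersPerFile mappers.
    public static boolean ownedBy(String articleId, int numPartitions,
                                  int mappersPerFile, int mapperIndex) {
        return partition(articleId, numPartitions) % mappersPerFile == mapperIndex;
    }

    // Returns the subset of article IDs that one mapper would process.
    public static List<String> chunkFor(List<String> articleIds, int numPartitions,
                                        int mappersPerFile, int mapperIndex) {
        List<String> chunk = new ArrayList<>();
        for (String id : articleIds) {
            if (ownedBy(id, numPartitions, mappersPerFile, mapperIndex)) {
                chunk.add(id);
            }
        }
        return chunk;
    }

    public static void main(String[] args) {
        // Made-up article IDs, used only to demonstrate the split.
        List<String> ids = Arrays.asList("12", "25", "39", "290", "303", "307");
        int numPartitions = 8;   // assumed number of row-ID partitions
        int mappersPerFile = 3;  // assumed number of mappers sharing one file
        for (int m = 0; m < mappersPerFile; m++) {
            System.out.println("mapper " + m + " -> "
                    + chunkFor(ids, numPartitions, mappersPerFile, m));
        }
    }
}
```

In this sketch every mapper still reads the whole file and only filters what it processes, so the gain is in downstream work per mapper rather than in read I/O. Choosing `mappersPerFile` larger than `numPartitions` would leave some mappers with nothing to do.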


          People

            Assignee: Adam Fuchs (afuchs)
            Reporter: Adam Fuchs (afuchs)
