Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.2.0
    • Component/s: None
    • Labels: None

      Description

      To benchmark Pig performance, we need a TPC-H-like large data set plus a collection of scripts. These would be used to compare different Pig releases and to compare Pig against other systems (e.g. Pig + Hadoop vs. Hadoop only).

      Here is the wiki page for small tests: http://wiki.apache.org/pig/PigPerformance

      I am currently running long-running Pig scripts over data sets on the order of tens of TBs. The next step is hundreds of TBs.

      We need an open large data set (that is, open-source scripts that generate the data set) and detailed scripts for important operations such as ORDER, AGGREGATION, etc.
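
      As a rough illustration of what an open data-generation script could look like, here is a minimal Python sketch. It is not the generator attached to this issue; the column names (user, action, time_spent, revenue), the Pareto-based key skew, and the function names are all assumptions made for the example. The fixed seed keeps runs reproducible, so different Pig releases can be compared on identical input.

```python
import csv
import io
import random


def generate_rows(num_rows, num_users=1000, seed=42):
    """Yield (user, action, time_spent, revenue) tuples deterministically.

    All column names and value ranges are hypothetical.
    """
    rng = random.Random(seed)  # fixed seed so every run produces the same data
    for _ in range(num_rows):
        # Pareto draw skews ids toward small values, so grouping and
        # ordering see a few hot keys; the cap keeps ids within num_users.
        user_id = min(int(rng.paretovariate(1.2)), num_users)
        action = rng.randint(1, 5)
        time_spent = rng.randint(0, 600)
        revenue = round(rng.uniform(0.0, 100.0), 2)
        yield (f"user{user_id}", action, time_spent, revenue)


def render_tsv(rows):
    """Render rows as tab-separated text, one record per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Print a small sample; a real benchmark would write far more rows to files.
    print(render_tsv(generate_rows(5)), end="")
```

      Tab-separated output is used here only because it is easy to load as delimited text; the row count and key skew would need tuning to approach the data sizes discussed in this issue.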

      We can call those the Pig Workouts: Cardio (short processing), Marathon (long-running scripts) and Triathlon (a mix of both).

      I will update this JIRA with more details of current activities soon.

        Attachments

        1. PIG-200-0.12.patch (218 kB, Daniel Dai)
        2. pig-0.8.1-vs-0.9.0.png (8 kB, Jie Li)
        3. pigmix_pig0.11.patch (194 kB, Dmitriy V. Ryaboy)
        4. pigmix2.patch (200 kB, Daniel Dai)
        5. perf-0.6.patch (152 kB, Daniel Dai)
        6. perf.hadoop.patch (33 kB, Ying He)
        7. perf.patch (153 kB, Alan Gates)
        8. generate_data.pl (10 kB, Alan Gates)

              People

              • Assignee: Alan Gates (alangates)
              • Reporter: Amir Youssefi (amirhyoussefi)
              • Votes: 0
              • Watchers: 10
