Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.2.0
    • Component/s: None
    • Labels:
      None

      Description

      To benchmark Pig performance, we need a TPC-H-like large data set plus a collection of scripts. These would be used to compare different Pig releases and to compare Pig against other systems (e.g. Pig + Hadoop vs. Hadoop only).

      Here is the wiki page for small tests: http://wiki.apache.org/pig/PigPerformance

      I am currently running long-running Pig scripts over data sets on the order of tens of TB. The next step is hundreds of TB.

      We need an open large data set (produced by open-source scripts that generate the data) and detailed scripts for important operations such as ORDER and AGGREGATION.
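
      As a rough illustration of what a reproducible, open data generator could look like, here is a minimal Python sketch. It is not the attached generate_data.pl; the function names, the three-column layout (user, action, time spent) and the value ranges are all assumptions made up for the example. The point it shows is that a fixed seed lets anyone regenerate the same data set, so runs on different Pig releases see identical input.

```python
import random


def generate_rows(num_rows, seed=0, num_users=1000):
    """Yield tab-separated rows: user id, action code, time spent in seconds.

    The schema is hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)  # fixed seed: the same arguments give the same rows
    for _ in range(num_rows):
        user = "user%06d" % rng.randrange(num_users)
        action = rng.randrange(1, 6)        # action code in 1..5
        timespent = rng.randrange(0, 3600)  # seconds, 0..3599
        yield "%s\t%d\t%d" % (user, action, timespent)


def write_dataset(path, num_rows, seed=0):
    """Write the generated rows to a text file, one row per line."""
    with open(path, "w") as out:
        for row in generate_rows(num_rows, seed):
            out.write(row + "\n")
```

      Calling write_dataset("page_views.txt", 1000000, seed=42) would write one million rows. Scaling to the sizes discussed here would require running many such generators in parallel with distinct seeds, which this sketch does not attempt.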

      We can call those the Pig Workouts: Cardio (short processing), Marathon (long-running scripts) and Triathlon (a mix of both).

      I will update this JIRA with more details of current activities soon.

      Attachments

      1. generate_data.pl (10 kB, Alan Gates)
      2. perf.hadoop.patch (33 kB, Ying He)
      3. perf.patch (153 kB, Alan Gates)
      4. perf-0.6.patch (152 kB, Daniel Dai)
      5. pig-0.8.1-vs-0.9.0.png (8 kB, Jie Li)
      6. PIG-200-0.12.patch (218 kB, Daniel Dai)
      7. pigmix_pig0.11.patch (194 kB, Dmitriy V. Ryaboy)
      8. pigmix2.patch (200 kB, Daniel Dai)


            People

            • Assignee:
              Alan Gates
              Reporter:
              Amir Youssefi
            • Votes:
              0
              Watchers:
              11
