PIG-2397: Running TPC-H Benchmark on Pig

    Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Tags:
      Pig TPC-H benchmark

      Description

      For a class project we developed a whole set of Pig scripts for TPC-H. Our goals are:

      1) identifying the bottlenecks in Pig's performance, especially in its relational operators;

      2) studying how to write efficient scripts by making full use of Pig Latin's features;

      3) comparing with Hive's TPC-H results to verify both 1) and 2).

      We will update the JIRA with our scripts, results and analysis soon.

      1. pig_tpch.ppt
        341 kB
        Jie Li
      2. TPC-H_on_Pig.tgz
        10 kB
        Jie Li


          Activity

          alex gemini added a comment -

          600 map tasks: 2334 sec and 200 map tasks: 2157 sec. I guess the join will not benefit from more maps, since each map needs to read the small-table intermediate result again and deserialize it into Java in-memory objects. Can you share the hdfs_read_bytes metrics?

          rim moussar added a comment -

          Hi,
          I've sent you the report by email, and I've uploaded Pig scripts using the big TPC-H file on my homepage (https://sites.google.com/site/rimmoussa/research).

          cheers,
          Dr. Rim Moussa

          Jie Li added a comment -

          Thanks very much for letting us know. Sorry that we didn't find them earlier; otherwise we wouldn't have needed to write the scripts ourselves and could have spent more time analysing the results.

          Is your report available?

          rim moussar added a comment -

          Hi,
          please note that the TPC-H queries were translated into Pig Latin and have been available for download from my homepage https://sites.google.com/site/rimmoussa/research since July 2011.

          Also, to avoid the cost of join operations, I've combined all the .tbl files into one file.

          cheers,
          Dr. Rim Moussa

          Thejas M Nair added a comment -

          > When the split sizes are comparable for TPC-H Q1, Hive's tasks finish in about 60 seconds on average, while Pig takes about 84 seconds. I believe this is due to the fact that Hive triggers in-mem aggregation and output based on memory utilization; we have a hardcoded MAX_SIZE_CURVAL_CACHE = 1024. In this particular case, that means Hive's tasks output 4 records (a single aggregation), while we output 28 (9 aggregations). If we make MAX_SIZE_CURVAL_CACHE configurable, or based on memory, we can probably improve performance for small records.

          MAX_SIZE_CURVAL_CACHE limits the number of values held in memory for a particular group-key. Once that limit is hit or a new group-key is seen, it aggregates the values for that key and stores the result back in the hash-map. That does not trigger dumping to disk. Are you saying that you got 28 output records from a single map, though there were only 4 unique group-keys? I expect only 4 output records from a single map, because the hashmap with 4 entries should easily fit in memory. If that is the case, I need to check why that might be happening.

          Dmitriy V. Ryaboy added a comment -

          Interesting; tasks take much longer to set up on my cluster (10s vs the ~5s your numbers indicate), probably due to overall pressure on the JT. Thanks for running the numbers for Hive.

          I'll profile a few map tasks on Q1 to see where it's spending time.

          Jie Li added a comment -

          Here is a comparison of Hive's Q1 over 100GB:

          600 map tasks: 2334 sec
          200 map tasks: 2157 sec

          Jie Li added a comment -

          > The variability in my numbers is pretty much completely due to delays in task scheduling (busy cluster).

          I see. We used a dedicated cluster (though it was on Amazon EC2).

          > I find it hard to imagine that changing the split size by 8x didn't affect Hive performance

          For Q1 over 100GB of data, the lineitem table consists of 600 HDFS blocks (our default block size is 128MB), so 600 map tasks need about 40 waves in our cluster (16 map slots). If each task takes 2 seconds to set up, the total task setup time is around 80 seconds, which can be ignored compared to Hive's 2300 seconds.

          Dmitriy V. Ryaboy added a comment -

          The variability in my numbers is pretty much completely due to delays in task scheduling (busy cluster). The amount of time the tasks themselves took stayed quite consistent. That's part of my point about the fallacy of just measuring elapsed wall-clock time as a single metric. You have to account for degree of parallelism and separate out framework overhead from time spent on work.

          I find it hard to imagine that changing the split size by 8x didn't affect Hive performance on an under-provisioned cluster (meaning, in this case, a cluster that can't run all the tasks in parallel simultaneously). Perhaps Hive is doing some task aggregation for you under the covers. What was the number of map tasks it spun up? What was the number of tasks Pig spun up?

          Jie Li added a comment -

          Thanks Dmitriy for your analysis.

          > Hive ranged between 160 and 240 seconds, while Pig ranged between 290 and 350 (ish) on several runs of Q1.

          It seems your results were not stable. Most of our benchmark results have a variance within 5%.

          > 1) The hive TPC-H scripts set mapred.min.split.size=536870912 while Pig ones do not.

          Yeah, we should have mentioned this. We actually removed this parameter from Hive's query file, and we found it didn't make much difference to Hive. I agree that in this case it might be better to have fewer map tasks, as we don't need to overlap them with the reduce tasks. For our project we didn't optimize the configuration.

          > 2) We generate a sampling job for an ORDER-BY even when the parallelism of that operator is set to 1

          Yeah, it would be nice to decide dynamically whether the sampling job is needed.

          Dmitriy V. Ryaboy added a comment -

          I was curious about the massive difference between what Jie was seeing for Hive and Pig on Q1, and did a little digging of my own.
          I couldn't get the same difference in performance out of the box at all on my cluster – Hive ranged between 160 and 240 seconds, while Pig ranged between 290 and 350 (ish) on several runs of Q1.

          Digging in a little further, I think there are 3 things worth noting:
          1) The hive TPC-H scripts set mapred.min.split.size=536870912 while Pig ones do not. This means Pig will pick up whatever the cluster defaults are, and the difference in # of mappers will be greatly exaggerated when running on small clusters incapable of running hundreds of tasks in parallel (task set-up costs will keep accumulating). I recommend this parameter be set to be the same as the one in Hive TPC-H in PIG-2397, for consistency.

          2) We generate a sampling job for an ORDER-BY even when the parallelism of that operator is set to 1 (so sampling and custom partitioning is useless). That's just free performance gains, and comes up in many real-life cases, not just benchmarks. We should fix this and get 30 seconds per job back.

          3) When the split sizes are comparable for TPC-H Q1, Hive's tasks finish in about 60 seconds on average, while Pig takes about 84 seconds. I believe this is due to the fact that Hive triggers in-mem aggregation and output based on memory utilization; we have a hardcoded MAX_SIZE_CURVAL_CACHE = 1024. In this particular case, that means Hive's tasks output 4 records (a single aggregation), while we output 28 (9 aggregations). If we make MAX_SIZE_CURVAL_CACHE configurable, or based on memory, we can probably improve performance for small records.

          D
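
          For reference, the split-size setting from point 1 can be applied directly in a Pig script. This is a minimal sketch; the property name and value are taken from the Hive TPC-H scripts mentioned above:

              -- match the Hive TPC-H minimum split size (536870912 bytes = 512 MB)
              -- so both systems spin up a comparable number of map tasks
              set mapred.min.split.size 536870912;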

          Jie Li added a comment -

          Dmitriy, thanks for your comments. As I mentioned in PIG-2228 (https://issues.apache.org/jira/browse/PIG-2228), Q1's group-by has only four distinct groups, so the order-by was just sorting four groups of numbers. We can assume the group-by took most of the time.

          Dmitriy V. Ryaboy added a comment -

          Jie, something just occurred to me about Q1 – are you sure Hive is doing the right thing here?
          If it's using more than 1 reducer for the ORDER operation, it's not:

          Syntax of Order By
          The ORDER BY syntax in Hive QL is similar to the syntax of ORDER BY in SQL language.
          
          colOrder: ( ASC | DESC )
          orderBy: ORDER BY colName colOrder? (',' colName colOrder?)*
          query: SELECT expression (',' expression)* FROM src orderBy
          There are some limitations in the "order by" clause. In the strict mode 
          (i.e., hive.mapred.mode=strict), the order by clause has to be followed 
          by a "limit" clause. The limit clause is not necessary if you set 
          hive.mapred.mode to nonstrict. The reason is that in order to impose 
          total order of all results, there has to be one reducer to sort the final 
          output. If the number of rows in the output is too large, the single reducer 
          could take a very long time to finish.
          
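          For comparison, a hypothetical Pig Latin sketch of the corresponding single-reducer sort (results and revenue are placeholder names, not from the benchmark scripts):

              -- Pig's total sort; even with PARALLEL 1 a sampling job is
              -- currently generated first, as noted earlier in this thread
              sorted = ORDER results BY revenue DESC PARALLEL 1;
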
          Jie Li added a comment -

          I did a quick test of the 0.10 branch for the hash-based agg, using the two simplest queries, Q1 and Q6. I'll update the results in PIG-2228.

          Thejas M Nair added a comment -

          Created PIG-2423 to update the Pig documentation on using co-group instead of join.

          Thejas M Nair added a comment -

          Jie,
          Thanks for doing this benchmark and providing the analysis of the factors that affect the overall performance. This is very useful.

          One of the reasons for creating a new language for Pig was to enable users to express a more optimal query plan in the query itself. The language lets you express the optimizations mentioned in 2, 3, 4 and 5 directly. This is a very useful feature of Pig, because even if the optimizer is very good, there will be cases where it does the wrong thing. Also, it will be some time before a good cost-based optimizer is available for Pig.

          As you mention, Pig currently only has a rule-based optimizer. The optimizations 2 - 5 that you mention should improve performance in almost all cases, so it makes sense to implement them as rules in Pig.

          Regarding 5, the work on lazy de-serialization done in PIG-2359 is going to be useful.

          Regarding the optimization of a join followed by a group-by: even if the join and group keys are different, hash-based aggregation can be used to reduce the size of the output written to HDFS by the join's MR job, by doing partial aggregation in the reduce.
          For the case where the join and group have the same keys, having the Pig optimizer rewrite the query into a co-group operation might be the easiest thing to do.

          I don't think optimization tips 2 and 3 are in the Pig documentation; it makes sense to document them. I will open another jira to address that.

          I am really looking forward to seeing the results with the pig 0.10 branch (http://svn.apache.org/repos/asf/pig/branches/branch-0.10) (with hash-based agg enabled).
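
          The co-group rewrite described above can be sketched in Pig Latin. This is an illustration only; the relation and field names (orders, customer, o_custkey, c_custkey, o_totalprice) are hypothetical TPC-H-style placeholders, not taken from the benchmark scripts:

              -- a join followed by a group-by on the same key costs two MR jobs:
              J = JOIN orders BY o_custkey, customer BY c_custkey;
              G = GROUP J BY o_custkey;

              -- a COGROUP on that key does the grouping in a single job;
              -- aggregation is then done over the grouped bags:
              CG = COGROUP orders BY o_custkey, customer BY c_custkey;
              R = FOREACH CG GENERATE group, COUNT(orders), SUM(orders.o_totalprice);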

          Jie Li added a comment -

          As we summarized in the slides, there are many factors affecting the overall performance:

          1. Reorder JOINs properly
          Not sure if Pig is moving toward a cost-based optimizer that figures out a better join order/type.

          2&3. Multi-query optimization, e.g. using COGROUP for JOIN + GROUP, and using FLATTEN for self-join + GROUP
          First, can such rules be implemented as query-rewriting rules? They seem to be always beneficial.
          Second, even without query rewriting, it's still possible to take advantage of a common key between the join and the group. Hive is working on this now (https://issues.apache.org/jira/browse/HIVE-1772).

          4. Project before (CO)GROUP (https://issues.apache.org/jira/browse/PIG-1324)
          This issue affects the performance of GROUP/COGROUP. Currently users have to explicitly drop redundant columns before a (CO)GROUP, which is inconvenient and hard to do completely.

          5. Remove redundant types in LOAD (https://issues.apache.org/jira/browse/PIG-410)
          Again, inconvenient for users; this should be done by Pig itself. For columns with types, could Pig delay the conversion?

          6. Hash-based aggregation
          This feature should significantly improve group-by. Not sure whether it can be further used for multi-query optimization; for example, for X (whatever job) + group-by, we could use a hash-based combiner at the end of X, as Hive already does.

          Except for the first factor (join reordering), which Hive cannot do either, all the other factors can contribute to Hive's outperformance. Glad to see that Pig has already implemented hash-based aggregation. We're excited to repeat the benchmark with this new feature.
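
          Point 4 can be illustrated with a short Pig Latin sketch. The relation and column names are placeholders in the style of TPC-H's lineitem table, not from the actual scripts:

              -- explicitly project only the needed columns before GROUP, since
              -- Pig does not yet prune unused columns automatically (PIG-1324):
              lineitem_small = FOREACH lineitem GENERATE l_returnflag, l_linestatus, l_quantity;
              grouped = GROUP lineitem_small BY (l_returnflag, l_linestatus);
              result = FOREACH grouped GENERATE group, SUM(lineitem_small.l_quantity);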

          Thejas M Nair added a comment -

          Most of the queries where Hive outperformed Pig are SQL group-by queries. With the map-side hash-based aggregation added in PIG-2228, group-by queries see improvements of up to 50%. Can you try these queries with the pig 0.10 branch and the pig.exec.mapPartAgg property set to true?
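
          The property can be set at the top of a Pig script; a minimal sketch:

              -- enable map-side partial (hash-based) aggregation, per PIG-2228
              set pig.exec.mapPartAgg true;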

          Jie Li added a comment -

          We have updated the slides and the scripts we used. Any comments are appreciated. We will post our report later.


            People

            • Assignee: Unassigned
            • Reporter: Jie Li
            • Votes: 0
            • Watchers: 8
