Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      We need a performance benchmark to measure and track Hive's performance improvements.

      Some references:
      PIG performance benchmarks PIG-200
      PigMix: http://wiki.apache.org/pig/PigMix

      1. AlansMRcode.tgz
        2 kB
        Alan Gates
      2. hive_benchmark_2009-07-21.tar.gz
        258 kB
        Yuntao Jia
      3. hive_benchmark_2009-07-12.pdf
        298 kB
        Yuntao Jia
      4. hive_benchmark_2009-06-18.pdf
        61 kB
        Zheng Shao
      5. hive_benchmark_2009-06-18.tar.gz
        430 kB
        Zheng Shao

        Issue Links

          Activity

          dd added a comment -

          Hi

           I'm trying to run the mapreduce version of the ranking_select job of the hive-benchmarks, and I'm getting a Java error (see below).
           I thought the error came from the rankings data, but the rankings_uservisits_join job runs without any problem.

          Any help would be appreciated.

          java.lang.NumberFormatException: For input string: ""
          at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
          at java.lang.Integer.parseInt(Integer.java:470)
          at java.lang.Integer.valueOf(Integer.java:554)
          at edu.brown.cs.mapreduce.benchmarks.Benchmark1$TextMap.map(Benchmark1.java:76)
          at edu.brown.cs.mapreduce.benchmarks.Benchmark1$TextMap.map(Benchmark1.java:70)
          at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
          at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
          at org.apache.hadoop.mapred.Child.main(Child.java:170)

          Jongse Park added a comment -

          Hi

           I'm trying to use the Hive performance benchmark.
           I read the README file and followed it step by step, and finally I reached the
           data generation part. Teragen has no problem, but I hit a problem with htmlgen.

           I configured things for my own cluster (40 VMs) and launched generateData.py,
           but it never finished! I launched it about 10 hours ago.
           So I modified UserVisits and Rankings in config.txt to be smaller, but that also failed.
           Even then, it consumed very little resources when I looked at 'top' on the Ubuntu machines.

           Have you guys ever run into this kind of problem?
           And if you have, how did you solve it?
           Thank you so much.

          Vasilis Liaskovitis added a comment -

           I am trying to run the hive benchmarks on a small 8-node hadoop cluster, starting from the 2009-07-21 tarball. I am using hadoop-0.20.1 and hive-0.4.1. I have a few questions; any help is welcome.

           • datasize: the report states 60GB of data used for the uservisits table. The default datagen/htmlgen/config.txt generates ~15GB of UserVisits data per node. I believe the way the generateData.py script runs, we get one rankings/uservisits table per node, i.e. on 10 data nodes we'd get 10 tables x 15GB = 150GB of data. Do the benchmarks in the report use all of the generated data, i.e. UserVisits_0, UserVisits_1, ..., UserVisits_10 across the 10 data nodes? Or a subset of these? Same question for the rankings table.
           • I assume that the hadoop-site.xml and hive-site.xml files under confs/ (from the 2009-07-21 tarball) accurately reflect the hadoop/hive configuration used to produce the results in the report. Let me know if that's not the case; I've been using these configs to run the benchmarks.
           • Regarding query4 (join) specifically, the reduce phase of the first job (main job with ) shows a strange behaviour. Looking at the jobtracker GUI, out of 60 reducers, 59 finish relatively fast (~1 min), but one reducer task takes much longer (~19 min). Does that point to a specific hadoop/hive misconfiguration? No other hadoop job that I've run exhibits this. Is this intuitive behaviour given a join operation?
          Zheng Shao added a comment -

          Alan, did you run your benchmark with the new option we suggested? Also you might want to update hive to trunk to take advantage of new hive performance improvements introduced by HIVE-732.

          Namit Jain added a comment -

           > I looked in hive-default.xml and didn't see any hive.merge.mapfiles. Should I add it to hive-default.xml and set it to false?

          YES

          > Out of curiosity, why do you default to merging map files first?

           There are production map-only jobs that produce a lot of small files across many partitions. That increases the load on the name node, and leads to too many mappers when the data across those partitions is processed later.

          Alan Gates made changes -
          Attachment AlansMRcode.tgz [ 12416097 ]
          Alan Gates added a comment -

          > How many mapper slots and reducer slots are there in the cluster?
          There are 36 mapper and 36 reducer slots on the cluster.

          > How many mappers and reducers did hadoop, hive and pig take?
          Hadoop and Hive took 35 maps, pig took 36. I set all to use 4 reducers.

          > Are you using hive trunk? What is the hive svn revision number?
          SVN revision 796069

          > I am also interested in learning how you write the efficient hadoop code for the aggregation query. Can you attach your hadoop code?
          Attached as AlansMRCode.tgz

           I looked in hive-default.xml and didn't see any hive.merge.mapfiles. Should I add it to hive-default.xml and set it to false? Out of curiosity, why do you default to merging map files first?

          Yuntao Jia added a comment -

           Can you check how many jobs there are in the Hive query? If there are two, it means the output from Hive is merged into a single file by running an additional map-reduce job. If that is the case, you can turn it off by changing the following property in hive-default.xml to false (it is true by default).

           hive.merge.mapfiles = true
           "Merge small files at the end of a map-only job"

          Other than that, I have no idea why Hive is so slow.

          Zheng Shao added a comment -

           Yuntao is on vacation right now - he will be back next week and can answer these questions (hive configuration, etc.) better.

          How many mapper slots and reducer slots are there in the cluster? How many mappers and reducers did hadoop, hive and pig take?
          Are you using hive trunk? What is the hive svn revision number?
          I am also interested in learning how you write the efficient hadoop code for the aggregation query. Can you attach your hadoop code?

          Alan Gates added a comment -

          I ran the uservisits_aggre query on one of our clusters and got the following results:

           MR Time: 108
           Hive Time: 206
           Pig Time: 182

          These are wall clock times from beginning to end of the job. The MR job I used is code I wrote, not the code included in the benchmark since it was obviously sub-optimal.

           The cluster I used was fairly beefy: 10 boxes (1 as NN and JT, 9 as slaves), 4 disks (800GB each), 16GB of memory, and 2 quad-core 2.5GHz Xeon processors. I set io.sort.mb to 1024, dfs.block.size to 536870912 (512MB), and the Java heap size to 2GB.
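
           In JobConf terms, those knobs map to roughly the following (a sketch only; the actual changes were presumably made cluster-wide in hadoop-site.xml rather than per job, and mapred.child.java.opts is my guess for where the 2GB heap was set):

           import org.apache.hadoop.mapred.JobConf;

           public class TuningSketch {
               public static JobConf tunedConf() {
                   JobConf conf = new JobConf();
                   conf.setInt("io.sort.mb", 1024);                 // map-side sort buffer, in MB
                   conf.setLong("dfs.block.size", 536870912L);      // 512MB blocks for files the job writes
                   conf.set("mapred.child.java.opts", "-Xmx2048m"); // 2GB heap for task JVMs
                   return conf;
               }
           }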

          I did not make any configuration changes to hive to take advantage of the larger boxes. Any thoughts on how I ought to tune hive for this cluster?

          Yuntao Jia made changes -
          Attachment hive_benchmark_2009-07-12.tar.gz [ 12413334 ]
          Yuntao Jia made changes -
          Attachment hive_benchmark_2009-07-21.tar.gz [ 12414156 ]
          Yuntao Jia added a comment -

           Updated the benchmark script to make it more automatic. It now outputs all the timings to a CSV file which looks like:

          Timings, grep select, rankings select, uservisits aggregation, uservisits-rankings join
          Trial 1
          Hive,126.3,25.0,546.1,447.9,
          PIG,240.5,31.0,672.3,658.3,
          Hadoop,135.4,21.6,394.9,486.1
          Trial 2
          Hive,126.3,25.0,546.1,447.9,
          PIG,240.5,31.0,672.3,658.3,
          Hadoop,135.4,21.6,394.9,486.1
          Trial 3
          Hive,126.3,25.0,546.1,447.9,
          PIG,240.5,31.0,672.3,658.3,
          Hadoop,135.4,21.6,394.9,486.1

           The first line shows the queries, followed by query timings from the different trials. Within each trial, there are three lines showing the query timings on Hive, PIG and Hadoop, respectively. The numbers here are for illustration purposes only.
           The file can be opened directly in Excel, and users can then easily generate a performance graph from it.

          Yuntao Jia made changes -
          Assignee Yuntao Jia [ yuntao ]
          Yuntao Jia made changes -
          Attachment hive_benchmark_2009-07-12.pdf [ 12413335 ]
          Yuntao Jia made changes -
          Attachment hive_benchmark_2009-07-12.pdf [ 12413737 ]
          Yuntao Jia added a comment -

           Revised the benchmark report. Thanks to Raghu Murthy for his help.

          Yuntao Jia made changes -
          Attachment benchmark_report_2009-07-03.pdf [ 12412491 ]
          Yuntao Jia made changes -
          Attachment pig_queries.tar.gz [ 12412492 ]
          Yuntao Jia made changes -
          Attachment hive_benchmark_2009-07-12.tar.gz [ 12413334 ]
          Attachment hive_benchmark_2009-07-12.pdf [ 12413335 ]
          Yuntao Jia added a comment -

           The latest Hive benchmark. It covers a few things:

           1. Includes timing results that use Lzo compression to compress the intermediate map output data.
           2. Includes timing results without compression of the intermediate map output data.
           3. Includes the cluster hardware and software information.
           4. Includes updated Hive benchmark queries.
           5. Includes updated PIG benchmark queries.
           6. Includes updated hadoop job source code.
           7. A few other minor changes, such as the README, etc.

          Zheng Shao added a comment -

          Thanks Yuntao. Can you add some more details to the report?
          1. The exact machine hardware and software configurations: cpu, memory, disk, network, linux version, lzo library version.
          2. The speed-up percentages of changing mapred.map.output.compression.codec from gzip to lzo.

          Yuntao Jia made changes -
          Attachment pig_queries.tar.gz [ 12412492 ]
          Yuntao Jia added a comment -

          The new PIG benchmark queries which incorporate the comments from Alan Gates.

          Yuntao Jia made changes -
          Attachment benchmark_report_2009-07-03.pdf [ 12412491 ]
          Yuntao Jia added a comment -

          The new benchmark report which incorporates comments from Alan Gates on the PIG queries. We also used Lzo to compress the intermediate map output data.

          Yuntao Jia added a comment -

           I will post the new numbers with two changes. First, I will update the PIG queries based on your comments. Second, I will use the Lzo codec for intermediate data compression.
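
           Concretely, switching the intermediate (map output) compression to Lzo amounts to something like the following job settings (a sketch; the exact codec class depends on which lzo package is installed on the cluster, and the same keys can also be set in the site configuration):

           import org.apache.hadoop.mapred.JobConf;

           public class LzoMapOutputSketch {
               public static void enableLzoMapOutput(JobConf conf) {
                   // Compress only the intermediate (map output) data; final job output is unaffected.
                   conf.setBoolean("mapred.compress.map.output", true);
                   // Assumes the hadoop-lzo codec and its native libraries are available on every node.
                   conf.set("mapred.map.output.compression.codec",
                            "com.hadoop.compression.lzo.LzoCodec");
               }
           }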

          Alan Gates added a comment -

          Any updates on this? We're anxious to see the numbers after the Pig scripts have been optimized.

          Zheng Shao made changes -
          Link This issue incorporates HIVE-600 [ HIVE-600 ]
          Alan Gates added a comment -

          Comments on how to speed up the Pig Latin scripts used in this benchmark.

          grep_select.pig:

          Adding types in the LOAD statement will force Pig to cast the key field, even though it doesn't need to (it only reads and writes the key field). So I'd change the query to be:

          rmf output/PIG_bench/grep_select;
          a = load '/data/grep/*' using PigStorage as (key,field);
          b = filter a by field matches '.*XYZ.*';
          store b into 'output/PIG_bench/grep_select';
          

          field will still be cast to a chararray for the matches, but we won't waste time casting key and then turning it back into bytes for the store.

          rankings_select.pig:

          Same comment, remove the casts. pagerank will be properly cast to an integer.

          rmf output/PIG_bench/rankings_select;
          a = load '/data/rankings/*' using PigStorage('|') as (pagerank,pageurl,aveduration);
          b = filter a by pagerank > 10;
          store b into 'output/PIG_bench/rankings_select';
          

          rankings_uservisits_join.pig:

          Here you want to keep the cast of pagerank so that it is handled as the right type, since AVG can take either double or int and would default to double. adRevenue will default to double in SUM when you don't specify a type.

          You want to project out all unneeded columns as soon as possible.

           You should set PARALLEL on the join to use the number of reducers appropriate for your cluster. Given that you have 10 machines and 5 reduce slots per machine, and speculative execution is off, you probably want 50 reducers. (I'm assuming here that when you say you have a 10 node cluster you mean 10 data nodes, not counting your name node and job tracker. The reduce formula should be 5 * number of data nodes.)

           I notice you set parallel to 60 on the group by. With only 50 reduce slots, 50 of those reducers run in the first wave and the remaining 10 run as a second, trailing wave. Unless you have a need for the result to be split 60 ways you should reduce that to 50 as well.

          A last question is how large are the uservisits and rankings data sets? If either is < 80M or so you can use the fragment/replicate join, which is much faster than the general join. The following script assumes that isn't the case; but if it is let me know and I can show you the syntax for it.

          So the end query looks like:

          rmf output/PIG_bench/html_join;
          a = load '/data/uservisits/*' using PigStorage('|') as
           	(sourceIP,destURL,visitDate,adRevenue,userAgent,countryCode,languageCode,searchWord,duration);
          b = load '/data/rankings/*' using PigStorage('|') as (pagerank:int,pageurl,aveduration);
          c = filter a by visitDate > '1999-01-01' AND visitDate < '2000-01-01';
           c1 = foreach c generate sourceIP, destURL, adRevenue;
          b1 = foreach b generate pagerank, pageurl; 
          d = JOIN c1 by destURL, b1 by pageurl parallel 50;
          d1 = foreach d generate sourceIP, pagerank, adRevenue;
          e = group d1 by sourceIP parallel 50;
          f = FOREACH e GENERATE group, AVG(d1.pagerank), SUM(d1.adRevenue);
          store f into 'output/PIG_bench/html_join';
          

           uservisits_aggre.pig:

          Same comments as above on projecting out as early as possible and on setting parallel appropriately for your cluster.

          rmf output/PIG_bench/uservisits_aggre;
          a = load '/data/uservisits/*' using PigStorage('|') as 
          	(sourceIP,destURL,visitDate,adRevenue,userAgent,countryCode,languageCode,searchWord,duration);
          a1 = foreach a generate sourceIP, adRevenue;
           b = group a1 by sourceIP parallel 50;
           c = FOREACH b GENERATE group, SUM(a1.adRevenue);
          store c into 'output/PIG_bench/uservisits_aggre';
          
          Zheng Shao added a comment -

           Q: Why is the Hive program faster than the Hadoop app for the first query?
           A: This is definitely possible in a lot of situations.
           This particular case is mainly because Hive's implementation of LIKE uses Text, while the hadoop app's implementation used String.find(). We used the hadoop code from the SIGMOD 2009 paper to allow us to have a consistent comparison.
           While it's possible to improve the hadoop code in this particular case, there are cases where it's very hard to do the same optimization for each and every hadoop application. For example, the map-side join (HIVE-195) provides much better efficiency for joining a very small table with any other table, without using a reducer. Another case is that Hive's object model is different from Hadoop's - we reuse the same object across different rows. Details are in the org.apache.hadoop.hive.serde package.
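
           As a rough illustration of the Text-vs-String point (this is not Hive's actual LIKE code, just a sketch of the allocation difference): matching against the Text's backing bytes avoids materializing a new String for every row, whereas toString()-based matching allocates per record.

           import org.apache.hadoop.io.Text;

           public class SubstringMatchSketch {
               // Allocates a new String (and its char[]) for every record before searching.
               static boolean matchesViaString(Text value, String pattern) {
                   return value.toString().indexOf(pattern) != -1;
               }

               // Scans the Text's backing byte array directly (fine for an ASCII pattern
               // such as "XYZ"), so no per-record String is created.
               static boolean matchesViaBytes(Text value, byte[] pattern) {
                   byte[] data = value.getBytes();   // only the first getLength() bytes are valid
                   int len = value.getLength();
                   outer:
                   for (int i = 0; i + pattern.length <= len; i++) {
                       for (int j = 0; j < pattern.length; j++) {
                           if (data[i + j] != pattern[j]) {
                               continue outer;
                           }
                       }
                       return true;
                   }
                   return false;
               }
           }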

          Zheng Shao added a comment -

          @hive_benchmark_2009-06-18.pdf and hive_benchmark_2009-06-18.tar.gz
          Note: Both the dataset and the queries are adapted from http://database.cs.brown.edu/projects/mapreduce-vs-dbms/

          Zheng Shao made changes -
          Attachment hive_benchmark_2009-06-18.pdf [ 12411185 ]
          Zheng Shao made changes -
          Field Original Value New Value
          Attachment hive_benchmark_2009-06-18.tar.gz [ 12411184 ]
          Zheng Shao created issue -

            People

            • Assignee:
              Yuntao Jia
              Reporter:
              Zheng Shao
             • Votes:
               0
               Watchers:
               31
