Hive / HIVE-6098

Merge Tez branch into trunk


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.12.0
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels: None
      Here are the instructions for setting up Tez on your hadoop 2 cluster: https://github.com/apache/incubator-tez/blob/branch-0.2.0/INSTALL.txt

      Notes:

      - I start hive with "hive -hiveconf hive.execution.engine=tez". This isn't strictly necessary, but it starts the AM/containers right away instead of on the first query.
      - The hive-exec jar should be copied to hdfs:///user/hive/ (the location can be changed with hive.jar.directory). This avoids re-localization of the hive jar on each query.
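
      A sketch of the jar upload step (the jar filename, version, and local path below are assumptions; use the hive-exec jar from your own build):

      ```
      # Upload the hive-exec jar so Tez containers can localize it from HDFS
      # instead of re-uploading it per session. Jar name/version illustrative.
      hadoop fs -mkdir -p /user/hive
      hadoop fs -put lib/hive-exec-0.13.0-SNAPSHOT.jar /user/hive/

      # If using a non-default location, point hive.jar.directory at it:
      hive -hiveconf hive.jar.directory=hdfs:///user/hive/ -hiveconf hive.execution.engine=tez
      ```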

      Hive settings:

      // needed because SMB isn't supported on tez yet
      set hive.optimize.bucketmapjoin=false;
      set hive.optimize.bucketmapjoin.sortedmerge=false;
      set hive.auto.convert.sortmerge.join=false;
      set hive.auto.convert.sortmerge.join.noconditionaltask=false;
      set hive.auto.convert.join.noconditionaltask=true;

      // depends on your available memory/cluster, but map and reduce memory.mb should be set to the same value to allow container reuse
      set hive.auto.convert.join.noconditionaltask.size=64000000;
      set mapred.map.child.java.opts=-server -Xmx3584m -Djava.net.preferIPv4Stack=true;
      set mapred.reduce.child.java.opts=-server -Xmx3584m -Djava.net.preferIPv4Stack=true;
      set mapreduce.map.memory.mb=4096;
      set mapreduce.reduce.memory.mb=4096;

      // generic opts
      set hive.optimize.reducededuplication.min.reducer=1;
      set hive.optimize.mapjoin.mapreduce=true;

      // autogather with the counter dbclass might require raising the max number of counters if you run into issues
      set hive.stats.autogather=true;
      set hive.stats.dbclass=counter;
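
      If counter-based stats do hit the ceiling, the Hadoop 2 per-job counter limit can be raised; the value below is an illustrative assumption, not a recommendation:

      ```
      // assumption: raise Hadoop's per-job counter limit for counter-based stats
      set mapreduce.job.counters.max=1024;
      ```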

      // tez settings can also go into tez-site if desired
      set mapreduce.map.output.compress=true;
      set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.DefaultCodec;
      set tez.runtime.intermediate-output.should-compress=true;
      set tez.runtime.intermediate-output.compress.codec=org.apache.hadoop.io.compress.DefaultCodec;
      set tez.runtime.intermediate-input.is-compressed=true;
      set tez.runtime.intermediate-input.compress.codec=org.apache.hadoop.io.compress.DefaultCodec;

      // use HiveInputFormat so Tez can do split grouping in the AM
      set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

      set hive.orc.splits.include.file.footer=true;

      set hive.root.logger=ERROR,console;
      set hive.execution.engine=tez;
      set hive.vectorized.execution.enabled=true;
      set hive.exec.local.cache=true;
      set hive.compute.query.using.stats=true;

      For tez-site.xml (these properties go inside the <configuration> element):

        <property>
          <name>tez.am.resource.memory.mb</name>
          <value>8192</value>
        </property>
        <property>
          <name>tez.am.java.opts</name>
          <value>-server -Xmx7168m -Djava.net.preferIPv4Stack=true</value>
        </property>
        <property>
          <name>tez.am.grouping.min-size</name>
          <value>16777216</value>
        </property>
        <!-- Client Submission timeout value when submitting DAGs to a session -->
        <property>
          <name>tez.session.client.timeout.secs</name>
          <value>-1</value>
        </property>
        <!-- prewarm stuff -->
        <property>
          <name>tez.session.pre-warm.enabled</name>
          <value>true</value>
        </property>

        <property>
          <name>tez.session.pre-warm.num.containers</name>
          <value>10</value>
        </property>
        <property>
          <name>tez.am.grouping.split-waves</name>
          <value>0.9</value>
        </property>

        <property>
          <name>tez.am.container.reuse.enabled</name>
          <value>true</value>
        </property>
        <property>
          <name>tez.am.container.reuse.rack-fallback.enabled</name>
          <value>true</value>
        </property>
        <property>
          <name>tez.am.container.reuse.non-local-fallback.enabled</name>
          <value>true</value>
        </property>
        <property>
          <name>tez.am.container.session.delay-allocation-millis</name>
          <value>-1</value>
        </property>
        <property>
          <name>tez.am.container.reuse.locality.delay-allocation-millis</name>
          <value>250</value>
        </property>

    Description

      I think the Tez branch is at a point where we can consider merging it back into trunk after review.

      Tez itself has had its first release, most Hive features are available on Tez, and the test coverage is decent. There are a few known limitations, all of which can be handled in trunk as far as I can tell (i.e., none of them are large disruptive changes that still require a branch).

      Limitations:

      • Union all is not yet supported on Tez
      • SMB is not yet supported on Tez
      • Bucketed map-join is executed as broadcast join (bucketing is ignored)

      Since the user is free to toggle hive.optimize.tez, it's obviously possible to just run these on MR.
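
      As a sketch of that fallback (table names are hypothetical; hive.optimize.tez is the toggle as named on the branch at the time):

      ```
      -- run an unsupported construct (e.g. UNION ALL) on MR, then switch back
      set hive.optimize.tez=false;
      SELECT * FROM t1 UNION ALL SELECT * FROM t2;  -- t1/t2 are hypothetical tables
      set hive.optimize.tez=true;
      ```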

      I am hoping to follow the approach that was taken with vectorization and shoot for a merge instead of a single commit. This would retain the history of the branch. Also, in vectorization we required at least three +1s before merge; I'm hoping to go with that as well.

      I will add a combined patch to this ticket for review purposes (not for commit). I'll also attach instructions to run on a cluster if anyone wants to try.

      Attachments

        1. HIVE-6098.1.patch
          4.59 MB
          Gunther Hagleitner
        2. HIVE-6098.10.patch
          1.52 MB
          Gunther Hagleitner
        3. HIVE-6098.2.patch
          4.58 MB
          Gunther Hagleitner
        4. HIVE-6098.3.patch
          4.58 MB
          Gunther Hagleitner
        5. HIVE-6098.4.patch
          4.58 MB
          Gunther Hagleitner
        6. HIVE-6098.5.patch
          4.60 MB
          Gunther Hagleitner
        7. HIVE-6098.6.patch
          4.60 MB
          Gunther Hagleitner
        8. HIVE-6098.7.patch
          4.60 MB
          Gunther Hagleitner
        9. HIVE-6098.8.patch
          1.52 MB
          Gunther Hagleitner
        10. HIVE-6098.9.patch
          1.52 MB
          Gunther Hagleitner
        11. hive-on-tez-conf.txt
          3 kB
          Gunther Hagleitner

            People

              Assignee: hagleitn Gunther Hagleitner
              Reporter: hagleitn Gunther Hagleitner
              Votes: 0
              Watchers: 15
