  1. Hive
  2. HIVE-6098

Merge Tez branch into trunk

Details

    • New Feature
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 0.12.0
    • 0.13.0
    • None
    • None
      Here are the instructions for setting up Tez on your hadoop 2 cluster: https://github.com/apache/incubator-tez/blob/branch-0.2.0/INSTALL.txt

      Notes:

      - I start Hive with "hive -hiveconf hive.execution.engine=tez". This is not strictly necessary, but it starts the AM/containers right away instead of on the first query.
      - hive-exec jar should be copied to hdfs:///user/hive/ (location can be changed with: hive.jar.directory). This avoids re-localization of the hive jar.

      Hive settings:

      // needed because SMB isn't supported on tez yet
      set hive.optimize.bucketmapjoin=false;
      set hive.optimize.bucketmapjoin.sortedmerge=false;
      set hive.auto.convert.sortmerge.join=false;
      set hive.auto.convert.sortmerge.join.noconditionaltask=false;
      set hive.auto.convert.join.noconditionaltask=true;

      // depends on your available mem/cluster, but the map and reduce memory.mb values should be set to the same value to allow container reuse
      set hive.auto.convert.join.noconditionaltask.size=64000000;
      set mapred.map.child.java.opts=-server -Xmx3584m -Djava.net.preferIPv4Stack=true;
      set mapred.reduce.child.java.opts=-server -Xmx3584m -Djava.net.preferIPv4Stack=true;
      set mapreduce.map.memory.mb=4096;
      set mapreduce.reduce.memory.mb=4096;

      // generic opts
      set hive.optimize.reducededuplication.min.reducer=1;
      set hive.optimize.mapjoin.mapreduce=true;

      // autogather might require you to up the max number of counters, if you run into issues
      set hive.stats.autogather=true;
      set hive.stats.dbclass=counter;

      // tez settings can also go into tez-site if desired
      set mapreduce.map.output.compress=true;
      set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.DefaultCodec;
      set tez.runtime.intermediate-output.should-compress=true;
      set tez.runtime.intermediate-output.compress.codec=org.apache.hadoop.io.compress.DefaultCodec;
      set tez.runtime.intermediate-input.is-compressed=true;
      set tez.runtime.intermediate-input.compress.codec=org.apache.hadoop.io.compress.DefaultCodec;

      // tez groups in the AM
      set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

      set hive.orc.splits.include.file.footer=true;

      set hive.root.logger=ERROR,console;
      set hive.execution.engine=tez;
      set hive.vectorized.execution.enabled=true;
      set hive.exec.local.cache=true;
      set hive.compute.query.using.stats=true;

      for tez:

        <property>
          <name>tez.am.resource.memory.mb</name>
          <value>8192</value>
        </property>
        <property>
          <name>tez.am.java.opts</name>
          <value>-server -Xmx7168m -Djava.net.preferIPv4Stack=true</value>
        </property>
        <property>
          <name>tez.am.grouping.min-size</name>
          <value>16777216</value>
        </property>
        <!-- Client Submission timeout value when submitting DAGs to a session -->
        <property>
          <name>tez.session.client.timeout.secs</name>
          <value>-1</value>
        </property>
        <!-- prewarm stuff -->
        <property>
          <name>tez.session.pre-warm.enabled</name>
          <value>true</value>
        </property>

        <property>
          <name>tez.session.pre-warm.num.containers</name>
          <value>10</value>
        </property>
        <property>
          <name>tez.am.grouping.split-waves</name>
          <value>0.9</value>
        </property>

        <property>
          <name>tez.am.container.reuse.enabled</name>
          <value>true</value>
        </property>
        <property>
          <name>tez.am.container.reuse.rack-fallback.enabled</name>
          <value>true</value>
        </property>
        <property>
          <name>tez.am.container.reuse.non-local-fallback.enabled</name>
          <value>true</value>
        </property>
        <property>
          <name>tez.am.container.session.delay-allocation-millis</name>
          <value>-1</value>
        </property>
        <property>
          <name>tez.am.container.reuse.locality.delay-allocation-millis</name>
          <value>250</value>
        </property>
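The sizing advice in the Hive settings above (the same memory.mb for map and reduce, with the JVM heap below the container size) can be sanity-checked before starting a session. The following is only a sketch: the helper names `xmx_mb` and `check_container_sizing` are hypothetical, and the rule that -Xmx must stay below the container size is an assumption about YARN containers, not something stated in this ticket.

```python
import re


def xmx_mb(java_opts):
    """Return the -Xmx value in MB from a JVM option string, or None if absent."""
    match = re.search(r"-Xmx(\d+)m\b", java_opts)
    return int(match.group(1)) if match else None


def check_container_sizing(conf):
    """Return a list of problems; an empty list means the settings look consistent."""
    problems = []
    map_mb = int(conf["mapreduce.map.memory.mb"])
    reduce_mb = int(conf["mapreduce.reduce.memory.mb"])
    # The ticket asks for equal map/reduce sizes so containers can be reused.
    if map_mb != reduce_mb:
        problems.append("map and reduce memory.mb differ; container reuse may suffer")
    for side, container_mb in (("map", map_mb), ("reduce", reduce_mb)):
        heap = xmx_mb(conf[f"mapred.{side}.child.java.opts"])
        if heap is None:
            problems.append(f"{side}: no -Xmx found in java opts")
        elif heap >= container_mb:
            # Assumption: the heap has to leave room for non-heap memory.
            problems.append(f"{side}: heap {heap} MB does not fit in {container_mb} MB")
    return problems


# Values copied from the settings listed in this ticket.
conf = {
    "mapreduce.map.memory.mb": "4096",
    "mapreduce.reduce.memory.mb": "4096",
    "mapred.map.child.java.opts": "-server -Xmx3584m -Djava.net.preferIPv4Stack=true",
    "mapred.reduce.child.java.opts": "-server -Xmx3584m -Djava.net.preferIPv4Stack=true",
}
print(check_container_sizing(conf))
```

With the values from this ticket the heap is 7/8 of the container (3584 of 4096 MB for tasks, 7168 of 8192 MB for the AM). That ratio is read off the numbers here, not a documented recommendation.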

    Description

      I think the Tez branch is at a point where we can consider merging it back into trunk after review.

      Tez itself has had its first release, most Hive features are available on Tez, and the test coverage is decent. There are a few known limitations, all of which can be handled in trunk as far as I can tell (i.e., none of them are large, disruptive changes that still require a branch).

      Limitations:

      • Union all is not yet supported on Tez
      • SMB is not yet supported on Tez
      • Bucketed map-join is executed as broadcast join (bucketing is ignored)

      Since the user is free to toggle hive.optimize.tez, it's obviously possible to just run these on MR.
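For queries that hit one of these limitations, a wrapper script could pick the engine per invocation. The sketch below assumes the hive.execution.engine property used in the release note on this ticket (with the values tez and mr) rather than hive.optimize.tez; which property applies depends on the build. The function name `hive_command` and the table names in the query are hypothetical.

```python
import shlex


def hive_command(query, engine="tez"):
    """Build the argv for one Hive CLI invocation on the chosen engine."""
    if engine not in ("tez", "mr"):
        raise ValueError(f"unsupported engine: {engine}")
    # -hiveconf sets a property for this session; -e runs a single query string.
    return ["hive", "-hiveconf", f"hive.execution.engine={engine}", "-e", query]


# A UNION ALL query is sent to MR because it is not yet supported on Tez.
# Table names t1 and t2 are placeholders.
cmd = hive_command("SELECT key FROM t1 UNION ALL SELECT key FROM t2", engine="mr")
print(shlex.join(cmd))
```

Building the command as a list (and quoting it only for display) avoids shell quoting problems if the list is later passed to subprocess.run.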

      I am hoping to follow the approach that was taken with vectorization and shoot for a merge instead of a single commit. This would retain the history of the branch. Also, for vectorization we required at least three +1s before the merge; I'm hoping to go with that as well.

      I will add a combined patch to this ticket for review purposes (not for commit). I'll also attach instructions to run on a cluster if anyone wants to try.

      Attachments

        1. HIVE-6098.1.patch
          4.59 MB
          Gunther Hagleitner
        2. hive-on-tez-conf.txt
          3 kB
          Gunther Hagleitner
        3. HIVE-6098.2.patch
          4.58 MB
          Gunther Hagleitner
        4. HIVE-6098.3.patch
          4.58 MB
          Gunther Hagleitner
        5. HIVE-6098.4.patch
          4.58 MB
          Gunther Hagleitner
        6. HIVE-6098.5.patch
          4.60 MB
          Gunther Hagleitner
        7. HIVE-6098.6.patch
          4.60 MB
          Gunther Hagleitner
        8. HIVE-6098.7.patch
          4.60 MB
          Gunther Hagleitner
        9. HIVE-6098.8.patch
          1.52 MB
          Gunther Hagleitner
        10. HIVE-6098.9.patch
          1.52 MB
          Gunther Hagleitner
        11. HIVE-6098.10.patch
          1.52 MB
          Gunther Hagleitner

            People

              hagleitn Gunther Hagleitner
              hagleitn Gunther Hagleitner