diff --git README.txt README.txt
index 54e12ca..ca72fd3 100644
--- README.txt
+++ README.txt
@@ -1,39 +1,53 @@
 How to Hive
 -----------
 
-DISCLAIMER: This is a prototype version of Hive and is NOT production
-quality. This is provided mainly as a way of illustrating the capabilities
-of Hive and is provided as-is. However - we are working hard to make
-Hive a production quality system. Hive has only been tested on unix(linux)
-and mac systems using Java 1.6 for now - although it may very well work
-on other similar platforms. It does not work on Cygwin right now. Most of
-our testing has been on Hadoop 0.17 - so we would advise running it against
-this version of hadoop - even though it may compile/work against other versions
+DISCLAIMER: This is a prototype version of Hive and is NOT production
+quality. It is provided as-is, mainly as a way of illustrating the
+capabilities of Hive. However - we are working hard to make Hive a
+production quality system. Hive has only been tested on Unix (Linux)
+and Mac OS X using Java 1.6 for now - although it may very well work
+on other similar platforms. It does not work on Cygwin. Most of our
+testing has been on Hadoop 0.17 - so we advise running it against
+this version of Hadoop - even though it may compile/work against
+other versions.
 
 Useful mailing lists
 --------------------
 
-1. hive-user@hadoop.apache.org - To discuss and ask usage questions. Send an
-   empty email to hive-user-subscribe@hadoop.apache.org in order to subscribe
-   to this mailing list.
+1. hive-user@hadoop.apache.org - To discuss and ask usage
+   questions. Send an empty email to
+   hive-user-subscribe@hadoop.apache.org in order to subscribe to this
+   mailing list.
 
-2. hive-core@hadoop.apache.org - For discussions around code, design and
-   features. Send an empty email to hive-core-subscribe@hadoop.apache.org in
-   order to subscribe to this mailing list.
+2. hive-core@hadoop.apache.org - For discussions around code, design
+   and features. Send an empty email to
+   hive-core-subscribe@hadoop.apache.org in order to subscribe to this
+   mailing list.
+
+3. hive-commits@hadoop.apache.org - In order to monitor commits to
+   the source repository. Send an empty email to
+   hive-commits-subscribe@hadoop.apache.org in order to subscribe to
+   this mailing list.
 
-3. hive-commits@hadoop.apache.org - In order to monitor the commits to the
-   source repository. Send an empty email to hive-commits-subscribe@hadoop.apache.org in
-   order to subscribe to this mailing list.
-
 Downloading and building
 ------------------------
 
 - svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive_trunk
 - cd hive_trunk
-- hive_trunk> ant -Dtarget.dir=<install_dir> -Dhadoop.version='0.17.0' package
+- hive_trunk> ant package -Dtarget.dir=<install_dir> -Dhadoop.version=0.17.2.1
+
+If you do not specify a value for target.dir, the default value
+build/dist is used. Then you can run the unit tests if you want to
+make sure the build is correct:
+
+- hive_trunk> ant test -Dtarget.dir=<install_dir> -Dhadoop.version=0.17.2.1 \
+      -logfile test.log
+
+You can replace 0.17.2.1 with 0.18.3, 0.19.0, etc. to match the
+version of hadoop that you are using. If you do not specify
+hadoop.version, the default value specified in the build.properties
+file is used.
 
-You can replace 0.17.0 with 0.18.1, 0.19.0 etc to match the version of hadoop
-that you are using.
 In the rest of the README, we use dist and <install_dir> interchangeably.
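+
+Once the build finishes you can bring up the CLI from the install
+directory (a sketch - assuming the default target.dir of build/dist;
+adjust the path if you overrode it):
+
+- hive_trunk> cd build/dist
+- dist> bin/hive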
@@ -54,14 +68,14 @@ Using Hive
 
 Configuration management overview
 ---------------------------------
 
-- hive configuration is stored in <install_dir>/conf/hive-default.xml
+- hive configuration is stored in <install_dir>/conf/hive-default.xml
+  and the log4j configuration in hive-log4j.properties
-- hive configuration is an overlay on top of hadoop - meaning the 
+- hive configuration is an overlay on top of hadoop - meaning the
   hadoop configuration variables are inherited by default.
 - hive configuration can be manipulated by:
-  o editing hive-default.xml and defining any desired variables 
+  o editing hive-default.xml and defining any desired variables
     (including hadoop variables) in it
-  o from the cli using the set command (see below)
+  o from the cli using the set command (see below)
   o by invoking hive using the syntax:
     * bin/hive -hiveconf x1=y1 -hiveconf x2=y2
       this sets the variables x1 and x2 to y1 and y2
 
@@ -69,18 +83,18 @@ Configuration management overview
 
 Error Logs
 ----------
-Hive uses log4j for logging. By default logs are not emitted to the 
+Hive uses log4j for logging. By default logs are not emitted to the
 console by the cli. They are stored in the file:
 - /tmp/{user.name}/hive.log
-If the user wishes - the logs can be emitted to the console by adding 
+If the user wishes - the logs can be emitted to the console by adding
 the arguments shown below:
 - bin/hive -hiveconf hive.root.logger=INFO,console
-Note that setting hive.root.logger via the 'set' command does not 
+Note that setting hive.root.logger via the 'set' command does not
 change logging properties since they are determined at initialization
 time.
-Error logs are very useful to debug problems. Please send them with 
+Error logs are very useful for debugging problems. Please send them with
 any bugs (of which there are many!) to the JIRA at
 https://issues.apache.org/jira/browse/HIVE.
 
@@ -92,21 +106,21 @@ Creating Hive tables and browsing through them
 
 hive> CREATE TABLE pokes (foo INT, bar STRING);
 
-Creates a table called pokes with two columns, first being an 
-integer and other a string columns
+Creates a table called pokes with two columns, the first being an
+integer and the other a string column.
 
 hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
 
-Creates a table called invites with two columns and a partition column
-called ds. The partition column is a virtual column. It is not part
-of the data itself, but is derived from the partition that a particular
-dataset is loaded into.
+Creates a table called invites with two columns and a partition column
+called ds. The partition column is a virtual column. It is not part
+of the data itself, but is derived from the partition that a
+particular dataset is loaded into.
 
-By default tables are assumed to be of text input format and the 
-delimiters are assumed to be ^A(ctrl-a). We will be soon publish additional 
-commands/recipes to add binary (sequencefiles) data and configurable 
-delimiters etc.
+By default tables are assumed to be of text input format and the
+delimiters are assumed to be ^A (ctrl-a). We will soon publish
+additional commands/recipes to add binary (sequencefile) data and
+configurable delimiters etc.
 
 hive> SHOW TABLES;
 
@@ -114,8 +128,8 @@ lists all the tables
 
 hive> SHOW TABLES '.*s';
 
-lists all the table that end with 's'. The pattern matching follows Java
-regular expressions. Check out this link for documentation
+lists all the tables that end with 's'. The pattern matching follows
+Java regular expressions.
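+
+For example, with just the two tables created above (a sketch - the
+pattern is a Java regular expression, not a shell glob):
+
+hive> SHOW TABLES 'pok.*';
+
+would list only the pokes table.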
+Check out this link for documentation on the pattern syntax:
 http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html
 
 hive> DESCRIBE invites;
 
@@ -140,18 +154,18 @@ hive> DROP TABLE pokes;
 
 Metadata Store
 --------------
 
-Metadata is in an embedded Derby database whose location is determined by the 
-hive configuration variable named javax.jdo.option.ConnectionURL. By default 
+Metadata is in an embedded Derby database whose location is determined by the
+hive configuration variable named javax.jdo.option.ConnectionURL. By default
 (see conf/hive-default.xml) - this location is ./metastore_db
 
-Right now - in the default configuration, this metadata can only be seen by 
-one user at a time. 
+Right now - in the default configuration, this metadata can only be seen by
+one user at a time.
 
-Metastore can be stored in any database that is supported by JPOX. The 
-location and the type of the RDBMS can be controlled by the two variables 
-'javax.jdo.option.ConnectionURL' and 'javax.jdo.option.ConnectionDriverName'. 
-Refer to JDO (or JPOX) documentation for more details on supported databases. 
-The database schema is defined in JDO metadata annotations file package.jdo 
+The metastore can be stored in any database that is supported by JPOX. The
+location and the type of the RDBMS can be controlled by the two variables
+'javax.jdo.option.ConnectionURL' and 'javax.jdo.option.ConnectionDriverName'.
+Refer to the JDO (or JPOX) documentation for details on supported databases.
+The database schema is defined in the JDO metadata annotations file package.jdo
 at src/contrib/hive/metastore/src/model.
 
 In the future - the metastore itself can be a standalone server.
 
@@ -162,27 +176,35 @@ DML Operations
 
 Loading data from flat files into Hive
 
-hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes; 
+hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
 
-Loads a file that contains two columns separated by ctrl-a into pokes table.
-'local' signifies that the input file is on the local system. If 'local' 
-is omitted then it looks for the file in HDFS.
+Loads a file that contains two columns separated by ctrl-a into the
+pokes table. 'local' signifies that the input file is on the local
+file system. If 'local' is omitted then Hive looks for the file in
+HDFS.
 
-The keyword 'overwrite' signifies that existing data in the table is deleted.
-If the 'overwrite' keyword is omitted - then data files are appended to existing data sets.
+The keyword 'overwrite' signifies that existing data in the table is
+deleted. If the 'overwrite' keyword is omitted - then data files are
+appended to existing data sets.
 
 NOTES:
+
 - NO verification of data against the schema
-- if the file is in hdfs it is moved into hive controlled file system namespace.
-  The root of the hive directory is specified by the option hive.metastore.warehouse.dir
-  in hive-default.xml. We would advise that this directory be pre-existing before
-  trying to create tables via Hive.
-hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
-hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
+- If the file is in HDFS it is moved into the Hive-controlled file
+  system namespace. The root of the hive directory is specified by
+  the option hive.metastore.warehouse.dir in hive-default.xml. We
+  advise that this directory exist before you try to create tables
+  via Hive.
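+
+A quick way to sanity-check a load is to count the rows (a sketch -
+using COUNT(1) rather than COUNT(*), which does not work yet; see
+KNOWN BUGS/ISSUES below):
+
+hive> SELECT COUNT(1) FROM pokes;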
+
+hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' \
+        OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
-The two LOAD statements above load data into two different partitions of the table
-invites. Table invites must be created as partitioned by the key ds for this to succeed.
+
+hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' \
+        OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
+
+The two LOAD statements above load data into two different partitions
+of the table invites. The table invites must be created as
+partitioned by the key ds for this to succeed.
 
 Loading/Extracting data using Queries
 -------------------------------------
 
@@ -190,48 +212,56 @@ Loading/Extracting data using Queries
 
 Runtime configuration
 ---------------------
 
-- Hives queries are executed using map-reduce queries and as such the behavior
-  of such queries can be controlled by the hadoop configuration variables
+- Hive queries are executed as map-reduce jobs and as such their
+  behavior can be controlled by the hadoop configuration variables
+
+- The cli can be used to set any hadoop (or hive) configuration
+  variable. For example:
+
+    hive> SET mapred.job.tracker=myhost.mycompany.com:50030
+    hive> SET -v
 
-- The cli can be used to set any hadoop (or hive) configuration variable. For example:
-    o hive> SET mapred.job.tracker=myhost.mycompany.com:50030
-    o hive> SET -v
-  The latter shows all the current settings. Without the v option only the
-  variables that differ from the base hadoop configuration are displayed
-- In particular the number of reducers should be set to a reasonable number
-  to get good performance (the default is 1!)
+  The latter shows all the current settings. Without the -v option
+  only the variables that differ from the base hadoop configuration
+  are displayed.
+
+- In particular the number of reducers should be set to a reasonable
+  number to get good performance (the default is 1!)
 
 EXAMPLE QUERIES
 ---------------
 
-Some example queries are shown below. More are available in the hive code:
-ql/src/test/queries/{positive,clientpositive}.
+Some example queries are shown below. More are available in the hive
+code: ql/src/test/queries/{positive,clientpositive}.
 
 SELECTS and FILTERS
 -------------------
 
 hive> SELECT a.foo FROM invites a;
 
-select column 'foo' from all rows of invites table. The results are not
-stored anywhere, but are displayed on the console.
+Select column 'foo' from all rows of the invites table. The results
+are not stored anywhere, but are displayed on the console.
 
-Note that in all the examples that follow, INSERT (into a hive table, local
-directory or HDFS directory) is optional.
+Note that in all the examples that follow, INSERT (into a hive table,
+local directory or HDFS directory) is optional.
 
 hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a;
 
-select all rows from invites table into an HDFS directory. The result data
-is in files (depending on the number of mappers) in that directory.
-NOTE: partition columns if any are selected by the use of *. They can also
-be specified in the projection clauses.
+Select all rows from the invites table into an HDFS directory. The
+result data is in files (depending on the number of mappers) in that
+directory.
+
+NOTE: partition columns, if any, are selected by the use of *. They
+can also be specified in the projection clauses, as in the example
+below.
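+
+For example, to project the virtual partition column explicitly (a
+sketch - assumes the invites partitions loaded above):
+
+hive> SELECT a.foo, a.ds FROM invites a WHERE a.ds='2008-08-15';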
 hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
 
 Select all rows from pokes table into a local directory
 
 hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
-hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100; 
+hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
 hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
 hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
 hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a;
 
@@ -240,8 +270,9 @@ hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
 
 Sum of a column. avg, min, max can also be used
 
-NOTE: there are some flaws with the type system that cause doubles to be
-returned with integer types would be expected. We expect to fix these in the coming week.
+NOTE: there are some flaws with the type system that cause doubles to
+be returned where integer types would be expected. We expect to fix
+these in the coming week.
 
 GROUP BY
 --------
 
@@ -249,8 +280,9 @@ GROUP BY
 
 hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
 hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
 
-NOTE: Currently Hive always uses two stage map-reduce for groupby operation. This is
-to handle skews in input data. We will be optimizing this in the coming weeks.
+NOTE: Currently Hive always uses a two-stage map-reduce for the
+group-by operation. This is to handle skews in the input data. We
+will be optimizing this in the coming weeks.
 
 JOIN
 ----
 
@@ -267,24 +299,36 @@ INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key
 
 STREAMING
 ---------
 
-hive> FROM invites a INSERT OVERWRITE TABLE events 
-    > MAP a.foo, a.bar USING '/bin/cat' 
+hive> FROM invites a INSERT OVERWRITE TABLE events
+    > MAP a.foo, a.bar USING '/bin/cat'
     > AS oof, rab WHERE a.ds > '2008-08-09';
 
-This streams the data in the map phase through the script /bin/cat (like hadoop streaming).
-Similarly - streaming can be used on the reduce side. Please look for files mapreduce*.q.
+This streams the data in the map phase through the script /bin/cat
+(like hadoop streaming). Similarly, streaming can be used on the
+reduce side. Please look at the files mapreduce*.q for examples.
 
 KNOWN BUGS/ISSUES
 -----------------
-* hive cli may hang for a couple of minutes because of a bug in getting metadata
-  from the derby database. let it run and you'll be fine!
-* hive cli creates derby.log in the directory from which it has been invoked.
+
+* The Hive cli may hang for a couple of minutes because of a bug in
+  getting metadata from the derby database. Let it run and you'll be
+  fine!
+
+* The Hive cli creates derby.log in the directory from which it has
+  been invoked.
+
 * COUNT(*) does not work for now. Use COUNT(1) instead.
+
 * ORDER BY not supported yet.
+
 * CASE not supported yet.
-* Only string and thrift types (http://developers.facebook.com/thrift) have been tested.
-* When doing Join, please put the table with big number of rows containing the same join key to
-the rightmost in the JOIN clause. Otherwise we may see OutOfMemory errors.
+
+* Only string and Thrift types (http://developers.facebook.com/thrift)
+  have been tested.
+
+* When doing a join, please put the table with the largest number of
+  rows per join key rightmost in the JOIN clause. Otherwise we may
+  see OutOfMemory errors (see the example below).
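+
+  For example, if invites has many more rows per value of bar than
+  pokes does (a sketch - assumes the tables created above):
+
+  hive> FROM pokes p JOIN invites i ON (p.bar = i.bar)
+      > INSERT OVERWRITE TABLE events SELECT p.bar, p.foo, i.foo;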
 
 FUTURE FEATURES
 ---------------
 
@@ -295,123 +339,22 @@ Developing Hive using Eclipse
 ------------------------
 
 1. Follow the 3 steps in "Downloading and building" section above
 
-2. Change the first line in conf/hive-log4j.properties to the following
-   line to see error messages on the console.
-hive.root.logger=INFO,console
-
-3. Run tests to make sure everything works. It may take 20 minutes.
-ant -Dhadoop.version='0.17.0' -logfile test.log test"
-
-4. Create an empty java project in Eclipse and close it.
-
-5. Add the following section to Eclipse project's .project file:
-
-  <linkedResources>
-    <link>
-      <name>cli_src_java</name>
-      <type>2</type>
-      <location>/path_to_hive_trunk/cli/src/java</location>
-    </link>
-    <link>
-      <name>common_src_java</name>
-      <type>2</type>
-      <location>/path_to_hive_trunk/common/src/java</location>
-    </link>
-    <link>
-      <name>java</name>
-      <type>2</type>
-      <location>/path_to_hive_trunk/shims/src/0.20/java</location>
-    </link>
-    <link>
-      <name>metastore_src_gen-javabean</name>
-      <type>2</type>
-      <location>/path_to_hive_trunk/metastore/src/gen-javabean</location>
-    </link>
-    <link>
-      <name>metastore_src_java</name>
-      <type>2</type>
-      <location>/path_to_hive_trunk/metastore/src/java</location>
-    </link>
-    <link>
-      <name>metastore_src_model</name>
-      <type>2</type>
-      <location>/path_to_hive_trunk/metastore/src/model</location>
-    </link>
-    <link>
-      <name>ql_src_java</name>
-      <type>2</type>
-      <location>/path_to_hive_trunk/ql/src/java</location>
-    </link>
-    <link>
-      <name>ql_src_gen-javabean</name>
-      <type>2</type>
-      <location>/path_to_hive_trunk/ql/src/gen-javabean</location>
-    </link>
-    <link>
-      <name>ql_src_gen-java</name>
-      <type>2</type>
-      <location>/path_to_hive_trunk/build/ql/gen-java</location>
-    </link>
-    <link>
-      <name>serde_src_gen-java</name>
-      <type>2</type>
-      <location>/path_to_hive_trunk/serde/src/gen-java</location>
-    </link>
-    <link>
-      <name>serde_src_java</name>
-      <type>2</type>
-      <location>/path_to_hive_trunk/serde/src/java</location>
-    </link>
-    <link>
-      <name>shims_src_common_java</name>
-      <type>2</type>
-      <location>/path_to_hive_trunk/shims/src/common/java</location>
-    </link>
-    <link>
-      <name>shims_src_0.20_java</name>
-      <type>2</type>
-      <location>/path_to_hive_trunk/shims/src/0.20/java</location>
-    </link>
-  </linkedResources>
-
-6. Add the following list to the Eclipse project's .classpath file:
-
-   [the <classpathentry> elements of this step are not recoverable in
-   this copy of the patch]
-
-7. Try building hive inside Eclipse, and develop using Eclipse.
+2. Generate test files for Eclipse:
+
+   hive_trunk> cd metastore; ant model-jar; cd ..
+   hive_trunk> cd ql; ant gen-test; cd ..
+   hive_trunk> ant eclipse-files
+
+   This will generate .project, .settings, .externalToolBuilders,
+   .classpath, and a set of *.launch files for Eclipse-based
+   debugging.
+
+3. Launch Eclipse and import the Hive project by navigating to the
+   File->Import->General->Existing Projects menu and selecting
+   the hive_trunk directory.
+
+4. Run the CLI or a set of test cases by right-clicking on one of the
+   *.launch configurations and selecting Run or Debug.
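+
+   The same tests can also be run from the command line (a sketch -
+   assuming the -Dtestcase and -Dqfile options of the Hive ant test
+   target; groupby1.q is one of the queries under
+   ql/src/test/queries/clientpositive):
+
+   hive_trunk> ant test -Dtestcase=TestCliDriver -Dqfile=groupby1.q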
 
 Development Tips
 ------------------------
 
diff --git build.xml build.xml
index 40da9f9..3d4b3fa 100644
--- build.xml
+++ build.xml
@@ -342,6 +342,7 @@
+  [one added line of XML; markup not recoverable in this copy of the patch]
diff --git data/conf/hive-site.xml data/conf/hive-site.xml
index 71d4d7b..0db0487 100644
--- data/conf/hive-site.xml
+++ data/conf/hive-site.xml
@@ -103,7 +103,7 @@
   <name>hive.jar.path</name>
-  <value>${build.dir.hive}/ql/hive_exec.jar</value>
+  <value>${build.dir.hive}/ql/hive-exec-${version}.jar</value>
   [remaining context lines of this hunk not recoverable in this copy]
diff --git eclipse-templates/HiveCLI.launchtemplate eclipse-templates/HiveCLI.launchtemplate
new file mode 100644
index 0000000..83b5195
--- /dev/null
+++ eclipse-templates/HiveCLI.launchtemplate
@@ -0,0 +1,32 @@
+  [32 lines of Eclipse launch-configuration XML; markup not recoverable in this copy]
diff --git eclipse-templates/TestCliDriver.launchtemplate eclipse-templates/TestCliDriver.launchtemplate
index 7beff5b..8833251 100644
--- eclipse-templates/TestCliDriver.launchtemplate
+++ eclipse-templates/TestCliDriver.launchtemplate
@@ -1,23 +1,26 @@
-  [23 lines of launch-configuration XML removed; markup not recoverable in this copy]
+  [26 lines of launch-configuration XML added; markup not recoverable in this copy]
diff --git eclipse-templates/TestHive.launchtemplate eclipse-templates/TestHive.launchtemplate
index 5533a6b..4d740ad 100644
--- eclipse-templates/TestHive.launchtemplate
+++ eclipse-templates/TestHive.launchtemplate
@@ -1,23 +1,26 @@
-  [23 lines of launch-configuration XML removed; markup not recoverable in this copy]
+  [26 lines of launch-configuration XML added; markup not recoverable in this copy]
diff --git eclipse-templates/TestJdbc.launchtemplate eclipse-templates/TestJdbc.launchtemplate
index 4407d91..2f75b17 100644
--- eclipse-templates/TestJdbc.launchtemplate
+++ eclipse-templates/TestJdbc.launchtemplate
@@ -1,23 +1,27 @@
-  [23 lines of launch-configuration XML removed; markup not recoverable in this copy]
+  [27 lines of launch-configuration XML added; markup not recoverable in this copy]
diff --git eclipse-templates/TestMTQueries.launchtemplate eclipse-templates/TestMTQueries.launchtemplate
index 30d1465..d21b579 100644
--- eclipse-templates/TestMTQueries.launchtemplate
+++ eclipse-templates/TestMTQueries.launchtemplate
@@ -1,23 +1,26 @@
-  [23 lines of launch-configuration XML removed; markup not recoverable in this copy]
+  [26 lines of launch-configuration XML added; markup not recoverable in this copy]
diff --git eclipse-templates/TestTruncate.launchtemplate eclipse-templates/TestTruncate.launchtemplate
index b9f494c..fcde6b8 100644
--- eclipse-templates/TestTruncate.launchtemplate
+++ eclipse-templates/TestTruncate.launchtemplate
@@ -1,23 +1,26 @@
-  [23 lines of launch-configuration XML removed; markup not recoverable in this copy]
+  [26 lines of launch-configuration XML added; markup not recoverable in this copy]