Index: CHANGES.txt
===================================================================
--- CHANGES.txt (revision 0)
+++ CHANGES.txt (revision 0)
@@ -0,0 +1,17 @@

Hive Change Log

Trunk (unreleased changes)

  INCOMPATIBLE CHANGES

  NEW FEATURES

  IMPROVEMENTS

  OPTIMIZATIONS

  BUG FIXES

    HIVE-54. Create this file to record the patches going into a
    particular release. (dhruba)

Index: README
===================================================================
--- README (revision 713120)
+++ README (working copy)
@@ -1,353 +0,0 @@

How to Hive
-----------

DISCLAIMER: This is a prototype version of Hive and is NOT production
quality. It is provided as-is, mainly as a way of illustrating the
capabilities of Hive. However, we are working hard to make Hive a
production quality system. Hive has so far been tested only on Unix (Linux)
and Mac systems using Java 1.6, although it may well work on other similar
platforms. It does not work on Cygwin at the moment. Most of our testing
has been done on Hadoop 0.17, so we advise running it against that version
of Hadoop, even though it may compile and work against other versions.

Downloading and building
------------------------

You can either build the Hive jar files for your environment:
- apply the patch
- cd hadoop/src/contrib/hive
- src/contrib/hive> ant -Dtarget.dir=<target-directory> package

Or you can use the pre-packaged jars that are available at the following
location and untar them:
http://mirror.facebook.com/facebook/hive/hadoop-0.17/dist.tar.gz

$ wget http://mirror.facebook.com/facebook/hive/hadoop-0.17/dist.tar.gz
$ tar zxvf dist.tar.gz
$ cd dist
$ ls
README bin conf lib

bin/ (all the shell scripts)
lib/ (required jar files)
conf/ (configuration files)

In the rest of the README, we use dist and <install-dir> interchangeably.

Running Hive
------------

Hive uses Hadoop, which means:
- you must have hadoop in your path, OR
- export HADOOP=<hadoop-install-dir>/bin/hadoop

To use the Hive command line interface (cli) from the shell:
$ bin/hive

Using Hive
----------

Configuration management overview
---------------------------------

- Hive configuration is stored in <install-dir>/conf/hive-default.xml
  and log4j configuration in hive-log4j.properties.
- Hive configuration is an overlay on top of the Hadoop configuration,
  meaning the Hadoop configuration variables are inherited by default.
- Hive configuration can be manipulated by:
  o editing hive-default.xml and defining any desired variables
    (including Hadoop variables) in it
  o from the cli using the set command (see below)
  o by invoking hive using the syntax:
    * bin/hive -hiveconf x1=y1 -hiveconf x2=y2
      this sets the variables x1 and x2 to y1 and y2

Error Logs
----------
Hive uses log4j for logging. By default, logs are not emitted to the
console by the cli. They are stored in the file:
- /tmp/{user.name}/hive.log

If you wish, the logs can be emitted to the console by adding the
argument shown below:
- bin/hive -hiveconf hive.root.logger=INFO,console

Note that setting hive.root.logger via the 'set' command does not
change logging properties, since they are determined at initialization time.

Error logs are very useful for debugging problems. Please send them with
any bugs (of which there are many!) to athusoo@facebook.com.
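
As a concrete illustration of the configuration mechanisms above, the
hypothetical session below enables console logging and overrides one Hadoop
variable at startup (mapred.reduce.tasks is the Hadoop variable that
controls the number of reduce tasks; the value 4 is only an example), then
inspects the effective settings from the cli:

$ bin/hive -hiveconf hive.root.logger=INFO,console -hiveconf mapred.reduce.tasks=4
hive> SET
hive> SET -v

SET with no arguments shows only the variables that differ from the base
Hadoop configuration, while SET -v shows all the current settings,
including the inherited Hadoop values.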

DDL Operations
--------------

Creating Hive tables and browsing through them

hive> CREATE TABLE pokes (foo INT, bar STRING);

Creates a table called pokes with two columns, the first being an
integer and the other a string column.

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

Creates a table called invites with two columns and a partition column
called ds. The partition column is a virtual column: it is not part
of the data itself but is derived from the partition that a
particular dataset is loaded into.

By default, tables are assumed to be in text input format and the
delimiters are assumed to be ^A (ctrl-a). We will soon publish additional
commands/recipes for adding binary (sequencefile) data, configurable
delimiters, etc.

hive> SHOW TABLES;

lists all the tables.

hive> SHOW TABLES '.*s';

lists all the tables that end with 's'. The pattern matching follows Java
regular expressions. Check out this link for documentation:
http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html

hive> DESCRIBE invites;

shows the list of columns.

hive> DESCRIBE EXTENDED invites;

shows the list of columns plus any other meta information about the table.

Altering tables. Table names can be changed and additional columns can be added:

hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
hive> ALTER TABLE events RENAME TO 3koobecaf;

Dropping tables:
hive> DROP TABLE pokes;

Metadata Store
--------------

Metadata is stored in an embedded Derby database whose location is determined
by the hive configuration variable named javax.jdo.option.ConnectionURL. By
default (see conf/hive-default.xml), this location is ./metastore_db.

Right now, in the default configuration, this metadata can only be seen by
one user at a time.

The metastore can be stored in any database that is supported by JPOX. The
location and the type of the RDBMS can be controlled by the two variables
'javax.jdo.option.ConnectionURL' and 'javax.jdo.option.ConnectionDriverName'.
Refer to the JDO (or JPOX) documentation for more details on supported
databases. The database schema is defined in the JDO metadata annotations
file package.jdo at src/contrib/hive/metastore/src/model.

In the future, the metastore itself can become a standalone server.

DML Operations
--------------

Loading data from flat files into Hive

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

Loads a file that contains two columns separated by ctrl-a into the pokes
table. 'local' signifies that the input file is on the local file system. If
'local' is omitted, then it looks for the file in HDFS.

The keyword 'overwrite' signifies that existing data in the table is deleted.
If the 'overwrite' keyword is omitted, then data files are appended to
existing data sets.

NOTES:
- NO verification of data against the schema is performed.
- If the file is in HDFS, it is moved into the Hive-controlled file system
  namespace. The root of the Hive directory is specified by the option
  hive.metastore.warehouse.dir in hive-default.xml. We advise that this
  directory exist before you try to create tables via Hive.
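
To make the 'local' and 'overwrite' behavior described above concrete, the
hypothetical pair of statements below (the HDFS path is only an example)
first replaces the contents of pokes from a local file and then appends a
file that already resides in HDFS:

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
hive> LOAD DATA INPATH '/user/myuser/kv2.txt' INTO TABLE pokes;

The first statement copies the local file into the warehouse; the second,
because 'local' is omitted, moves the file out of its original HDFS location
and under hive.metastore.warehouse.dir, so the original HDFS copy no longer
exists afterwards.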

hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');

The two LOAD statements above load data into two different partitions of the
table invites. The table invites must have been created as partitioned by the
key ds for this to succeed.

Loading/Extracting data using Queries
-------------------------------------

Runtime configuration
---------------------

- Hive queries are executed as map-reduce jobs, and as such the behavior
  of such queries can be controlled by the Hadoop configuration variables.

- The cli can be used to set any Hadoop (or Hive) configuration variable. For example:
  o hive> SET mapred.job.tracker=myhost.mycompany.com:50030
  o hive> SET -v
  The latter shows all the current settings. Without the -v option, only the
  variables that differ from the base Hadoop configuration are displayed.
- In particular, the number of reducers should be set to a reasonable number
  to get good performance (the default is 1!).

EXAMPLE QUERIES
---------------

Some example queries are shown below. They are available in examples/queries.
More are available in the Hive contrib sources at src/test/queries/positive.

SELECTS and FILTERS
-------------------

hive> SELECT a.foo FROM invites a;

selects column 'foo' from all rows of the invites table. The results are not
stored anywhere, but are displayed on the console.

Note that in all the examples that follow, INSERT (into a Hive table, local
directory or HDFS directory) is optional.

hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a;

selects all rows from the invites table into an HDFS directory. The result
data is in files (depending on the number of mappers) in that directory.
NOTE: partition columns, if any, are selected by the use of *. They can also
be specified in the projection clauses.

hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

selects all rows from the pokes table into a local directory.

hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

The last statement computes the sum of a column. AVG, MIN and MAX can also
be used.

NOTE: there are some flaws with the type system that cause doubles to be
returned where integer types would be expected. We expect to fix these in
the coming week.

GROUP BY
--------

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

NOTE: Currently Hive always uses a two-stage map-reduce for the group-by
operation. This is done to handle skews in the input data. We will be
optimizing this in the coming weeks.
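
Tying the runtime-configuration advice above to the GROUP BY example, a
hypothetical session might first raise the number of reducers for the
session and then run the aggregation (mapred.reduce.tasks is the Hadoop
variable that controls the number of reduce tasks; 16 is only an example
value):

hive> SET mapred.reduce.tasks=16
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

The SET only affects the current cli session; the same effect can be
obtained at startup with bin/hive -hiveconf mapred.reduce.tasks=16.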

JOIN
----

hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

MULTITABLE INSERT
-----------------
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300

STREAMING
---------
hive> FROM invites a INSERT OVERWRITE TABLE events
    >   SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab)
    >   USING '/bin/cat' WHERE a.ds > '2008-08-09';

This streams the data in the map phase through the script /bin/cat (much like
Hadoop streaming). Similarly, streaming can be used on the reduce side
(please see the Hive Tutorial or the examples).

KNOWN BUGS/ISSUES
-----------------
* The hive cli may hang for a couple of minutes because of a bug in getting
  metadata from the Derby database. Let it run and you'll be fine!
* The hive cli creates derby.log in the directory from which it was invoked.
* COUNT(*) does not work for now. Use COUNT(1) instead.
* ORDER BY is not supported yet.
* Only string and thrift types (http://developers.facebook.com/thrift) have
  been tested.
* When doing a JOIN, please put the table with the largest number of rows
  containing the same join key rightmost in the JOIN clause. Otherwise you
  may see OutOfMemory errors.

FUTURE FEATURES
---------------
* EXPLODE function to generate multiple rows from a column of list type.
* Simpler syntax for running Map/Reduce scripts.
* ORDER BY and SORT BY.
* Table statistics for query optimization.

Developing Hive using Eclipse
-----------------------------
1. Set up the Hadoop development environment with Eclipse:
   http://wiki.apache.org/hadoop/EclipseEnvironment

2. Download the Hive source code from:
   http://mirror.facebook.com/facebook/hive

   If the Hadoop version is 0.17.x or 0.18.x, use
   http://mirror.facebook.com/facebook/hive/hadoop-0.17/

   If the Hadoop version is 0.19.x or later, or trunk, use
   http://mirror.facebook.com/facebook/hive/hadoop-0.19/

3. Extract the Hive source code to src/contrib/hive; make sure this file
   (README) is in src/contrib/hive.

4. In src/contrib/hive, run "ant package".

5. In src/contrib/hive, run "ant -logfile test.log test" to make sure
   everything works. This test may take 20 minutes.

6. Add the required library and source entries to the Eclipse project's
   .classpath file.

7. Develop using Eclipse.

Development Tips
----------------
* You may change the first line in conf/hive-log4j.properties to the
  following to see error messages on the console:
  hive.root.logger=INFO,console
  Otherwise you will see the error messages in /tmp/{user.name}/hive.log.
* You may use the following lines to test a specific testcase with a
  specific query file:
  ant -Dtestcase=TestParse -Dqfile=udf4.q test
  ant -Dtestcase=TestParseNegative -Dqfile=invalid_dot.q test
  ant -Dtestcase=TestCliDriver -Dqfile=udf1.q test
  ant -Dtestcase=TestNegativeCliDriver -Dqfile=invalid_tbl_name.q test

Property changes on: README.txt
___________________________________________________________________
Added: svn:mergeinfo
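
As a closing illustration, the multitable-insert form shown above can be
applied to the sample tables used throughout this README. The hypothetical
statement below (the local output directory is only an example) reads the
invites table in a single pass and produces two outputs, one into the events
table and one into a local directory:

FROM invites a
INSERT OVERWRITE TABLE events SELECT a.bar, a.foo WHERE a.foo < 100
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/invites_out' SELECT a.* WHERE a.foo >= 100

Because both inserts share the same FROM clause, the source table only needs
to be scanned once.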