diff --git README.txt README.txt
index b46ffd4..93ece80 100644
--- README.txt
+++ README.txt
@@ -1,379 +1,105 @@
-How to Hive
------------
+Apache Hive 0.6.0
+=================
-DISCLAIMER: This is a prototype version of Hive and is NOT production
-quality. This is provided mainly as a way of illustrating the
-capabilities of Hive and is provided as-is. However - we are working
-hard to make Hive a production quality system. Hive has only been
-tested on Unix(Linux) and Mac OS X using Java 1.6 for now - although
-it may very well work on other similar platforms. It does not work on
-Cygwin. Most of our testing has been on Hadoop 0.17 - so we would
-advise running it against this version of hadoop - even though it may
-compile/work against other versions
+Hive is a data warehouse system for Hadoop that facilitates
+easy data summarization, ad-hoc querying and analysis of large
+datasets stored in Hadoop compatible file systems. Hive provides a
+mechanism to put structure on this data and query the data using a
+SQL-like language called HiveQL. At the same time this language also
+allows traditional map/reduce programmers to plug in their custom
+mappers and reducers when it is inconvenient or inefficient to express
+this logic in HiveQL.
-Useful mailing lists
---------------------
-
-1. hive-user@hadoop.apache.org - To discuss and ask usage
-   questions. Send an empty email to
-   hive-user-subscribe@hadoop.apache.org in order to subscribe to this
-   mailing list.
-
-2. hive-core@hadoop.apache.org - For discussions around code, design
-   and features. Send an empty email to
-   hive-core-subscribe@hadoop.apache.org in order to subscribe to this
-   mailing list.
-
-3. hive-commits@hadoop.apache.org - In order to monitor the commits to
-   the source repository. Send an empty email to
-   hive-commits-subscribe@hadoop.apache.org in order to subscribe to
-   this mailing list.
-
-Downloading and building
-------------------------
-
-- svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive_trunk
-- cd hive_trunk
-- hive_trunk> ant package -Dtarget.dir= -Dhadoop.version=0.17.2.1
-
-If you do not specify a value for target.dir it will use the default
-value build/dist. Then you can run the unit tests if you want to make
-sure it is correctly built:
-
-- hive_trunk> ant test -Dtarget.dir= -Dhadoop.version=0.17.2.1 \
-  -logfile test.log
-
-You can replace 0.17.2.1 with 0.18.3, 0.19.0, etc. to match the
-version of hadoop that you are using. If you do not specify
-hadoop.version it will use the default value specified in
-build.properties file.
-
-
-In the rest of the README, we use dist and interchangeably.
-
-Running Hive
-------------
-
-Hive uses hadoop that means:
-- you must have hadoop in your path OR
-- export HADOOP=/bin/hadoop
-
-To use hive command line interface (cli) from the shell:
-$ cd
-$ bin/hive
-
-Using Hive
-----------
-
-Configuration management overview
----------------------------------
-
-- hive configuration is stored in /conf/hive-default.xml
-  and log4j in hive-log4j.properties
-- hive configuration is an overlay on top of hadoop - meaning the
-  hadoop configuration variables are inherited by default.
-- hive configuration can be manipulated by:
-  o editing hive-default.xml and defining any desired variables
-    (including hadoop variables) in it
-  o from the cli using the set command (see below)
-  o by invoking hive using the syntax:
-    * bin/hive -hiveconf x1=y1 -hiveconf x2=y2
-      this sets the variables x1 and x2 to y1 and y2
-
-
-Error Logs
-----------
-Hive uses log4j for logging. By default logs are not emitted to the
-console by the cli. They are stored in the file:
-- /tmp/{user.name}/hive.log
-
-If the user wishes - the logs can be emitted to the console by adding
-the arguments shown below:
-- bin/hive -hiveconf hive.root.logger=INFO,console
-
-Note that setting hive.root.logger via the 'set' command does not
-change logging properties since they are determined at initialization time.
-
-Error logs are very useful to debug problems. Please send them with
-any bugs (of which there are many!) to the JIRA at
-https://issues.apache.org/jira/browse/HIVE.
-
-
-DDL Operations
---------------
-
-Creating Hive tables and browsing through them
-
-hive> CREATE TABLE pokes (foo INT, bar STRING);
-
-Creates a table called pokes with two columns, first being an
-integer and other a string columns
-
-hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
-
-Creates a table called invites with two columns and a partition column
-called ds. The partition column is a virtual column. It is not part
-of the data itself, but is derived from the partition that a
-particular dataset is loaded into.
-
-
-By default tables are assumed to be of text input format and the
-delimiters are assumed to be ^A(ctrl-a). We will be soon publish
-additional commands/recipes to add binary (sequencefiles) data and
-configurable delimiters etc.
-
-hive> SHOW TABLES;
-
-lists all the tables
-
-hive> SHOW TABLES '.*s';
-
-lists all the table that end with 's'. The pattern matching follows Java
-regular expressions. Check out this link for documentation
-http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html
-
-hive> DESCRIBE invites;
-
-shows the list of columns
-
-hive> DESCRIBE EXTENDED invites;
-
-shows the list of columns plus any other meta information about the table
-
-Altering tables. Table name can be changed and additional columns can be dropped
-
-hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
-hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
-hive> ALTER TABLE pokes REPLACE COLUMNS (c1 INT, c2 STRING);
-hive> ALTER TABLE events RENAME TO 3koobecaf;
-
-Dropping tables
-hive> DROP TABLE pokes;
+Please note that Hadoop is a batch processing system and Hadoop jobs
+tend to have high latency and incur substantial overheads in job
+submission and scheduling. Consequently the average latency for Hive
+queries is generally very high (minutes) even when data sets involved
+are very small (say a few hundred megabytes). As a result it cannot be
+compared with systems such as Oracle where analyses are conducted on a
+significantly smaller amount of data but the analyses proceed much
+more iteratively with the response times between iterations being less
+than a few minutes. Hive aims to provide acceptable (but not optimal)
+latency for interactive data browsing, queries over small data sets or
+test queries.
+Hive is not designed for online transaction processing and does not
+support real-time queries or row level insert/updates. It is best used
+for batch jobs over large sets of immutable data (like web logs). What
+Hive values most are scalability (scale out with more machines added
+dynamically to the Hadoop cluster), extensibility (with MapReduce
+framework and UDF/UDAF/UDTF), fault-tolerance, and loose-coupling with
+its input formats.
-Metadata Store
---------------
-Metadata is in an embedded Derby database whose location is determined by the
-hive configuration variable named javax.jdo.option.ConnectionURL. By default
-(see conf/hive-default.xml) - this location is ./metastore_db
+General Info
+============
-Right now, in the default configuration, this metadata can only be seen by
-one user at a time.
+For the latest information about Hive, please visit our website at:
-Metastore can be stored in any database that is supported by JPOX. The
-location and the type of the RDBMS can be controlled by the two variables
-'javax.jdo.option.ConnectionURL' and 'javax.jdo.option.ConnectionDriverName'.
-Refer to JDO (or JPOX) documentation for more details on supported databases.
-The database schema is defined in JDO metadata annotations file package.jdo
-at src/contrib/hive/metastore/src/model.
+  http://hive.apache.org/
-In the future, the metastore itself can be a standalone server.
+and our wiki at:
+  http://wiki.apache.org/hadoop/Hive/
-DML Operations
---------------
-Loading data from flat files into Hive
+Getting Started
+===============
-hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
+- Installation Instructions and a quick tutorial:
+  http://wiki.apache.org/hadoop/Hive/GettingStarted
-Loads a file that contains two columns separated by ctrl-a into pokes
-table. 'local' signifies that the input file is on the local
-system. If 'local' is omitted then it looks for the file in HDFS.
+- A longer tutorial that covers more features of HiveQL:
+  http://wiki.apache.org/hadoop/Hive/Tutorial
-The keyword 'overwrite' signifies that existing data in the table is
-deleted. If the 'overwrite' keyword is omitted - then data files are
-appended to existing data sets.
+- The HiveQL Language Manual:
+  http://wiki.apache.org/hadoop/Hive/LanguageManual
-NOTES:
-- NO verification of data against the schema
+Requirements
+============
-- If the file is in hdfs it is moved into hive controlled file system
-  namespace. The root of the hive directory is specified by the
-  option hive.metastore.warehouse.dir in hive-default.xml. We would
-  advise that this directory be pre-existing before trying to create
-  tables via Hive.
+- Java 1.6
-hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' \
-  OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
+- Hadoop 0.17, 0.18, 0.19, or 0.20.
-hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' \
-  OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
+  *NOTE*: We strongly recommend that you use Hadoop 0.20
+  since the majority of our testing is done against this
+  version and because support for pre-0.20 versions of
+  Hadoop will be dropped in Hive 0.7.
-The two LOAD statements above load data into two different partitions
-of the table invites. Table invites must be created as partitioned by
-the key ds for this to succeed.
-Loading/Extracting data using Queries
--------------------------------------
+Upgrading from older versions of Hive
+=====================================
-Runtime configuration
----------------------
+- Hive 0.6.0 includes changes to the MetaStore schema. If
+  you are upgrading from an earlier version of Hive it is
+  imperative that you upgrade the MetaStore schema by
+  running the appropriate schema upgrade script located in
+  the scripts/metastore/upgrade directory.
-- Hives queries are executed using map-reduce queries and as such the
-  behavior of such queries can be controlled by the hadoop
-  configuration variables
+  We have provided upgrade scripts for Derby, MySQL, and PostgreSQL
+  databases. If you are using a different database for your MetaStore
+  you will need to provide your own upgrade script.
-- The cli can be used to set any hadoop (or hive) configuration
-  variable. For example:
+- Hive 0.6.0 includes new configuration properties. If you
+  are upgrading from an earlier version of Hive it is imperative
+  that you replace all of the old copies of the hive-default.xml
+  configuration file with the new version located in the conf/
+  directory.
-    hive> SET mapred.job.tracker=myhost.mycompany.com:50030
-    hive> SET -v
-  The latter shows all the current settings. Without the -v option only
-  the variables that differ from the base hadoop configuration are
-  displayed
-
-- In particular the number of reducers should be set to a reasonable
-  number to get good performance (the default is 1!)
-
-
-EXAMPLE QUERIES
----------------
-
-Some example queries are shown below. More are available in the hive
-code: ql/src/test/queries/{positive,clientpositive}.
-
-SELECTS and FILTERS
--------------------
-
-hive> SELECT a.foo FROM invites a;
-
-Select column 'foo' from all rows of invites table. The results are
-not stored anywhere, but are displayed on the console.
-
-Note that in all the examples that follow, INSERT (into a hive table,
-local directory or HDFS directory) is optional.
-
-hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a;
-
-Select all rows from invites table into an HDFS directory. The result
-data is in files (depending on the number of mappers) in that
-directory.
-
-NOTE: partition columns if any are selected by the use of *. They can
-also be specified in the projection clauses.
-
-hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
-
-Select all rows from pokes table into a local directory
-
-hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
-hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
-hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
-hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
-hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a;
-hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
-hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
-
-Sum of a column. avg, min, max can also be used
-
-NOTE: there are some flaws with the type system that cause doubles to
-be returned with integer types would be expected. We expect to fix
-these in the coming week.
-
-GROUP BY
---------
-
-hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
-hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
-
-NOTE: Currently Hive always uses two stage map-reduce for groupby
-operation. This is to handle skews in input data. We will be
-optimizing this in the coming weeks.
-
-JOIN
-----
-
-hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo
-
-MULTITABLE INSERT
------------------
-FROM src
-INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
-INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
-INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
-INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300
-
-STREAMING
----------
-hive> FROM invites a INSERT OVERWRITE TABLE events
-    > MAP a.foo, a.bar USING '/bin/cat'
-    > AS oof, rab WHERE a.ds > '2008-08-09';
-
-This streams the data in the map phase through the script /bin/cat
-(like hadoop streaming). Similarly, streaming can be used on the
-reduce side. Please look for files mapreduce*.q.
-
-KNOWN BUGS/ISSUES
------------------
-
-* Hive cli may hang for a couple of minutes because of a bug in
-  getting metadata from the derby database. let it run and you'll be
-  fine!
-
-* Hive cli creates derby.log in the directory from which it has been
-  invoked.
-
-* COUNT(*) does not work for now. Use COUNT(1) instead.
-
-* ORDER BY not supported yet.
-
-* CASE not supported yet.
-
-* Only string and Thrift types (http://developers.facebook.com/thrift)
-  have been tested.
-
-* When doing Join, please put the table with big number of rows
-  containing the same join key to the rightmost in the JOIN
-  clause. Otherwise we may see OutOfMemory errors.
-
-FUTURE FEATURES
----------------
-* EXPLODE function to generate multiple rows from a column of list type.
-* Table statistics for query optimization.
-
-Developing Hive using Eclipse
------------------------------
-1. Follow the 3 steps in "Downloading and building" section above
-
-2. Generate test files for eclipse:
-
-   hive_trunk> cd metastore; ant model-jar; cd ..
-   hive_trunk> cd ql; ant gen-test; cd ..
-   hive_trunk> ant eclipse-files
-
-   This will generate .project, .settings, .externalToolBuilders,
-   .classpath, and a set of *.launch files for Eclipse-based
-   debugging.
-
-3. Launch Eclipse and import the Hive project by navigating to
-   File->Import->General->Existing Projects menu and selecting
-   the hive_trunk directory.
-
-4. Run the CLI or a set of test cases by right clicking on one of the
-   *.launch configurations and selection Run or Debug.
-
-Enabling Checkstyle Plugin in Eclipse
--------------------------------------
-1. Follow the steps in "Developing Hive using Eclipse" to import
-   Hive project in your Eclipse workbench.
+Useful mailing lists
+====================
-2. If you do not have Checkstyle plugin for eclipse, install it
-   before proceeding to next step. For more information refer to
-   http://eclipse-cs.sourceforge.net/.
+1. user@hive.apache.org - To discuss and ask usage questions. Send an
+   empty email to user-subscribe@hive.apache.org in order to subscribe
+   to this mailing list.
-3. In the package navigator, select the hive project, right-click
-   and select Checkstyle > Activate Checkstyle. This will cause
-   the checkstyle plugin to activate and analyze the project sources.
-
+2. dev@hive.apache.org - For discussions about code, design and features.
+   Send an empty email to dev-subscribe@hive.apache.org in order to subscribe
+   to this mailing list.
-Development Tips
-------------------------
-* You may use the following line to test a specific testcase with a specific query file.
-ant -Dhadoop.version='0.17.0' -Dtestcase=TestParse -Dqfile=udf4.q test
-ant -Dhadoop.version='0.17.0' -Dtestcase=TestParseNegative -Dqfile=invalid_dot.q test
-ant -Dhadoop.version='0.17.0' -Dtestcase=TestCliDriver -Dqfile=udf1.q test
-ant -Dhadoop.version='0.17.0' -Dtestcase=TestNegativeCliDriver -Dqfile=invalid_tbl_name.q test
+3. commits@hive.apache.org - In order to monitor commits to the source
+   repository. Send an empty email to commits-subscribe@hive.apache.org
+   in order to subscribe to this mailing list.
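
A hedged sketch of the MetaStore schema upgrade described in the new
"Upgrading from older versions of Hive" section above. The database
name, user name, and script file name shown here are illustrative
assumptions and not taken from the README - check the
scripts/metastore/upgrade directory of your release for the exact
script names for your database. Backing up the MetaStore database
before running any upgrade script is prudent.

  # assumes a MySQL-backed MetaStore in a database named 'metastore',
  # owned by user 'hiveuser' (both names are hypothetical)
  $ cd scripts/metastore/upgrade/mysql
  $ mysql -u hiveuser -p metastore < upgrade-0.5.0-to-0.6.0.mysql.sql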