+This is a brief example on creating a populating bucketed tables. Bucketed tables
+are fantastic in that they allow much more efficient sampling than do non-bucketed
+tables, and they may later allow for time saving operations such as mapside joins.
+However, the bucketing specified at table creation is not enforced when the table
+is written to, and so it is possible for the table's metadata to advertise
+properties which are not upheld by the table's actual layout. This should obviously
+be avoided. Here's how to do it right.
+
+
First there’s table creation:
+
+
+
+
notice that we define user_id as the bucket column
+
+
+
+
+
+
+
The command set hive.enforce.bucketing = true; allows the
+correct number of reducers and the cluster by column to be automatically selected
+based on the table. Otherwise, you would need to set the number of reducers to be
+the same as the number of buckets with
+set mapred.reduce.tasks = 256; and have a
+CLUSTER BY ... clause in the select.
+
+
+
+
+
+How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a '0x7FFFFFFF in there too, but that's not that important). The hash_function depends on the type of the bucketing column. For an int, it's easy, hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you a even distribution in the buckets.
+
+
+
+
+
+
+So, what can go wrong? As long as you
+set hive.enforce.bucketing = true, and use the syntax above,
+the tables should be populated properly. Things can go wrong if the bucketing
+column type is different during the insert and on read, or if you manually
+cluster by a value that's different from the table definition.
+
+CREATE TABLE creates a table with the given name. An error is thrown if a table or view with the same name already exists. You can use IF NOT EXISTS to skip the error.
+
+
+
+The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
+
+The LIKE form of CREATE TABLE allows you to copy an existing table definition exactly (without copying its data).
+
+
+You can create tables with custom SerDe or using native SerDe. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified. You can use the DELIMITED clause to read delimited files. Use the SERDE clause to create a table with custom SerDe. Refer to SerDe section of the User Guide for more information on SerDe.
+
+
+
+You must specify a list of a columns for tables that use a native SerDe. Refer to the Types part of the User Guide for the allowable column types. A list of columns for tables that use a custom SerDe may be specified but Hive will query the SerDe to determine the actual list of columns for this table.
+
+
+
+Use STORED AS TEXTFILE if the data needs to be stored as plain text files. Use STORED AS SEQUENCEFILE if the data needs to be compressed. Please read more about Hive/CompressedStorage if you are planning to keep data compressed in your Hive tables. Use INPUTFORMAT and OUTPUTFORMAT to specify the name of a corresponding InputFormat and OutputFormat class as a string literal, e.g. 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'.
+
+
+
+Partitioned tables can be created using the PARTITIONED BY clause. A table can have one or more partition columns and a separate data directory is created for each distinct value combination in the partition columns. Further, tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within that bucket via SORT BY columns. This can improve performance on certain kinds of queries.
+
+
+
+Table names and column names are case insensitive but SerDe and property names are case sensitive. Table and column comments are string literals (single-quoted). The TBLPROPERTIES clause allows you to tag the table definition with your own metadata key/value pairs.
+
+
+
A create table example:
+
+
+
The statement above creates the page_view table with viewTime, userid, page_url, referrer_url, and ip columns (including comments). The table is also partitioned and data is stored in sequence files. The data format in the files is assumed to be field-delimited by ctrl-A and row-delimited by newline.
+
+
+
+
+
+
+
+ Tables can also be created and populated by the results of a query in one create-table-as-select (CTAS) statement. The table created by CTAS is atomic, meaning that the table is not seen by other users until all the query results are populated. So other users will either see the table with the complete results of the query or will not see the table at all.
+
+
+
+ There are two parts in CTAS, the SELECT part can be any SELECT statement supported by HiveQL. The CREATE part of the CTAS takes the resulting schema from the SELECT part and creates the target table with other table properties such as the SerDe and storage format. The only restrictions in CTAS is that the target table cannot be a partitioned table (nor can it be an external table).
+
+
+
+
+
+
+
+
+
+This example CTAS statement creates the target table new_key_value_store with the
+schema (new_key DOUBLE, key_value_pair STRING) derived from the results of the
+SELECT statement. If the SELECT statement does not specify column aliases, the
+column names will be automatically assigned to _col0, _col1, and _col2 etc.
+In addition, the new target table is created using a specific SerDe and a storage
+format independent of the source tables in the SELECT statement.
+
+
+
+
+
+Being able to select data from one table to another is one of the most
+powerful features of Hive. Hive handles the conversion of the data from the source
+format to the destination format as the query is being executed!
+
+
+
+
+
+
+
+
+
In the example above, the page_view table is bucketed (clustered by) userid and within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column - in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries, also increasing efficiency. MAP KEYS and COLLECTION ITEMS keywords can be used if any of the columns are lists or maps.
+
+
+
+The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table -- only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query. See
+Working with Bucketed tables to see how these
+are used.
+
+
+
+
+
+
+
+Unless a table is specified as EXTERNAL it will be stored inside a folder specified by the
+configuration property hive.metastore.warehouse.dir.
+EXTERNAL tables points to any hdfs location for its storage. You still have to make sure that the data is format is specified to match the data.
+
+
+';
+ ]]>
+
+
+
+
+
+
The statement above creates a new empty_key_value_store table whose definition exactly matches the existing key_value_store in all particulars other than table name. The new table contains no rows.
+
The Hive Web Interface (HWI) is an alternative to the Command Line Interface. It is best used
+when providing CLI (shell) access to users is either difficult. HWI is also a simple way to run deteacted sessions. Users can
+authenticate to HWI and submit queries. Unlike the CLI a query will continue to run after a browser window is closed. The user
+can return back later to view the status of the query or review the results.
+
+
+
+
+Features of the CLI such as ADD FILE or ADD JAR are not able to access files on the local users computer. This is because
+the hive session lives on the web server. However if the file
+
+
+
Index: xdocs/velocity.properties
===================================================================
--- xdocs/velocity.properties (revision 0)
+++ xdocs/velocity.properties (revision 0)
@@ -0,0 +1 @@
+#
Index: xdocs/site.css
===================================================================
--- xdocs/site.css (revision 0)
+++ xdocs/site.css (revision 0)
@@ -0,0 +1,305 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+
+/** defined standard tags **/
+body {
+ background-color: #ffffff;
+ color: #000000;
+}
+
+a:link, a:active, a:visited {
+ color: #525D76;
+}
+
+
+h1 {
+ background-color: #525D76;
+ color: #ffffff;
+ font-family: arial,helvetica,sanserif;
+ font-size: large;
+ padding-left:2px;
+}
+
+h2 {
+ background-color: #828DA6;
+ color: #ffffff;
+ font-family: arial,helvetica,sanserif;
+ font-size: medium;
+ padding-left:2px;
+}
+
+table {
+ border: none;
+ border-spacing:0px;
+ border-collapse: collapse;
+}
+
+img {
+ border: none 0px;
+}
+
+/** define layout **/
+
+/** table used to force footer to end of page **/
+table#layout {
+ width:100%;
+}
+
+table#layout td {
+ padding:0px;
+}
+
+div#container {
+ width: 95%;
+ margin: 10px;
+ margin-left: 0;
+ margin-right: auto;
+ padding: 10px;
+}
+
+div#header {
+ padding: 5px;
+ margin: 0px;
+ margin-top:5px;
+ margin-bottom:5px;
+ height:80px;
+ border-bottom: 1px solid #333333;
+}
+
+div#menu {
+ float: left;
+ width: 200px;
+ margin: 0;
+ margin-left: 0px;
+ margin-right: 5px;
+
+ /** little higher margin since it doesn't start with a header **/
+ margin-top:10px;
+ margin-bottom:0px;
+
+ padding: 5px;
+}
+
+div#body {
+ margin-right:0px;
+ margin-left: 215px;
+ margin-top:5px;
+ margin-bottom:0px;
+
+ padding: 5px;
+
+}
+
+div#footer {
+
+ clear: both;
+
+ padding-top:15px;
+ margin-top:25px;
+ border-top: 1px solid #333333;
+
+
+ text-align:center;
+ color: #525D76;
+ font-style: italic;
+ font-size: smaller;
+}
+
+div#logo1 {
+ float:left;
+ margin-left:5px;
+ margin-top:10px;
+}
+
+
+div#logo2 {
+ float:right;
+ margin-top:10px;
+}
+
+
+/** define body tag redefinitions **/
+
+
+div#body th {
+ background-color: #039acc;
+ color: #000000;
+ font-family: arial,helvetica,sanserif;
+ font-size: smaller;
+ vertical-align: top;
+ text-align:left;
+ border:1px #FFFFFF solid;
+ padding: 2px;
+}
+
+div#body td {
+ background-color: #a0ddf0;
+ color: #000000;
+ font-family: arial,helvetica,sanserif;
+ font-size: smaller;
+ vertical-align: top;
+ text-align:left;
+ border:1px #FFFFFF solid;
+ padding: 2px;
+}
+
+
+div#body li {
+ margin-top:3px;
+}
+
+/** define other body styles **/
+
+div.section {
+ margin-left: 25px;
+}
+
+div.subsection {
+ margin-left: 25px;
+}
+
+div.source {
+ margin-left:25px;
+ margin-top:20px;
+ margin-bottom:20px;
+ padding-left:4px;
+ padding-right:4px;
+ padding-bottom:4px;
+ padding-top:5px;
+
+ width:600;
+
+ border: 1px solid #333333;
+ background-color: #EEEEEE;
+ color: #333333;
+
+ /** bug: puts a extra line before the block in IE and after the block in FireFox **/
+ white-space: pre;
+
+ font-family: Courier;
+ font-size: smaller;
+ text-align: left;
+
+ overflow:auto;
+}
+
+
+div.license {
+ margin-left:0px;
+ margin-top:20px;
+ margin-bottom:20px;
+ padding:5px;
+
+ border: 1px solid #333333;
+ background-color: #EEEEEE;
+ color: #333333;
+
+ text-align: left;
+}
+
+/** define menu styles **/
+
+div.menusection {
+ margin-bottom:10px;
+}
+
+.menuheader {
+ font-weight:bold;
+ margin-bottom:0px;
+}
+
+div.menusection ul {
+ margin-top:5px;
+
+}
+div.menusection li {
+
+}
+
+
+
+
+/** printing **/
+@page Section1
+ {
+ size:8.5in 11.0in;
+ margin:1.0in .75in 1.0in .75in;
+}
+
+@media print {
+
+ /** make sure this fits the page **/
+ div#container {
+ width:100%;
+ min-height:0px;
+ }
+
+
+ div#menu {
+ display:none;
+ }
+
+ div#header {
+ display:none;
+ }
+
+ div#body {
+ margin-left:5px;
+ }
+
+
+ div.source {
+ width:95%;
+ margin-left:0px;
+ }
+
+ /** make a bit more room on the page **/
+ div.section {
+ margin-left: 0px;
+ }
+
+ div.subsection {
+ margin-left: 0px;
+ }
+
+ h1 {
+ background-color: #FFFFFF;
+ color: #000000;
+ }
+
+ h2 {
+ background-color: #FFFFFF;
+ color: #000000;
+ }
+
+ div#body td {
+ background-color: #FFFFFF;
+ color: #000000;
+ border: #333333 1px solid;
+ }
+
+ div#body th {
+ background-color: #FFFFFF;
+ color: #000000;
+ border: #333333 1px solid;
+ font-style:bold;
+ }
+
+}
Index: xdocs/stylesheets/project.xml
===================================================================
--- xdocs/stylesheets/project.xml (revision 0)
+++ xdocs/stylesheets/project.xml (revision 0)
@@ -0,0 +1,54 @@
+
+
+
+
+
+ Hadoop Hive
+ Hadoop Hive
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Index: xdocs/stylesheets/site.vsl
===================================================================
--- xdocs/stylesheets/site.vsl (revision 0)
+++ xdocs/stylesheets/site.vsl (revision 0)
@@ -0,0 +1,317 @@
+## Licensed to the Apache Software Foundation (ASF) under one
+## or more contributor license agreements. See the NOTICE file
+## distributed with this work for additional information
+## regarding copyright ownership. The ASF licenses this file
+## to you under the Apache License, Version 2.0 (the
+## "License"); you may not use this file except in compliance
+## with the License. You may obtain a copy of the License at
+##
+## http://www.apache.org/licenses/LICENSE-2.0
+##
+## Unless required by applicable law or agreed to in writing,
+## software distributed under the License is distributed on an
+## "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+## KIND, either express or implied. See the License for the
+## specific language governing permissions and limitations
+## under the License.
+
+
+
+
+
+#document()
+
+
+## This is where the macro's live
+
+#macro ( table $table)
+