Index: src/docs/src/documentation/content/xdocs/supportedformats.xml
===================================================================
--- src/docs/src/documentation/content/xdocs/supportedformats.xml  (revision 1300723)
+++ src/docs/src/documentation/content/xdocs/supportedformats.xml  (working copy)
@@ -22,8 +22,10 @@
HCatalog can read PigStorage and RCFile formatted files. The input drivers for the formats are PigStorageInputDriver and RCFileInputDriver respectively. HCatalog currently produces only RCFile formatted output. The output driver for the same is RCFileOutputDriver.
+As of version 0.4, HCatalog uses Hive's SerDe class to serialize and deserialize data. SerDes are provided for RCFile, CSV text, JSON text, and SequenceFile formats.
-Hive and HCatalog applications can interoperate (each can read the output of the other) as long as they use a common format. Currently, the only common format is RCFile.
+Users can write SerDes for custom formats using the instructions at https://cwiki.apache.org/confluence/display/Hive/SerDe.
+
+
Index: src/docs/src/documentation/content/xdocs/dynpartition.xml
===================================================================
--- src/docs/src/documentation/content/xdocs/dynpartition.xml  (revision 1300723)
+++ src/docs/src/documentation/content/xdocs/dynpartition.xml  (working copy)
@@ -27,7 +27,7 @@
 In earlier versions of HCatalog, to read data users could specify that they were interested in reading from the table and specify various partition key/value combinations to prune, as if specifying a SQL-like where clause. However, to write data the abstraction was not as seamless. We still required users to write out data to the table, partition-by-partition, but these partitions required fine-grained knowledge of which key/value pairs they needed. We required this knowledge in advance, and we required the user to have already grouped the requisite data accordingly before attempting to store.
+When writing data in HCatalog it is possible to write all records to a single partition. In this case the partition column(s) need not be in the output data.
The following Pig script illustrates this:
This approach had a major issue. MapReduce programs and Pig scripts needed to be aware of all the possible values of a key, and these values needed to be maintained and/or modified when new values were introduced. With more partitions, scripts began to look cumbersome. And if each partition being written launched a separate HCatalog store, we were increasing the load on the HCatalog server and launching more jobs for the store by a factor of the number of partitions.
+In cases where you want to write data to multiple partitions simultaneously, this can be done by placing partition columns in the data and not specifying partition values when storing the data.
-A better approach is to have HCatalog determine all the partitions required from the data being written. This would allow us to simplify the above script into the following:
-The way dynamic partitioning works is that HCatalog locates partition columns in the data passed to it and uses the data in these columns to split the rows across multiple partitions. (The data passed to HCatalog must have a schema that matches the schema of the destination table and hence should always contain partition columns.) It is important to note that partition columns can’t contain null values or the whole process will fail. It is also important note that all partitions created during a single run are part of a transaction and if any part of the process fails none of the partitions will be added to the table.
+The way dynamic partitioning works is that HCatalog locates partition columns in the data passed to it and uses the data in these columns to split the rows across multiple partitions. (The data passed to HCatalog must have a schema that matches the schema of the destination table and hence should always contain partition columns.) It is important to note that partition columns can’t contain null values or the whole process will fail.
+It is also important to note that all partitions created during a single run are part of a transaction and if any part of the process fails none of the partitions will be added to the table.
On the other hand, if there is data that spans more than one partition, then HCatOutputFormat will automatically figure out how to spray the data appropriately.
-For example, let's say a=1 for all values across our dataset and b takes the value 1 and 2. Then the following statement...
+For example, let's say a=1 for all values across our dataset and b takes the values 1 and 2. Then the following statement...
And to write to multiple partitions, separate jobs will have to be kicked off with each of the above.
-With dynamic partition, we simply specify only as many keys as we know about, or as required. It will figure out the rest of the keys by itself and spray out necessary partitions, being able to create multiple partitions with a single job.
+With dynamic partitioning, we simply specify only as many keys as we know about, or as required. It will figure out the rest of the keys by itself and spray out necessary partitions, being able to create multiple partitions with a single job.
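For illustration, the sketch below shows what a dynamically partitioned write can look like from a MapReduce driver, assuming the HCatalog 0.4 Java API (HCatOutputFormat, OutputJobInfo); the database name mydb and table name processed are placeholders, and exact method signatures may vary by release. Passing a null partition key/value map asks HCatalog to partition dynamically from the partition columns carried in the data.

// Sketch only: dynamically partitioned write via the HCatalog 0.4 MapReduce API.
// "mydb" and "processed" are placeholder database and table names.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hcatalog.mapreduce.OutputJobInfo;

public class DynamicPartitionWrite {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "dynamic-partition-write");

        // A null partition key/value map requests dynamic partitioning:
        // HCatalog splits rows across partitions using the partition
        // column values found in the data itself.
        HCatOutputFormat.setOutput(job,
                OutputJobInfo.create("mydb", "processed", null));

        // The output schema must match the table schema, including the
        // partition columns.
        HCatSchema schema = HCatOutputFormat.getTableSchema(job);
        HCatOutputFormat.setSchema(job, schema);
        job.setOutputFormatClass(HCatOutputFormat.class);

        // ... configure input, mapper, and reducer as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}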
-
-
-Dynamic partitioning potentially results in a large number of files and more namenode load. To address this issue, we utilize HAR to archive partitions after writing out as part of the HCatOutputCommitter action. Compaction is disabled by default. To enable compaction, use the Hive parameter hive.archive.enabled, specified in the client side hive-site.xml. The current behavior of compaction is to fail the entire job if compaction fails.
-Prerequisites
 Throughout these instructions when you see a word in italics it
-  indicates a place where you should replace the word with a
+  indicates a place where you should replace the word with an
   appropriate value such as a hostname or password.
Thrift Server Install
@@ -66,7 +66,7 @@
 machine as the Thrift server. For large clusters we recommend that
 they not be the same machine. For the purposes of these instructions
 we will refer to this machine as
- hcatdb.acme.com
+ hcatdb.acme.com.
 Install MySQL server on hcatdb.acme.com. You can obtain
packages for MySQL from MySQL's
@@ -85,7 +85,7 @@
-Thrift server config
+Thrift Server Configuration
 Now you need to edit your
-Server activity logs and gc logs are located in
+ Server activity logs are located in
 Prerequisites
 Throughout these instructions when you see a word in italics it
indicates a place where you should replace the word with a locally
 appropriate value such as a hostname or password.
+Building a tarball
+If you downloaded HCatalog from Apache or another site as a source release,
+ you will need to first build a tarball to install. You can tell if you have
+ a source release by looking at the name of the object you downloaded. If
+ it is named hcatalog-src-0.4.0-incubating.tar.gz (notice the
+ src in the name) then you have a source release. If you do not already have Apache Ant installed on your machine, you
+ will need to obtain it. You can get it from the
+ Apache Ant website. Once you download it, you will need to unpack it
+ somewhere on your machine. The directory where you unpack it will be referred
+ to as ant_home in this document. If you do not already have Apache Forrest installed on your machine, you
+ will need to obtain it. You can get it from the
+ Apache Forrest website. Once you download it, you will need to unpack
+ it somewhere on your machine. The directory where you unpack it will be referred
+ to as forrest_home in this document.
+To produce a tarball from this, do the following:
+Create a directory to expand the source release in. Copy the source
+ release to that directory and unpack it.
+Change directories into the unpacked source release and build the
+ installation tarball.
+The tarball for installation should now be at
+
 Database Setup
 Select a machine to install the database on. This need not be the same
@@ -65,12 +104,12 @@
-In a temporary directory, untar the HCatalog artifact
+In a temporary directory, untar the HCatalog installation tarball.
 Use the database installation script found in the package to create the
 database
mysql> quit;
mysql -u hive -D hivemetastoredb -hhcatdb.acme.com -p < /usr/share/hcatalog/scripts/hive-schema-0.7.0.mysql.sql
/etc/hcatalog/hive-site.xml file.
Open this file in your favorite text editor. The following table shows the
values you need to configure.
hive.metastore.uris
- You need to set the hostname to your Thrift
- server. Replace SVRHOST with the name of the
+ Set the hostname of your Thrift
+ server by replacing SVRHOST with the name of the
machine you are installing the Thrift server on.
hive.metastore.sasl.enabled
- Set to false by default. Set to true if its a secure environment.
+ Set to false by default. Set to true if it is a secure environment.
hive.metastore.kerberos.keytab.file
- The path to the Kerberos keytab file containg the metastore
- thrift server's service principal. Need to set only in secure enviroment.
+ The path to the Kerberos keytab file containing the metastore
+ Thrift server's service principal. This needs to be set only in a secure environment.
@@ -142,13 +142,13 @@
hive.metastore.kerberos.principal
- The service principal for the metastore thrift server. You can
+ The service principal for the metastore Thrift server. You can
reference your host as _HOST and it will be replaced with
- actual hostname. Need to set only in secure environment.
+ the actual hostname. This needs to be set only in a secure environment.
-sudo service start hcatalog-server
+sudo service hcatalog-server start
/var/log/hcat_server. Logging configuration is located at
/etc/hcatalog/log4j.properties. Server logging uses
DailyRollingFileAppender by default. It will generate a new
@@ -158,7 +158,7 @@
-sudo service stop hcatalog-server
+sudo service hcatalog-server stop
hive.metastore.uris
- You need to set the hostname wish your Thrift
- server to use by replacing SVRHOST with the name of the
+ Set the hostname of your Thrift
+ server by replacing SVRHOST with the name of the
machine you are installing the Thrift server on.
hive.metastore.sasl.enabled
- Set to false by default. Set to true if its a secure environment.
+ Set to false by default. Set to true if it is a secure environment.
Index: src/docs/src/documentation/content/xdocs/install.xml
===================================================================
--- src/docs/src/documentation/content/xdocs/install.xml (revision 1300723)
+++ src/docs/src/documentation/content/xdocs/install.xml (working copy)
@@ -24,23 +24,62 @@
hive.metastore.kerberos.principal
- The service principal for the metastore thrift server. You can
+ The service principal for the metastore Thrift server. You can
reference your host as _HOST and it will be replaced with
actual hostname. Need to set only in secure environment.
+
+mkdir /tmp/hcat_source_release
+cp hcatalog-src-0.4.0-incubating.tar.gz /tmp/hcat_source_release
+cd /tmp/hcat_source_release
+tar xzf hcatalog-src-0.4.0-incubating.tar.gz
+cd hcatalog-src-0.4.0-incubating
+ant_home/bin/ant -Dhcatalog.version=0.4.0 -Dforrest.home=forrest_home tar
+build/hcatalog-0.4.0.tar.gz
 mysql> flush privileges;
 mysql> quit;
-tar xzf hcatalog-version.tar.gz
+tar xzf hcatalog-0.4.0.tar.gz
mysql -u hive -D hivemetastoredb -hhcatdb.acme.com -p < share/hcatalog/hive/external/metastore/scripts/upgrade/mysql/hive-schema-0.7.0.mysql.sql
Thrift Server Setup
@@ -95,7 +134,7 @@
 directory must be owned by the hcat user. We recommend
 /usr/local/hcat. If necessary, create the directory.
- Download the HCatalog release into a temporary directory, and untar
+ Copy the HCatalog installation tarball into a temporary directory, and untar
 it. Then change directories into the new distribution and run the
 HCatalog server installation script. You will need to know the
 directory you chose as root and the
@@ -105,8 +144,8 @@
 the port number you wish HCatalog to operate on, which you will use to set portnum.
-tar zxf hcatalog-version.tar.gz
- cd hcatalog-version
+tar zxf hcatalog-0.4.0.tar.gz
+cd hcatalog-0.4.0
share/hcatalog/scripts/hcat_server_install.sh -r root -d dbroot -h hadoop_home -p portnum
Now you need to edit your root/etc/hcatalog/hive-site.xml file.
@@ -126,20 +165,20 @@
-- If default hdfs was specified in core-site.xml, path resolves to HDFS location.
-- Otherwise, path is resolved as local file: URI.
-This setting becomes effective when creating new tables (takes precedence over default DBS.DB_LOCATION_URI at time of table creation).
+This setting becomes effective when creating new tables (it takes precedence over default DBS.DB_LOCATION_URI at the time of table creation).
-Server activity logs and gc logs are located in
+Server activity logs are located in
root/var/log/hcat_server. Logging configuration is located at
root/conf/log4j.properties. Server logging uses
DailyRollingFileAppender by default. It will generate a new
@@ -211,7 +250,7 @@
Select a root directory for your installation of HCatalog client.
We recommend /usr/local/hcat. If necessary, create the directory.
- Download the HCatalog release into a temporary directory, and untar
+ Copy the HCatalog installation tarball into a temporary directory, and untar
 it.
tar zxf hcatalog-version.tar.gz
-- If default hdfs was specified in core-site.xml, path resolves to HDFS location.
-- Otherwise, path is resolved as local file: URI.
-This setting becomes effective when creating new tables (takes precedence over default DBS.DB_LOCATION_URI at time of table creation).
+This setting becomes effective when creating new tables (it takes precedence over default DBS.DB_LOCATION_URI at the time of table creation).
-In HCatalog 2.0 we introduce notifications for certain events happening in the system. This way applications such as Oozie can wait for those events and schedule the work that depends on them. The current version of HCatalog supports two kinds of events:
+HCatalog 0.2 provides notifications for certain events happening in the system. This way applications such as Oozie can wait for those events and schedule the work that depends on them. The current version of HCatalog supports two kinds of events:
-No additional work is required to send a notification when a new partition is added: the existing addPartition call will send the notification message. This means that your existing code, when running with 0.2, will automatically send the notifications.
+No additional work is required to send a notification when a new partition is added: the existing addPartition call will send the notification message.
2. Subscribe to a topic you are interested in. When subscribing on a message bus, you need to subscribe to a particular topic to receive the messages that are being delivered on that topic.
-The topic name corresponding to a particular table is stored in table properties and can be retrieved using following piece of code:
+The topic name corresponding to a particular table is stored in table properties and can be retrieved using the following piece of code:
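A sketch of that lookup follows, assuming the HiveMetaStoreClient API and the HCatConstants.HCAT_MSGBUS_TOPIC_NAME property key; the database and table names passed in are placeholders.

// Sketch only: read the notification topic name from the table properties.
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hcatalog.common.HCatConstants;

public class TopicLookup {
    public static String topicFor(String db, String table) throws Exception {
        // Connect to the metastore and read the topic name stored in
        // the table's properties.
        HiveMetaStoreClient msc = new HiveMetaStoreClient(new HiveConf());
        return msc.getTable(db, table)
                .getParameters()
                .get(HCatConstants.HCAT_MSGBUS_TOPIC_NAME);
    }
}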
-You need to have a JMS jar in your classpath to make this work. Additionally, you need to have a JMS provider’s jar in your classpath. HCatalog uses ActiveMQ as a JMS provider. In principle, any JMS provider can be used in client side; however, ActiveMQ is recommended. ActiveMQ can be obtained from: http://activemq.apache.org/activemq-550-release.html
+You need to have a JMS jar in your classpath to make this work. Additionally, you need to have a JMS provider’s jar in your classpath. HCatalog is tested with ActiveMQ as a JMS provider, although any JMS provider can be used. ActiveMQ can be obtained from: http://activemq.apache.org/activemq-550-release.html.
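For illustration, here is a sketch of subscribing over plain JMS with an ActiveMQ broker; the broker URL is a placeholder, and topicName is the value retrieved from the table properties as shown above.

// Sketch only: subscribe to the table's notification topic over JMS.
import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageListener;
import javax.jms.Session;
import javax.jms.Topic;
import org.apache.activemq.ActiveMQConnectionFactory;

public class PartitionListener {
    public static void listen(String topicName) throws Exception {
        // The broker URL below is a placeholder for your ActiveMQ broker.
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://broker.acme.com:61616");
        Connection conn = factory.createConnection();
        conn.start();
        Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Topic topic = session.createTopic(topicName);
        MessageConsumer consumer = session.createConsumer(topic);
        consumer.setMessageListener(new MessageListener() {
            public void onMessage(Message message) {
                // React to the HCatalog notification here, e.g. trigger
                // the job that depends on the new partition.
            }
        });
    }
}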
Sometimes a user wants to wait until a collection of partitions is finished. For example, you may want to start processing after all partitions for a day are done. However, HCatalog has no notion of collections or hierarchies of partitions. To support this, HCatalog allows data writers to signal when they are finished writing a collection of partitions. Data readers may wait for this signal before beginning to read.
+The example code below illustrates how to send a notification when a set of partitions has been added.
+To signal, a data writer does this:
+
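A sketch of that signal, assuming the metastore client's markPartitionForEvent call with the LOAD_DONE event type; the database name, table name, and partition values are placeholders.

// Sketch only: a writer marks a set of partitions as done.
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.PartitionEventType;

public class SignalPartitionsDone {
    public static void main(String[] args) throws Exception {
        HiveMetaStoreClient msc = new HiveMetaStoreClient(new HiveConf());

        // Mark the partitions matching date=20110711 as done, so that
        // readers waiting on the LOAD_DONE event can start processing.
        Map<String, String> partKVs = new HashMap<String, String>();
        partKVs.put("date", "20110711");
        msc.markPartitionForEvent("mydb", "mytbl", partKVs, PartitionEventType.LOAD_DONE);
    }
}

On the read side, one option (again an assumption about the client API) is to poll the corresponding isPartitionMarkedForEvent call with the same arguments, or simply to listen for the message on the table's topic as shown earlier.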