Index: src/docs/src/documentation/content/xdocs/inputoutput.xml =================================================================== --- src/docs/src/documentation/content/xdocs/inputoutput.xml (revision 1301023) +++ src/docs/src/documentation/content/xdocs/inputoutput.xml (working copy) @@ -28,60 +28,60 @@
No HCatalog-specific setup is required for the HCatInputFormat and HCatOutputFormat interfaces.
-Authentication
-If a failure results in a message like "2010-11-03 16:17:28,225 WARN hive.metastore ... - Unable to connect metastore with URI thrift://..." in /tmp/<username>/hive.log, then make sure you have run "kinit <username>@FOO.COM" to get a kerberos ticket and to be able to authenticate to the HCatalog server. |
-
The HCatInputFormat is used with MapReduce jobs to read data from HCatalog managed tables.
-HCatInputFormat exposes a new Hadoop 20 MapReduce API for reading data as if it had been published to a table. If a MapReduce job uses this InputFormat to write output, the default InputFormat configured for the table is used as the underlying InputFormat and the new partition is published to the table after the job completes. Also, the maximum number of partitions that a job can work on is limited to 100K.
+HCatInputFormat exposes a Hadoop 0.20 MapReduce API for reading data as if it had been published to a table.
The API exposed by HCatInputFormat is shown below.
-To use HCatInputFormat to read data, first instantiate a HCatTableInfo with the necessary information from the table being read
- and then call setInput on the HCatInputFormat.
To use HCatInputFormat to read data, first instantiate an InputJobInfo with the necessary information from the table being read
+ and then call setInput with the InputJobInfo.
You can use the setOutputSchema method to include a projection schema, to specify specific output fields. If a schema is not specified, this default to the table level schema.
You can use the setOutputSchema method to include a projection schema, to
+specify the output fields. If a schema is not specified, all the columns in the table
+will be returned.
You can use the getTableSchema methods to determine the table schema for a specified input table.
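For illustration, here is a minimal sketch of the read-side setup (the classes come from org.apache.hcatalog.mapreduce and org.apache.hcatalog.data.schema; the database, table, and column names are invented for the example, and exception handling is omitted):

Job job = new Job(new Configuration(), "read-example");
// Database name, table name, and partition filter (null means read all partitions).
HCatInputFormat.setInput(job, InputJobInfo.create("default", "web_logs", null));
job.setInputFormatClass(HCatInputFormat.class);
// Optional projection: only hand the listed columns to the mappers.
HCatSchema tableSchema = HCatInputFormat.getTableSchema(job);
List<HCatFieldSchema> projection = new ArrayList<HCatFieldSchema>();
projection.add(tableSchema.get("user"));
projection.add(tableSchema.get("datestamp"));
HCatInputFormat.setOutputSchema(job, new HCatSchema(projection));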
HCatOutputFormat is used with MapReduce jobs to write data to HCatalog managed tables.
-HCatOutputFormat exposes a new Hadoop 20 MapReduce API for writing data to a table. If a MapReduce job uses this OutputFormat to write output, the default OutputFormat configured for the table is used as the underlying OutputFormat and the new partition is published to the table after the job completes.
+HCatOutputFormat exposes a Hadoop 0.20 MapReduce API for writing data to a table. When a MapReduce job uses HCatOutputFormat to write output, the default OutputFormat configured for the table is used and the new partition is published to the table after the job completes.
The first call on the HCatOutputFormat must be setOutput; any other call will throw an exception saying the output format is not initialized. The schema for the data being written out is specified by the setSchema method. You must call this method, providing the schema of the data you are writing. If your data has the same schema as the table schema, you can use HCatOutputFormat.getTableSchema() to get the table schema and then pass that along to setSchema().
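A corresponding minimal sketch of the write-side calls, assuming the output table's own schema is reused (the table name is an example; the full program later in this section shows these calls in context):

// setOutput must be the first HCatOutputFormat call on the job.
HCatOutputFormat.setOutput(job, OutputJobInfo.create("default", "processed_logs", null));
// Reuse the table's schema for the records being written.
HCatSchema schema = HCatOutputFormat.getTableSchema(job);
HCatOutputFormat.setSchema(job, schema);
job.setOutputFormatClass(HCatOutputFormat.class);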
The partition schema specified can be different from the current table level schema. The rules about what kinds of schema are allowed are:
- -Running MapReduce with HCatalog
+Your MapReduce program will need to know where the Thrift server is in order to connect to it. The easiest way to do this is to pass the location as an argument to your Java program. You will also need to pass the Hive and HCatalog jars to MapReduce, via the -libjars argument.
+ + +Authentication
+If a failure results in a message like "2010-11-03 16:17:28,225 WARN hive.metastore ... - Unable to connect metastore with URI thrift://..." in /tmp/<username>/hive.log, then make sure you have run "kinit <username>@FOO.COM" to get a Kerberos ticket and to be able to authenticate to the HCatalog server. |
+
Examples
+ +
+The following very simple MapReduce program reads data from one table, which it assumes has an integer in the
+second column, and counts how many rows contain each distinct value of that integer. That is, it does the
+equivalent of select col1, count(*) from $table group by col1;.
+
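A sketch of such a program, using the API described above, follows. The class name and argument handling are illustrative, and the exact method signatures should be checked against your HCatalog release:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hcatalog.data.DefaultHCatRecord;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;
import org.apache.hcatalog.mapreduce.OutputJobInfo;

public class GroupByValue extends Configured implements Tool {

  // Map reads HCatRecords and emits (value-in-second-column, 1).
  public static class Map extends Mapper<WritableComparable, HCatRecord, IntWritable, IntWritable> {
    @Override
    protected void map(WritableComparable key, HCatRecord value, Context context)
        throws IOException, InterruptedException {
      Integer col1 = (Integer) value.get(1);   // assumes an integer in the second column
      context.write(new IntWritable(col1), new IntWritable(1));
    }
  }

  // Reduce counts the rows for each value and writes the result as an HCatRecord.
  public static class Reduce extends Reducer<IntWritable, IntWritable, WritableComparable, HCatRecord> {
    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int count = 0;
      for (IntWritable ignored : values) {
        count++;
      }
      HCatRecord record = new DefaultHCatRecord(2);
      record.set(0, key.get());
      record.set(1, count);
      context.write(null, record);
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    String dbName = null;              // null means the default database
    String inputTableName = args[0];
    String outputTableName = args[1];

    Job job = new Job(conf, "GroupByValue");
    // Read every partition of the input table (the partition filter is null).
    HCatInputFormat.setInput(job, InputJobInfo.create(dbName, inputTableName, null));
    job.setInputFormatClass(HCatInputFormat.class);
    job.setJarByClass(GroupByValue.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(WritableComparable.class);
    job.setOutputValueClass(DefaultHCatRecord.class);

    // The output table is assumed to be unpartitioned, so the partition map is null.
    HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTableName, null));
    HCatSchema schema = HCatOutputFormat.getTableSchema(job);
    HCatOutputFormat.setSchema(job, schema);
    job.setOutputFormatClass(HCatOutputFormat.class);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new GroupByValue(), args));
  }
}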
Notice a number of important points about this program:
+
+1) The implementation of Map takes HCatRecord as an input and the implementation of Reduce produces it as an output.
+
+2) This example program assumes the schema of the input, but it could also retrieve the schema via
+HCatOutputFormat.getOutputSchema() and retrieve fields based on the results of that call.
+
+3) The input descriptor for the table to be read is created by calling InputJobInfo.create. It requires the database name,
+table name, and partition filter. In this example the partition filter is null, so all partitions of the table
+will be read.
+
+4) The output descriptor for the table to be written is created by calling OutputJobInfo.create. It requires the
+database name, the table name, and a Map of partition keys and values that describe the partition being written.
+In this example it is assumed the table is unpartitioned, so this Map is null.
+
To scan just selected partitions of a table, a filter describing the desired partitions can be passed to InputJobInfo.create. This filter can contain the operators '=', '<', '>', '<=', '>=', '<>', 'and', 'or', and 'like'. Assume for example you have a web_logs table that is partitioned by the column datestamp. You could select one partition of the table by changing the partition filter passed to InputJobInfo.create, as sketched below.
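For example (a sketch; dbName and the job follow the read example above, and the filter value is illustrative):

// Read every partition of web_logs:
HCatInputFormat.setInput(job, InputJobInfo.create(dbName, "web_logs", null));

// Read only the partition where datestamp is '20110924':
HCatInputFormat.setInput(job, InputJobInfo.create(dbName, "web_logs", "datestamp = '20110924'"));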
++This filter must reference only partition columns. Values from other columns will cause the job to fail.
+To write to a single partition, change the above example to pass a Map of key/value pairs that describes all of the partition keys and values for that partition. In our example web_logs table there is only one partition column (datestamp), so our Map will have only one entry. Change the null partition-values argument of OutputJobInfo.create to such a Map, as sketched below.
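For example (again a sketch with illustrative values):

// Unpartitioned or multi-partition write: the partition map is null.
HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, "web_logs", null));

// Write into the single partition datestamp = 20110924:
Map<String, String> partitionValues = new HashMap<String, String>();
partitionValues.put("datestamp", "20110924");
HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, "web_logs", partitionValues));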
+To write multiple partitions simultaneously you can leave the Map null, but all of the partitioning columns must be present in the data you are writing. +
+ + -HCatalog can read PigStorage and RCFile formatted files. The input drivers for the formats are PigStorageInputDriver and RCFileInputDriver respectively. HCatalog currently produces only RCFile formatted output. The output driver for the same is RCFileOutputDriver.
+As of version 0.4, HCatalog uses Hive's SerDe class to serialize and deserialize data. SerDes are provided for RCFile, CSV text, JSON text, and SequenceFile formats.
-Hive and HCatalog applications can interoperate (each can read the output of the other) as long as they use a common format. Currently, the only common format is RCFile.
+Users can write SerDes for custom formats using the instructions at https://cwiki.apache.org/confluence/display/Hive/SerDe.
+ +Prerequisites
Throughout these instructions when you see a word in italics it - indicates a place where you should replace the word with a + indicates a place where you should replace the word with an appropriate value such as a hostname or password.
Thrift Server Install
@@ -66,7 +66,7 @@ machine as the Thrift server. For large clusters we recommend that they not be the same machine. For the purposes of these instructions we will refer to this machine as - hcatdb.acme.com + hcatdb.acme.com. Install MySQL server on hcatdb.acme.com. You can obtain
packages for MySQL from MySQL's
@@ -85,7 +85,7 @@
Thrift server config Thrift Server Configuration Now you need to edit your Server activity logs and gc logs are located in
+ Server activity logs are located in
Prerequisites Throughout these instructions when you see a word in italics it
indicates a place where you should replace the word with a locally
appropriate value such as a hostname or password. Building a tarball If you downloaded HCatalog from Apache or another site as a source release,
+ you will need to first build a tarball to install. You can tell if you have
+ a source release by looking at the name of the object you downloaded. If
+ it is named hcatalog-src-0.4.0-incubating.tar.gz (notice the
+ src in the name) then you have a source release. If you do not already have Apache Ant installed on your machine, you
+ will need to obtain it. You can get it from the
+ Apache Ant website. Once you download it, you will need to unpack it
+ somewhere on your machine. The directory where you unpack it will be referred
+ to as ant_home in this document. If you do not already have Apache Forrest installed on your machine, you
+ will need to obtain it. You can get it from the
+ Apache Forrest website. Once you download it, you will need to unpack
+ it somewhere on your machine. The directory where you unpack it will be referred
+ to as forrest_home in this document. To produce a tarball from this do the following: Create a directory to expand the source release in. Copy the source
+ release to that directory and unpack it. Change directories into the unpacked source release and build the
+ installation tarball. ant_home The tarball for installation should now be at
+ Database Setup Select a machine to install the database on. This need not be the same
@@ -65,13 +104,13 @@
In a temporary directory, untar the HCatalog artifact In a temporary directory, untar the HCatalog installation tarball. Use the database installation script found in the package to create the
- database
mysql> quit;
mysql -u hive -D hivemetastoredb -hhcatdb.acme.com -p < /usr/share/hcatalog/scripts/hive-schema-0.7.0.mysql.sql
/etc/hcatalog/hive-site.xml file.
Open this file in your favorite text editor. The following table shows the
values you need to configure.
hive.metastore.uris
- You need to set the hostname to your Thrift
- server. Replace SVRHOST with the name of the
+ Set the hostname of your Thrift
+ server by replacing SVRHOST with the name of the
machine you are installing the Thrift server on.
hive.metastore.sasl.enabled
- Set to false by default. Set to true if its a secure environment.
+ Set to true if you are using Kerberos security with your Hadoop
+ cluster, false otherwise.
hive.metastore.kerberos.keytab.file
- The path to the Kerberos keytab file containg the metastore
- thrift server's service principal. Need to set only in secure enviroment.
+ The path to the Kerberos keytab file containing the metastore
+ Thrift server's service principal. Only required if you set
+ hive.metastore.sasl.enabled above to true.
@@ -142,13 +145,13 @@
hive.metastore.kerberos.principal
- The service principal for the metastore thrift server. You can
- reference your host as _HOST and it will be replaced with
- actual hostname. Need to set only in secure environment.
+ The service principal for the metastore Thrift server. You can
+ reference your host as _HOST and it will be replaced with your
+ actual hostname. Only required if you set
+ hive.metastore.sasl.enabled above to true.
sudo service start hcatalog-server
sudo service hcatalog-server start
/var/log/hcat_server. Logging configuration is located at
/etc/hcatalog/log4j.properties. Server logging uses
DailyRollingFileAppender by default. It will generate a new
@@ -158,7 +161,7 @@
sudo service stop hcatalog-server
sudo service hcatalog-server stop
hive.metastore.uris
- You need to set the hostname wish your Thrift
- server to use by replacing SVRHOST with the name of the
+ Set the hostname of your Thrift
+ server by replacing SVRHOST with the name of the
machine you are installing the Thrift server on.
hive.metastore.sasl.enabled
- Set to false by default. Set to true if its a secure environment.
+ Set to false by default. Set to true if it is a secure environment.
Index: src/docs/src/documentation/content/xdocs/images/hcat-product.jpg
===================================================================
Cannot display: file marked as a binary type.
svn:mime-type = application/octet-stream
Index: src/docs/src/documentation/content/xdocs/install.xml
===================================================================
--- src/docs/src/documentation/content/xdocs/install.xml (revision 1301023)
+++ src/docs/src/documentation/content/xdocs/install.xml (working copy)
@@ -24,23 +24,62 @@
hive.metastore.kerberos.principal
- The service principal for the metastore thrift server. You can
+ The service principal for the metastore Thrift server. You can
reference your host as _HOST and it will be replaced with
actual hostname. Need to set only in secure environment.
+
mkdir /tmp/hcat_source_release
cp hcatalog-src-0.4.0-incubating.tar.gz /tmp/hcat_source_release
cd /tmp/hcat_source_release
tar xzf hcatalog-src-0.4.0-incubating.tar.gz
cd hcatalog-src-0.4.0-incubating
bin/ant -Dhcatalog.version=0.4.0 -Dforrest.home=forrest_home tar
build/hcatalog-0.4.0.tar.gz
mysql> flush privileges;
mysql> quit;
tar xzf hcatalog-version.tar.gz
tar xzf hcatalog-0.4.0.tar.gz
mysql -u hive -D hivemetastoredb -hhcatdb.acme.com -p < share/hcatalog/hive/external/metastore/scripts/upgrade/mysql/hive-schema-0.7.0.mysql.sql
mysql -u hive -D hivemetastoredb -hhcatdb.acme.com -p < share/hcatalog/hive/external/metastore/scripts/upgrade/mysql/hive-schema-0.8.0.mysql.sql
Thrift Server Setup
@@ -88,14 +127,16 @@Select a user to run the Thrift server as. This user should not be a human user, and must be able to act as a proxy for other users. We suggest the name "hcat" for the user. Throughout the rest of this documentation - we will refer to this user as "hcat". If necessary, add the user to + we will refer to this user as hcat. If necessary, add the user to hcatsvr.acme.com.
Select a root directory for your installation of HCatalog. This
- directory must be owned by the hcat user. We recommend
- /usr/local/hcat. If necessary, create the directory.
/usr/local/hcat. If necessary, create the directory. You will
+ need to be the hcat user for the operations described in the remainder
+ of this Thrift Server Setup section.
- Download the HCatalog release into a temporary directory, and untar +
Copy the HCatalog installation tarball into a temporary directory, and untar it. Then change directories into the new distribution and run the HCatalog server installation script. You will need to know the directory you chose as root and the @@ -105,8 +146,8 @@ the port number you wish HCatalog to operate on which you will use to set portnum.
-tar zxf hcatalog-version.tar.gz
- cd hcatalog-version
tar zxf hcatalog-0.4.0.tar.gz
cd hcatalog-0.4.0
share/hcatalog/scripts/hcat_server_install.sh -r root -d dbroot -h hadoop_home -p portnum
Now you need to edit your root/etc/hcatalog/hive-site.xml file.
@@ -126,40 +167,41 @@
-- If default hdfs was specified in core-site.xml, path resolves to HDFS location.
-- Otherwise, path is resolved as local file: URI.
-This setting becomes effective when creating new tables (takes precedence over default DBS.DB_LOCATION_URI at time of table creation).
+This setting becomes effective when creating new tables (it takes precedence over default DBS.DB_LOCATION_URI at the time of table creation).
You can now proceed to starting the server.
Start the HCatalog server by switching directories to
- root and invoking the start script
- share/hcatalog/scripts/hcat_server_start.sh
sbin/hcat_server.sh start
Server activity logs and gc logs are located in +
Server activity logs are located in
root/var/log/hcat_server. Logging configuration is located at
root/conf/log4j.properties. Server logging uses
DailyRollingFileAppender by default. It will generate a new
@@ -200,8 +240,7 @@
To stop the HCatalog server, change directories to the root
- directory and invoke the stop script
- share/hcatalog/scripts/hcat_server_stop.sh
sbin/hcat_server.sh stop
Select a root directory for your installation of HCatalog client.
We recommend /usr/local/hcat. If necessary, create the directory.
Download the HCatalog release into a temporary directory, and untar +
Copy the HCatalog installation tarball into a temporary directory, and untar it.
tar zxf hcatalog-version.tar.gz
-- If default hdfs was specified in core-site.xml, path resolves to HDFS location.
-- Otherwise, path is resolved as local file: URI.
-This setting becomes effective when creating new tables (takes precedence over default DBS.DB_LOCATION_URI at time of table creation).
+This setting becomes effective when creating new tables (it takes precedence over default DBS.DB_LOCATION_URI at the time of table creation).
In HCatalog 2.0 we introduce notifications for certain events happening in the system. This way applications such as Oozie can wait for those events and schedule the work that depends on them. The current version of HCatalog supports two kinds of events:
+Since version 0.2, HCatalog provides notifications for certain events happening in the system. This way applications such as Oozie can wait for those events and schedule the work that depends on them. The current version of HCatalog supports two kinds of events:
No additional work is required to send a notification when a new partition is added: the existing addPartition call will send the notification message. This means that your existing code, when running with 0.2, will automatically send the notifications.
+No additional work is required to send a notification when a new partition is added: the existing addPartition call will send the notification message.
2. Subscribe to a topic you are interested in. When subscribing on a message bus, you need to subscribe to a particular topic to receive the messages that are being delivered on that topic.
The topic name corresponding to a particular table is stored in table properties and can be retrieved using following piece of code:
+The topic name corresponding to a particular table is stored in table properties and can be retrieved using the following piece of code:
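For example, a sketch using the Hive metastore client (the metastore URI, database, and table names are examples; the property key shown is the one HCatalog uses to store the topic name and should be verified against your release):

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;

HiveConf hiveConf = new HiveConf();
hiveConf.set("hive.metastore.uris", "thrift://hcatsvr.acme.com:9083");   // example URI
HiveMetaStoreClient client = new HiveMetaStoreClient(hiveConf);
String topicName = client.getTable("mydb", "web_logs")
                         .getParameters()
                         .get("hcat.msgbus.topic.name");   // assumed property key
client.close();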
You need to have a JMS jar in your classpath to make this work. Additionally, you need to have a JMS provider’s jar in your classpath. HCatalog uses ActiveMQ as a JMS provider. In principle, any JMS provider can be used in client side; however, ActiveMQ is recommended. ActiveMQ can be obtained from: http://activemq.apache.org/activemq-550-release.html
+You need to have a JMS jar in your classpath to make this work. Additionally, you need to have a JMS provider’s jar in your classpath. HCatalog is tested with ActiveMQ as a JMS provider, although any JMS provider can be used. ActiveMQ can be obtained from: http://activemq.apache.org/activemq-550-release.html .
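For example, a sketch of subscribing with ActiveMQ as the JMS provider (the broker URL is an example; topicName is the value retrieved above, and exception handling is omitted):

import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageListener;
import javax.jms.Session;
import javax.jms.Topic;
import org.apache.activemq.ActiveMQConnectionFactory;

ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory("tcp://broker.acme.com:61616");
Connection connection = factory.createConnection();
connection.start();
Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
Topic topic = session.createTopic(topicName);          // topic name retrieved from the table properties
MessageConsumer consumer = session.createConsumer(topic);
consumer.setMessageListener(new MessageListener() {
  public void onMessage(Message message) {
    // React to the notification, for example by kicking off a dependent job.
  }
});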
Sometimes a user wants to wait until a collection of partitions is finished. For example, you may want to start processing after all partitions for a day are done. However, HCatalog has no notion of collections or hierarchies of partitions. To support this, HCatalog allows data writers to signal when they are finished writing a collection of partitions. Data readers may wait for this signal before beginning to read.
+The example code below illustrates how to send a notification when a set of partitions has been added.
+To signal, a data writer does this:
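A sketch of such a signal, using the metastore client's partition-event API (the database, table, and partition values are examples; verify the method and enum names against your release, and note that exception handling is omitted):

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.PartitionEventType;

HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
Map<String, String> partitionSpec = new HashMap<String, String>();
partitionSpec.put("datestamp", "20110924");   // example partition
// Signal that everything for this partition specification has been written.
client.markPartitionForEvent("mydb", "web_logs", partitionSpec, PartitionEventType.LOAD_DONE);
client.close();

A reader can then check for the signal with the corresponding isPartitionMarkedForEvent call, or wait for the JMS message, before starting its work.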
+The HCatalog command line interface (CLI) can be invoked as hcat.
Authentication
-If a failure results in a message like "2010-11-03 16:17:28,225 WARN hive.metastore ... - Unable to connect metastore with URI thrift://..." in /tmp/<username>/hive.log, then make sure you have run "kinit <username>@FOO.COM" to get a kerberos ticket and to be able to authenticate to the HCatalog server. |
-
If other errors occur while using the HCatalog CLI, more detailed messages (if any) are written to /tmp/<username>/hive.log.
The HCatalog CLI supports these command line options:
Note the following:
@@ -67,8 +60,6 @@Assumptions
When using the HCatalog CLI, you cannot specify a permission string without read permissions for owner, such as -wxrwxr-x. If such a permission setting is desired, you can use the octal version instead, which in this case would be 375. Also, any other kind of permission string where the owner has read permissions (for example r-x------ or r--r--r--) will work fine.
- -HCatalog supports a subset of the Hive Data Definition Language. For those commands that are supported, any variances are noted below.
+HCatalog supports all Hive Data Definition Language commands except those operations that require running a MapReduce job. For commands that are supported, any variances are noted below.
+HCatalog does not support the following Hive DDL commands:
+CREATE TABLE
-The STORED AS clause in Hive is:
-The STORED AS clause in HCatalog is:
-CREATE TABLE
-Note the following:
-In this release, HCatalog supports only reading PigStorage formated text files and only writing RCFile formatted files. Therefore, for this release, the command must contain a "STORED AS" clause and either use RCFILE as the file format or specify org.apache.hadoop.hive.ql.io.RCFileInputFormat and org.apache.hadoop.hive.ql.io.RCFileOutputFormat as INPUTFORMAT and OUTPUTFORMAT respectively. |
-
If you create a table with a CLUSTERED BY clause you will not be able to write to it with Pig or MapReduce. This is because they do not understand how to partition the table, so attempting to write to it would cause data corruption.
+CREATE TABLE AS SELECT
-Not supported. Throws an exception with message "Operation Not Supported".
-CREATE TABLE LIKE
-Not supported. Throws an exception with message "Operation Not Supported".
+Not supported. Throws an exception with the message "Operation Not Supported".
+DROP TABLE
Supported. Behavior the same as Hive.
ALTER TABLE
-Note the following:
-ALTER TABLE FILE FORMAT
-Note the following:
-ALTER TABLE Change Column Name/Type/Position/Comment
-Not supported. Throws an exception with message "Operation Not Supported".
+Supported except for the REBUILD and CONCATENATE options. Behavior the same as Hive.
- -ALTER TABLE Add/Replace Columns
-Note the following:
-ALTER TABLE TOUCH
-Not supported. Throws an exception with message "Operation Not Supported".
+ +Note: Pig and MapReduce cannot read from or write to views.
+CREATE VIEW
-Not supported. Throws an exception with message "Operation Not Supported".
+Supported. Behavior same as Hive.
DROP VIEW
-Not supported. Throws an exception with message "Operation Not Supported".
+Supported. Behavior same as Hive.
ALTER VIEW
-Not supported. Throws an exception with message "Operation Not Supported".
+Supported. Behavior same as Hive.
+DESCRIBE
Supported. Behavior same as Hive.
+CREATE and DROP INDEX operations are supported.
+Note: Pig and MapReduce cannot write to a table that has auto rebuild on, because Pig and MapReduce do not know how to rebuild the index.
+CREATE and DROP FUNCTION operations are supported, but created functions must still be registered in Pig and placed in CLASSPATH for MapReduce.
+ +Any command not listed above is NOT supported and throws an exception with message "Operation Not Supported".
+Any command not listed above is NOT supported and throws an exception with the message "Operation Not Supported".
Authentication
+If a failure results in a message like "2010-11-03 16:17:28,225 WARN hive.metastore ... - Unable to connect metastore with URI thrift://..." in /tmp/<username>/hive.log, then make sure you have run "kinit <username>@FOO.COM" to get a Kerberos ticket and to be able to authenticate to the HCatalog server. |
+
If other errors occur while using the HCatalog CLI, more detailed messages are written to /tmp/<username>/hive.log.
+ Index: src/docs/src/documentation/content/xdocs/index.xml =================================================================== --- src/docs/src/documentation/content/xdocs/index.xml (revision 1301023) +++ src/docs/src/documentation/content/xdocs/index.xml (working copy) @@ -25,8 +25,8 @@HCatalog is a table management and storage management layer for Hadoop that enables users with different data processing tools – Pig, MapReduce, Hive, Streaming – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored – RCFile format, text files, sequence files.
-(Note: In this release, Streaming is not supported. Also, HCatalog supports only writing RCFile formatted files and only reading PigStorage formated text files.)
+HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored – RCFile format, text files, or sequence files.
+HCatalog supports reading and writing files in any format for which a SerDe can be written. By default, HCatalog supports RCFile, CSV, JSON, and sequence file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.
HCatalog is built on top of the Hive metastore and incorporates components from the Hive DDL. HCatalog provides read and write interfaces for Pig and MapReduce and a command line interface for data definitions.
-(Note: HCatalog notification is not available in this release.)
+HCatalog is built on top of the Hive metastore and incorporates components from the Hive DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses + Hive's command line interface for issuing data definition and metadata exploration commands.
-The HCatalog interface for Pig – HCatLoader and HCatStorer – is an implementation of the Pig load and store interfaces. HCatLoader accepts a table to read data from; you can indicate which partitions to scan by immediately following the load statement with a partition filter statement. HCatStorer accepts a table to write to and a specification of partition keys to create a new partition. Currently HCatStorer only supports writing to one partition. HCatLoader and HCatStorer are implemented on top of HCatInputFormat and HCatOutputFormat respectively (see HCatalog Load and Store).
+The HCatalog interface for Pig – HCatLoader and HCatStorer – is an implementation of the Pig load and store interfaces. HCatLoader accepts a table to read data from; you can indicate which partitions to scan by immediately following the load statement with a partition filter statement. HCatStorer accepts a table to write to and optionally a specification of partition keys to create a new partition. You can write to a single partition by specifying the partition key(s) and value(s) in the STORE clause; and you can write to multiple partitions if the partition key(s) are columns in the data being stored. HCatLoader and HCatStorer are implemented on top of HCatInputFormat and HCatOutputFormat, respectively (see HCatalog Load and Store).
-The HCatalog interface for MapReduce – HCatInputFormat and HCatOutputFormat – is an implementation of Hadoop InputFormat and OutputFormat. HCatInputFormat accepts a table to read data from and a selection predicate to indicate which partitions to scan. HCatOutputFormat accepts a table to write to and a specification of partition keys to create a new partition. Currently HCatOutputFormat only supports writing to one partition (see HCatalog Input and Output).
+The HCatalog interface for MapReduce – HCatInputFormat and HCatOutputFormat – is an implementation of Hadoop InputFormat and OutputFormat. HCatInputFormat accepts a table to read data from and optionally a selection predicate to indicate which partitions to scan. HCatOutputFormat accepts a table to write to and optionally a specification of partition keys to create a new partition. You can write to a single partition by specifying the partition key(s) and value(s) when setting up the output with OutputJobInfo; and you can write to multiple partitions if the partition key(s) are columns in the data being stored. (See HCatalog Input and Output.)
-Note: Currently there is no Hive-specific interface. Since HCatalog uses Hive's metastore, Hive can read data in HCatalog directly as long as a SerDe for that data already exists. In the future we plan to write a HCatalogSerDe so that users won't need storage-specific SerDes and so that Hive users can write data to HCatalog. Currently, this is supported - if a Hive user writes data in the RCFile format, it is possible to read the data through HCatalog. Also, see Supported data formats.
+Note: There is no Hive-specific interface. Since HCatalog uses Hive's metastore, Hive can read data in HCatalog directly.
-Data is defined using HCatalog's command line interface (CLI). The HCatalog CLI supports most of the DDL portion of Hive's query language, allowing users to create, alter, drop tables, etc. The CLI also supports the data exploration part of the Hive command line, such as SHOW TABLES, DESCRIBE TABLE, etc. (see the HCatalog Command Line Interface).
+Data is defined using HCatalog's command line interface (CLI). The HCatalog CLI supports all Hive DDL that does not require MapReduce to execute, allowing users to create, alter, drop tables, etc. (Unsupported Hive DDL includes import/export, CREATE TABLE AS SELECT, ALTER TABLE options REBUILD and CONCATENATE, and ANALYZE TABLE ... COMPUTE STATISTICS.) The CLI also supports the data exploration part of the Hive command line, such as SHOW TABLES, DESCRIBE TABLE, etc. (see the HCatalog Command Line Interface).
HCatalog presents a relational view of data in HDFS. Data is stored in tables and these tables can be placed in databases. Tables can also be hash partitioned on one or more keys; that is, for a given value of a key (or set of keys) there will be one partition that contains all rows with that value (or set of values). For example, if a table is partitioned on date and there are three days of data in the table, there will be three partitions in the table. New partitions can be added to a table, and partitions can be dropped from a table. Partitioned tables have no partitions at create time. Unpartitioned tables effectively have one default partition that must be created at table creation time. There is no guaranteed read consistency when a partition is dropped.
+HCatalog presents a relational view of data. Data is stored in tables and these tables can be placed in databases. Tables can also be hash partitioned on one or more keys; that is, for a given value of a key (or set of keys) there will be one partition that contains all rows with that value (or set of values). For example, if a table is partitioned on date and there are three days of data in the table, there will be three partitions in the table. New partitions can be added to a table, and partitions can be dropped from a table. Partitioned tables have no partitions at create time. Unpartitioned tables effectively have one default partition that must be created at table creation time. There is no guaranteed read consistency when a partition is dropped.
-Partitions contain records. Once a partition is created records cannot be added to it, removed from it, or updated in it. (In the future some ability to integrate changes to a partition will be added.) Partitions are multi-dimensional and not hierarchical. Records are divided into columns. Columns have a name and a datatype. HCatalog supports the same datatypes as Hive (see HCatalog Load and Store).
+Partitions contain records. Once a partition is created records cannot be added to it, removed from it, or updated in it. Partitions are multi-dimensional and not hierarchical. Records are divided into columns. Columns have a name and a datatype. HCatalog supports the same datatypes as Hive (see HCatalog Load and Store).
This simple data flow example shows how HCatalog is used to move data from the grid into a database. - From the database, the data can then be analyzed using Hive.
+This simple data flow example shows how HCatalog can help grid users share and access data.
First Joe in data acquisition uses distcp to get data onto the grid.
Second Sally in data processing uses Pig to cleanse and prepare the data.
-Without HCatalog, Sally must be manually informed by Joe that data is available, or use Oozie and poll on HDFS.
+Without HCatalog, Sally must be manually informed by Joe when data is available, or poll on HDFS.
With HCatalog, Oozie will be notified by HCatalog data is available and can then start the Pig job
+With HCatalog, a JMS message is sent when the data is available, and the Pig job can then be started.
With HCatalog, Robert does not need to modify the table structure.
The HCatalog IMPORT and EXPORT commands enable you to:
-The output location of the exported dataset is a directory that has the following structure:
-Note that this directory structure can be created using the EXPORT as well as HCatEximOuptutFormat for MapReduce or HCatPigStorer for Pig. And the data can be consumed using the IMPORT command as well as HCatEximInputFormat for MapReduce or HCatPigLoader for Pig.
-Exports a table to a specified location.
- -|
- EXPORT TABLE tablename [PARTITION (partcol1=val1, partcol2=val2, ...)] TO 'filepath' - |
-
|
- TABLE tablename - |
-
- The table to be exported. The table can be a simple table or a partitioned table. -If the table is partitioned, you can specify a specific partition of the table by specifying values for all of the partitioning columns or specifying a subset of the partitions of the table by specifying a subset of the partition column/value specifications. In this case, the conditions are implicitly ANDed to filter the partitions to be exported. - |
-
|
- PARTITION (partcol=val ...) - |
-
- The partition column/value specifications. - |
-
|
- TO 'filepath' - |
-
- The filepath (in single quotes) designating the location for the exported table. The file path can be: -
|
-
The EXPORT command exports a table's data and metadata to the specified location. Because the command actually copies the files defined for the table/partions, you should be aware of the following:
-Also, note the following:
-The examples assume the following tables:
-Example 1
-This example exports the entire table to the target location. The table and the exported copy are now independent; any further changes to the table (data or metadata) do not impact the exported copy. The exported copy can be manipulated/deleted w/o any effect on the table.
-Example 2
-This example exports the entire table including all the partitions' data/metadata to the target location.
-Example 3
-This example exports a subset of the partitions - those which have country = in - to the target location.
-Example 4
-This example exports a single partition - that which has country = in, state = tn - to the target location.
-Imports a table from a specified location.
- -|
- IMPORT [[EXTERNAL] TABLE tablename [PARTITION (partcol1=val1, partcol2=val2, ...)]] FROM 'filepath' [LOCATION 'tablepath'] - |
-
|
- EXTERNAL - |
-
- Indicates that the imported table is an external table. - |
-
|
- TABLE tablename - |
-
- The target to be imported, either a table or a partition. -If the table is partitioned, you can specify a specific partition of the table by specifying values for all of the partitioning columns, or specify all the (exported) partitions by not specifying any of the partition parameters in the command. - |
-
|
- PARTITION (partcol=val ...) - |
-
- The partition column/value specifications. - |
-
|
- FROM 'filepath' - |
-
- The filepath (in single quotes) designating the source location the table will be copied from. The file path can be: -
|
-
|
- LOCATION 'tablepath' - |
-
- (optional) The tablepath (in single quotes) designating the target location the table will be copied to. -If not specified, then: -
|
-
The IMPORT command imports a table's data and metadata from the specified location. The table can be a managed table (data and metadata are both removed on drop table/partition) or an external table (only metadata is removed on drop table/partition). For more information, see Hive's Create/Drop Table.
- -Because the command actually copies the files defined for the table/partions, you should be aware of the following:
-Also, note the following:
-The examples assume the following tables:
-Example 1
-This example imports the table as a managed target table, default location. The metadata is stored in the metastore and the table's data files in the warehouse location of the current database.
- -Example 2
-This example imports the table as a managed target table, default location. The imported table is given a new name.
- - -Example 3
-This example imports the table as an external target table, imported in-place. The metadata is copied to the metastore.
- - -Example 4
-This example imports the table as an external target table, imported to another location. The metadata is copied to the metastore.
- - -Example 5
-This example imports the table as a managed target table, non-default location. The metadata is copied to the metastore.
- - -Example 6
-This example imports all the exported partitions since the source was a partitioned table.
- - -Example 7
-This example imports only the specified partition.
-HCatEximOutputFormat and HCatEximInputFormat can be used in Hadoop environments where there is no HCatalog instance available. HCatEximOutputFormat can be used to create an 'exported table' dataset, which later can be imported into a HCatalog instance. It can also be later read via HCatEximInputFormat or HCatEximLoader.
- -The user can specify the parameters of the table to be created by means of the setOutput method. The metadata and the data files are created in the specified location.
-The target location must be empty and the user must have write access.
-The user specifies the data collection location and optionally a filter for the partitions to be loaded via the setInput method. Optionally, the user can also specify the projection columns via the setOutputSchema method.
-The source location should have the correct layout as for a exported table, and the user should have read access.
-HCatEximStorer and HCatEximLoader can be used in hadoop/pig environments where there is no HCatalog instance available. HCatEximStorer can be used to create an 'exported table' dataset, which later can be imported into a HCatalog instance. It can also be later read via HCatEximInputFormat or HCatEximLoader.
- -The HCatEximStorer is initialized with the output location for the exported table. Optionally the user can specify the partition specification for the data, plus rename the schema elements as part of the storer.
-The rest of the storer semantics use the same design as HCatStorer.
- -Example
-The HCatEximLoader is passed the location of the exported table as usual by the LOAD statement. The loader loads the metadata and data as required from the location. Note that partition filtering is not done efficiently when eximloader is used; the filtering is done at the record level rather than at the file level.
-The rest of the loader semantics use the same design as HCatLoader.
-Example
-Use Case 1
-Transfer data between different HCatalog/hadoop instances, with no renaming of tables.
-Use Case 2
-Transfer data to a hadoop instance which does not have HCatalog and process it there.
-Use Case 3
-Create an exported dataset in a hadoop instance which does not have HCatalog and then import into HCatalog in a different instance.
-The HCatLoader and HCatStorer interfaces are used with Pig scripts to read and write data in HCatalog managed tables. If you run your Pig script using the "pig" command (the bin/pig Perl script) no set up is required.
-If you run your Pig script using the "java" command (java -cp pig.jar...), then the hcat jar needs to be included in the classpath of the java command line (using the -cp option). Additionally, the following properties are required in the command line:
-The HCatLoader and HCatStorer interfaces are used with Pig scripts to read and write data in HCatalog managed tables.
Authentication
-If a failure results in a message like "2010-11-03 16:17:28,225 WARN hive.metastore ... - Unable to connect metastore with URI thrift://..." in /tmp/<username>/hive.log, then make sure you have run "kinit <username>@FOO.COM" to get a kerberos ticket and to be able to authenticate to the HCatalog server. |
-
HCatLoader is accessed via a Pig load statement.
Assumptions
-You must specify the database name and table name using this format: 'dbname.tablename'. Both the database and table must be created prior to running your Pig script. The Hive metastore lets you create tables without specifying a database; if you created tables this way, then the database name is 'default' and the string becomes 'default.tablename'.
+You must specify the table name in single quotes: LOAD 'tablename'. If you are using a non-default database you must specify your input as 'dbname.tablename'. If you are using Pig 0.9.2 or earlier, you must create your database and table prior to running the Pig script. Beginning with Pig 0.10 you can issue these create commands in Pig using the SQL command.
+The Hive metastore lets you create tables without specifying a database; if you + created tables this way, then the database name is 'default' and is not required when + specifying the table for HCatLoader.
If the table is partitioned, you can indicate which partitions to scan by immediately following the load statement with a partition filter statement - (see Examples).
-Restrictions apply to the types of columns HCatLoader can read.
HCatLoader can read only the data types listed in the table. The table shows how Pig will interpret the HCatalog data type.
-(Note: HCatalog does not support type Boolean.)
|
@@ -90,12 +72,12 @@
primitives (int, long, float, double, string) |
- int, long, float, double int, long, float, double, string to chararray |
|
- map (key type should be string, valuetype can be a primitive listed above) +map (key type should be string, valuetype must be string) |
map @@ -103,72 +85,115 @@ |
|
- List<primitive> or List<map> where map is of the type noted above +List<any type> |
- bag, with the primitive or map type as the field in each tuple of the bag +bag |
|
- struct<primitive fields> +struct<any type fields> |
tuple |
|
- List<struct<primitive fields>> - |
-
- bag, where each tuple in the bag maps to struct <primitive fields> - |
-
Pig does not automatically pick up HCatalog jars. You will need to tell Pig where your HCatalog jars are. These include the Hive jars used by the HCatalog client. To do this, you must define the environment variable PIG_CLASSPATH with the appropriate jars. HCat can tell you the jars it needs; in order to do this it needs to know where Hadoop is installed. Also, you need to tell Pig the URI for your metastore, in the PIG_OPTS variable. In the case where you have installed Hadoop and HCatalog via tar, you can do:
+ +If you are using a secure cluster and a failure results in a message like "2010-11-03 16:17:28,225 WARN hive.metastore ... - Unable to connect metastore with URI thrift://..." in /tmp/<username>/hive.log, then make sure you have run "kinit <username>@FOO.COM" to get a Kerberos ticket and to be able to authenticate to the HCatalog server. |
+
This load statement will load all partitions of the specified table.
If only some partitions of the specified table are needed, include a partition filter statement immediately following the load statement. -The filter statement can include conditions on partition as well as non-partition columns.
+If only some partitions of the specified table are needed, include a partition filter statement immediately following the load statement in the data flow. (In the script, however, a filter statement might not immediately follow its load statement.) The filter statement can include conditions on partition as well as non-partition columns.
Certain combinations of conditions on partition and non-partition columns are not allowed in filter statements.
-For example, the following script results in this error message:
-ERROR 1112: Unsupported query: You have an partition column (datestamp ) in a construction like: (pcond and ...) or ( pcond and ...) where pcond is a condition on a partition column.
-A workaround is to restructure the filter condition by splitting it into multiple filter conditions, with the first condition immediately following the load statement.
-
To scan a whole table:
Notice that the schema is automatically provided to Pig; there is no need to declare name and age as fields, as you would if you were loading from a file.
+ +Example of scanning a single partition. Assume the table web_logs is partitioned by the column datestamp:
+ +Pig will push the datestamp filter shown here to HCatalog, so that HCat knows to just scan the partition where datestamp = '20110924'. You can combine this filter with others via 'and':
+ +Pig will split the above filter, pushing the datestamp portion to HCatalog and retaining the 'user is not null' part to apply itself. You can also give a more complex filter to retrieve a set of partitions:
+ +Assumptions
-You must specify the database name and table name using this format: 'dbname.tablename'. Both the database and table must be created prior to running your Pig script. The Hive metastore lets you create tables without specifying a database; if you created tables this way, then the database name is 'default' and string becomes 'default.tablename'.
- -For the USING clause, you can have two string arguments:
-You must specify the table name in single quotes: LOAD 'tablename'. Both the database and table must be created prior to running your Pig script. If you are using a non-default database you must specify your input as 'dbname.tablename'. If you are using Pig 0.9.2 or earlier, you must create your database and table prior to running the Pig script. Beginning with Pig 0.10 you can issue these create commands in Pig using the SQL command.
+The Hive metastore lets you create tables without specifying a database; if you created +tables this way, then the database name is 'default' and you do not need to specify the +database name in the store statement.
+For the USING clause, you can have a string argument that represents key/value pairs for partitions. This is a mandatory argument when you are writing to a partitioned table and the partition column is not among the output columns. The values for partition keys should NOT be quoted.
+If partition columns are present in data they need not be specified as a STORE argument. Instead HCatalog will use these values to place records in the appropriate partition(s). It is valid to specify some partition keys in the STORE statement and have other partition keys in the data.
+ +You can write to non-partitioned table simply by using HCatStorer. The contents of the table will be overwritten:
+ +To add one new partition to a partitioned table, specify the partition value in store function. Pay careful +attention to the quoting, as the whole string must be single quoted and separated with an equals sign:
+ +To write into multiple partitions at one, make sure that the partition column is present in your data, then call +HCatStorer with no argument:
+ +Restrictions apply to the types of columns HCatStorer can write.
HCatStorer can write only the data types listed in the table. The table shows how Pig will interpret the HCatalog data type.
-(Note: HCatalog does not support type Boolean.)
|
@@ -229,15 +273,12 @@
primitives (int, long, float, double, string) |
- int, long, float, double, string int, long, float, double, string to chararray |
|
- map (key type should be string, valuetype can be a primitive listed above) +map (key type should be string, valuetype must be string) |
map @@ -245,28 +286,20 @@ |
|
- List<primitive> or List<map> where map is of the type noted above +List<any type> |
- bag, with the primitive or map type as the field in each tuple of the bag +bag |
|
- struct<primitive fields> +struct<any type fields> |
tuple |
|
- List<struct<primitive fields>> - |
-
- bag, where each tuple in the bag maps to struct <primitive fields> - |
-