Index: src/docs/src/documentation/content/xdocs/inputoutput.xml
===================================================================
--- src/docs/src/documentation/content/xdocs/inputoutput.xml (revision 1311550)
+++ src/docs/src/documentation/content/xdocs/inputoutput.xml (working copy)
@@ -148,8 +148,29 @@
export HADOOP_HOME=<path_to_hadoop_install>
export HCAT_HOME=<path_to_hcat_install>
-export LIB_JARS=$HCAT_HOME/share/hcatalog/hcatalog-0.4.0.jar,$HCAT_HOME/share/hcatalog/lib/hive-metastore-0.8.1.jar,$HCAT_HOME/share/hcatalog/lib/libthrift-0.7.0.jar,$HCAT_HOME/share/hcatalog/lib/hive-exec-0.8.1.jar,$HCAT_HOME/share/hcatalog/lib/libfb303-0.7.0.jar,$HCAT_HOME/share/hcatalog/lib/jdo2-api-2.3-ec.jar,$HCAT_HOME/share/hcatalog/lib/slf4j-api-1.6.1.jar,$HCAT_HOME/share/hcatalog/lib/antlr-runtime-3.0.1.jar,$HCAT_HOME/share/hcatalog/lib/datanucleus-connectionpool-2.0.3.jar,$HCAT_HOME/share/hcatalog/lib/datanucleus-core-2.0.3.jar,$HCAT_HOME/share/hcatalog/lib/datanucleus-enhancer-2.0.3.jar,$HCAT_HOME/share/hcatalog/lib/datanucleus-rdbms-2.0.3.jar,$HCAT_HOME/share/hcatalog/lib/commons-dbcp-1.4.jar,$HCAT_HOME/share/hcatalog/lib/commons-pool-1.5.4.jar
-export HADOOP_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-0.4.0.jar:$HCAT_HOME/share/hcatalog/lib/hive-metastore-0.8.1.jar:$HCAT_HOME/share/hcatalog/lib/libthrift-0.7.0.jar:$HCAT_HOME/share/hcatalog/lib/hive-exec-0.8.1.jar:$HCAT_HOME/share/hcatalog/lib/libfb303-0.7.0.jar:$HCAT_HOME/share/hcatalog/lib/jdo2-api-2.3-ec.jar:$HCAT_HOME/share/hcatalog/lib/slf4j-api-1.6.1.jar:$HCAT_HOME/share/hcatalog/lib/antlr-runtime-3.0.1.jar:$HCAT_HOME/share/hcatalog/lib/datanucleus-connectionpool-2.0.3.jar:$HCAT_HOME/share/hcatalog/lib/datanucleus-core-2.0.3.jar:$HCAT_HOME/share/hcatalog/lib/datanucleus-enhancer-2.0.3.jar:$HCAT_HOME/share/hcatalog/lib/datanucleus-rdbms-2.0.3.jar:$HCAT_HOME/share/hcatalog/lib/commons-dbcp-1.4.jar:$HCAT_HOME/share/hcatalog/lib/commons-pool-1.5.4.jar:$HCAT_HOME/etc/hcatalog
+export LIB_JARS=$HCAT_HOME/share/hcatalog/hcatalog-0.4.0.jar,
+$HCAT_HOME/share/hcatalog/lib/hive-metastore-0.8.1.jar,$HCAT_HOME/share/hcatalog/lib/libthrift-0.7.0.jar,
+$HCAT_HOME/share/hcatalog/lib/hive-exec-0.8.1.jar,$HCAT_HOME/share/hcatalog/lib/libfb303-0.7.0.jar,
+$HCAT_HOME/share/hcatalog/lib/jdo2-api-2.3-ec.jar,$HCAT_HOME/share/hcatalog/lib/slf4j-api-1.6.1.jar,
+$HCAT_HOME/share/hcatalog/lib/antlr-runtime-3.0.1.jar,
+$HCAT_HOME/share/hcatalog/lib/datanucleus-connectionpool-2.0.3.jar,
+$HCAT_HOME/share/hcatalog/lib/datanucleus-core-2.0.3.jar,
+$HCAT_HOME/share/hcatalog/lib/datanucleus-enhancer-2.0.3.jar,
+$HCAT_HOME/share/hcatalog/lib/datanucleus-rdbms-2.0.3.jar,
+$HCAT_HOME/share/hcatalog/lib/commons-dbcp-1.4.jar,
+$HCAT_HOME/share/hcatalog/lib/commons-pool-1.5.4.jar
+export HADOOP_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-0.4.0.jar:
+$HCAT_HOME/share/hcatalog/lib/hive-metastore-0.8.1.jar:$HCAT_HOME/share/hcatalog/lib/libthrift-0.7.0.jar:
+$HCAT_HOME/share/hcatalog/lib/hive-exec-0.8.1.jar:$HCAT_HOME/share/hcatalog/lib/libfb303-0.7.0.jar:
+$HCAT_HOME/share/hcatalog/lib/jdo2-api-2.3-ec.jar:$HCAT_HOME/share/hcatalog/lib/slf4j-api-1.6.1.jar:
+$HCAT_HOME/share/hcatalog/lib/antlr-runtime-3.0.1.jar:
+$HCAT_HOME/share/hcatalog/lib/datanucleus-connectionpool-2.0.3.jar:
+$HCAT_HOME/share/hcatalog/lib/datanucleus-core-2.0.3.jar:
+$HCAT_HOME/share/hcatalog/lib/datanucleus-enhancer-2.0.3.jar:
+$HCAT_HOME/share/hcatalog/lib/datanucleus-rdbms-2.0.3.jar:
+$HCAT_HOME/share/hcatalog/lib/commons-dbcp-1.4.jar:
+$HCAT_HOME/share/hcatalog/lib/commons-pool-1.5.4.jar:
+$HCAT_HOME/etc/hcatalog
$HADOOP_HOME/bin/hadoop --config $HADOOP_HOME/conf jar <path_to_jar> <main_class> -libjars $LIB_JARS <program_arguments>
@@ -162,7 +183,7 @@
-

Examples

+

Read Example

The following very simple MapReduce program reads data from one table which it assumes to have an integer in the
@@ -182,7 +203,8 @@
    protected void map(
        WritableComparable key,
        HCatRecord value,
-        org.apache.hadoop.mapreduce.Mapper<WritableComparable, HCatRecord, IntWritable, IntWritable>.Context context)
+        org.apache.hadoop.mapreduce.Mapper<WritableComparable, HCatRecord,
+            IntWritable, IntWritable>.Context context)
        throws IOException, InterruptedException {
        age = (Integer) value.get(1);
        context.write(new IntWritable(age), new IntWritable(1));
@@ -194,9 +216,12 @@
    @Override
-    protected void reduce(IntWritable key, java.lang.Iterable<IntWritable>
-        values, org.apache.hadoop.mapreduce.Reducer<IntWritable,IntWritable,WritableComparable,HCatRecord>.Context context)
-        throws IOException ,InterruptedException {
+    protected void reduce(
+        IntWritable key,
+        java.lang.Iterable<IntWritable> values,
+        org.apache.hadoop.mapreduce.Reducer<IntWritable, IntWritable,
+            WritableComparable, HCatRecord>.Context context)
+        throws IOException, InterruptedException {
        int sum = 0;
        Iterator<IntWritable> iter = values.iterator();
        while (iter.hasNext()) {
@@ -249,26 +274,36 @@
    }
-

-Notice a number of important points about this program:
-
-1) The implementation of Map takes HCatRecord as an input and the implementation of Reduce produces it as an output.
-
-2) This example program assumes the schema of the input, but it could also retrieve the schema via
-HCatOutputFormat.getOutputSchema() and retrieve fields based on the results of that call.
-
-3) The input descriptor for the table to be read is created by calling InputJobInfo.create. It requires the database name,
-table name, and partition filter. In this example the partition filter is null, so all partitions of the table
-will be read.
-
-4) The output descriptor for the table to be written is created by calling OutputJobInfo.create. It requires the
-database name, the table name, and a Map of partition keys and values that describe the partition being written.
-In this example it is assumed the table is unpartitioned, so this Map is null.
+
+Notice a number of important points about this program:
+
+1. The implementation of Map takes HCatRecord as an input and the implementation of Reduce produces it as an output.
+2. This example program assumes the schema of the input, but it could also retrieve the schema via
+HCatOutputFormat.getOutputSchema() and retrieve fields based on the results of that call.
+3. The input descriptor for the table to be read is created by calling InputJobInfo.create. It requires the database name,
+table name, and partition filter. In this example the partition filter is null, so all partitions of the table
+will be read.
+4. The output descriptor for the table to be written is created by calling OutputJobInfo.create. It requires the
+database name, the table name, and a Map of partition keys and values that describe the partition being written.
+In this example it is assumed the table is unpartitioned, so this Map is null (see the job-setup sketch below).
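
The driver that wires these pieces together lies outside the hunks shown above. Purely as a rough sketch, not part of this patch, the job setup that points 3 and 4 describe might look like the following, assuming HCatalog 0.4, a run() method in a driver class, and placeholder names (GroupByAge, dbName, inputTableName, outputTableName, and mapper/reducer classes named Map and Reduce); the HCatOutputFormat.setOutput and setSchema calls are assumed rather than shown in this document:

// Sketch only: placeholder names and assumed HCatalog 0.4 calls, may differ from the full example.
Job job = new Job(getConf(), "GroupByAge");

// Point 3: input descriptor, built from database, table, and a null partition filter (read all partitions).
HCatInputFormat.setInput(job, InputJobInfo.create(dbName, inputTableName, null));
job.setInputFormatClass(HCatInputFormat.class);

job.setJarByClass(GroupByAge.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(WritableComparable.class);
job.setOutputValueClass(DefaultHCatRecord.class);

// Point 4: output descriptor, built from database, table, and a null Map because the table is unpartitioned.
// For a partitioned table this Map would carry the partition keys and values, for example a single
// entry mapping datestamp to 20120401 (hypothetical values), as the Write Filter section describes.
HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTableName, null));
// Reuse the table's own schema for the records being written (assumed API, compare point 2).
HCatOutputFormat.setSchema(job, HCatOutputFormat.getTableSchema(job));
job.setOutputFormatClass(HCatOutputFormat.class);

return job.waitForCompletion(true) ? 0 : 1;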

To scan just selected partitions of a table, a filter describing the desired partitions can be passed to
-InputJobInfo.create. This filter can contain the operators '=', '<', '>', '<=',
-'>=', '<>', 'and', 'or', and 'like'. Assume for example you have a web_logs
-table that is partitioned by the column datestamp. You could select one partition of the table by changing
+InputJobInfo.create. To scan a single partition, the filter string should look like "datestamp=20120401", where
+datestamp is the partition column name and 20120401 is the value you want to read.

+
+

Filter Operators

+
+

+A filter can contain the operators 'and', 'or', 'like', '()', '=', '<>' (not equal), '<', '>', '<='
+and '>='. For example:

+
+
+
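
The example this sentence introduces does not appear in the hunk. Purely as an illustration, not taken from the patch, filter strings built from the operators above could look like the following; datestamp is the partition column used in the next section, the values are made up, and the exact quoting of values may depend on the column type:

// Illustrative filter strings only; they follow the "datestamp=20120401" form shown above.
String oneDay    = "datestamp = 20120401";                              // a single partition
String range     = "datestamp >= 20120401 and datestamp <= 20120430";   // an inclusive range
String twoDays   = "datestamp = 20120401 or datestamp = 20120402";      // either of two partitions
String notOneDay = "datestamp <> 20120415";                             // everything except one day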

Scan Filter

+
+

Assume for example you have a web_logs table that is partitioned by the column datestamp. You could select one partition of the table by changing

HCatInputFormat.setInput(job, InputJobInfo.create(dbName, inputTableName, null));
@@ -281,6 +316,8 @@

This filter must reference only partition columns. Values from other columns will cause the job to fail.
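
The filtered version of this call sits outside the hunks shown. As a sketch only of what it might look like, using the filter-string form given earlier (the datestamp value is illustrative, not taken from the patch):

// Sketch only: the same setInput call as above, with a partition filter in place of null,
// so only the partitions of the table that match the filter are scanned.
HCatInputFormat.setInput(job,
    InputJobInfo.create(dbName, inputTableName, "datestamp = 20120401"));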

+
+

Write Filter

To write to a single partition, you can change the above example to have a Map of key-value pairs that describe all
of the partition keys and values for that partition. In our example web_logs table, there is only one partition
Index: src/docs/src/documentation/content/xdocs/loadstore.xml
===================================================================
--- src/docs/src/documentation/content/xdocs/loadstore.xml (revision 1311550)
+++ src/docs/src/documentation/content/xdocs/loadstore.xml (working copy)
@@ -28,7 +28,7 @@
Set Up

The HCatLoader and HCatStorer interfaces are used with Pig scripts to read and write data in HCatalog managed tables.

-

Authentication

+
@@ -115,12 +115,22 @@
export HADOOP_HOME=<path_to_hadoop_install>
export HCAT_HOME=<path_to_hcat_install>
-PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-0.4.0.jar:$HCAT_HOME/share/hcatalog/lib/hive-metastore-0.8.1.jar:$HCAT_HOME/share/hcatalog/lib/libthrift-0.7.0.jar:$HCAT_HOME/share/hcatalog/lib/hive-exec-0.8.1.jar:$HCAT_HOME/share/hcatalog/lib/libfb303-0.7.0.jar:$HCAT_HOME/share/hcatalog/lib/jdo2-api-2.3-ec.jar:$HCAT_HOME/etc/hcatalog:$HADOOP_HOME/conf:$HCAT_HOME/share/hcatalog/lib/slf4j-api-1.6.1.jar
+PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-0.4.0.jar:$HCAT_HOME/share/hcatalog/lib/
+hive-metastore-0.8.1.jar:$HCAT_HOME/share/hcatalog/lib/libthrift-0.7.0.jar:$HCAT_HOME/
+share/hcatalog/lib/hive-exec-0.8.1.jar:$HCAT_HOME/share/hcatalog/lib/libfb303-0.7.0.jar:
+$HCAT_HOME/share/hcatalog/lib/jdo2-api-2.3-ec.jar:$HCAT_HOME/etc/hcatalog:$HADOOP_HOME/
+conf:$HCAT_HOME/share/hcatalog/lib/slf4j-api-1.6.1.jar
export PIG_OPTS=-Dhive.metastore.uris=thrift://<hostname>:<port>
-<path_to_pig_install>/bin/pig -Dpig.additional.jars=$HCAT_HOME/share/hcatalog/hcatalog-0.4.0.jar:$HCAT_HOME/share/hcatalog/lib/hive-metastore-0.8.1.jar:$HCAT_HOME/share/hcatalog/lib/libthrift-0.7.0.jar:$HCAT_HOME/share/hcatalog/lib/hive-exec-0.8.1.jar:$HCAT_HOME/share/hcatalog/lib/libfb303-0.7.0.jar:$HCAT_HOME/share/hcatalog/lib/jdo2-api-2.3-ec.jar:$HCAT_HOME/etc/hcatalog:$HCAT_HOME/share/hcatalog/lib/slf4j-api-1.6.1.jar <script.pig>
+<path_to_pig_install>/bin/pig -Dpig.additional.jars=$HCAT_HOME/share/hcatalog/
+hcatalog-0.4.0.jar:$HCAT_HOME/share/hcatalog/lib/hive-metastore-0.8.1.jar:$HCAT_HOME/
+share/hcatalog/lib/libthrift-0.7.0.jar:$HCAT_HOME/share/hcatalog/lib/hive-exec-0.8.1.jar:
+$HCAT_HOME/share/hcatalog/lib/libfb303-0.7.0.jar:$HCAT_HOME/share/hcatalog/lib/jdo2-
+api-2.3-ec.jar:$HCAT_HOME/etc/hcatalog:$HCAT_HOME/share/hcatalog/lib/slf4j-api-1.6.1.jar
+ <script.pig>
+

Authentication

@@ -148,17 +158,15 @@
A = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader();
-- date is a partition column; age is not
-B = filter A by date == '20100819' and age < 30;
-- both date and country are partition columns
-C = filter A by date == '20100819' and country == 'US';
...
...
-

-To scan a whole table:

+

+To scan a whole table, for example:

a = load 'student_data' using org.apache.hcatalog.pig.HCatLoader();
@@ -169,14 +177,14 @@

Notice that the schema is automatically provided to Pig; there's no need to declare name and age as fields, as you would if you were loading from a file.

-

-Example of scanning a single partition. Assume the table web_logs is partitioned by the column datestamp:

+

+To scan a single partition of the table web_logs, which is partitioned by the column datestamp, for example:

a = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
b = filter a by datestamp == '20110924';
-

-Pig will push the datestamp filter shown here to HCatalog, so that HCat knows to just scan the partition where
+Pig will push the datestamp filter shown here to HCatalog, so that HCatalog knows to just scan the partition where
datestamp = '20110924'. You can combine this filter with others via 'and':

@@ -184,14 +192,40 @@
b = filter a by datestamp == '20110924' and user is not null;

-

-Pig will split the above filter, pushing the datestamp portion to HCatalog and retaining the user is not null part
-to apply itself. You can also give a more complex filter to retrieve a set of partitions:

+
+Pig will split the above filter, pushing the datestamp portion to HCatalog and retaining the user is not null part
+to apply itself. You can also give a more complex filter to retrieve a set of partitions.
+
+Filter Operators
+
+A filter can contain the operators 'and', 'or', '()', '==', '!=', '<', '>', '<='
+and '>='.
+
+For example:
+
+a = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
+b = filter a by datestamp > '20110924';
+
+A complex filter can have various combinations of operators, such as:
+
+a = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
+b = filter a by datestamp == '20110924' or datestamp == '20110925';
+
+These two examples have the same effect:
+
+a = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
+b = filter a by datestamp >= '20110924' and datestamp <= '20110925';
+
+a = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
+b = filter a by datestamp <= '20110925' and datestamp >= '20110924';
+
@@ -247,8 +281,8 @@

To write into multiple partitions at once, make sure that the partition column is present in your data, then call HCatStorer with no argument:

-store z into 'web_data' using org.apache.hcatalog.pig.HCatStorer(); -- datestamp
-must be a field in the relation z
+store z into 'web_data' using org.apache.hcatalog.pig.HCatStorer();
+ -- datestamp must be a field in the relation z

If you are using a secure cluster and a failure results in a message like "2010-11-03 16:17:28,225 WARN hive.metastore ... - Unable to connect metastore with URI thrift://..." in /tmp/<username>/hive.log, then make sure you have run "kinit <username>@FOO.COM" to get a Kerberos ticket and to be able to authenticate to the HCatalog server.