Index: src/docs/src/documentation/content/xdocs/inputoutput.xml
===================================================================
--- src/docs/src/documentation/content/xdocs/inputoutput.xml (revision 1377597)
+++ src/docs/src/documentation/content/xdocs/inputoutput.xml (working copy)
@@ -33,21 +33,27 @@
The HCatInputFormat is used with MapReduce jobs to read data from HCatalog-managed tables. HCatInputFormat exposes a Hadoop 0.20 MapReduce API for reading data as if it had been published to a table. The API exposed by HCatInputFormat is shown below. It includes:

    setInput
    setOutputSchema
    getTableSchema

To use HCatInputFormat to read data, first instantiate an InputJobInfo with the necessary information from the table being read and then call setInput with the InputJobInfo.

You can use the setOutputSchema method to include a projection schema, to specify the output fields. If a schema is not specified, all the columns in the table will be returned.

You can use the getTableSchema method to determine the table schema for a specified input table.

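A minimal sketch of this read-side setup is shown below. The database and table names are placeholders, the imports assume the org.apache.hcatalog.* package layout, and the three-argument InputJobInfo.create(databaseName, tableName, filter) call is an assumption about this release's API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;

public class ReadSetupSketch {
    public static Job configureRead(Configuration conf) throws Exception {
        Job job = new Job(conf, "hcatalog read sketch");
        // Instantiate an InputJobInfo for the table being read (placeholder names),
        // then hand it to setInput. A null filter scans all partitions of the table.
        HCatInputFormat.setInput(job, InputJobInfo.create("mydb", "web_logs", null));
        // setOutputSchema(job, projectionSchema) could be called here to restrict the
        // columns returned; without it, all columns in the table are returned.
        job.setInputFormatClass(HCatInputFormat.class);
        return job;
    }
}
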
HCatOutputFormat is used with MapReduce jobs to write data to HCatalog-managed tables. HCatOutputFormat exposes a Hadoop 0.20 MapReduce API for writing data to a table. When a MapReduce job uses HCatOutputFormat to write output, the default OutputFormat configured for the table is used and the new partition is published to the table after the job completes. The API exposed by HCatOutputFormat is shown below. It includes:

    setOutput
    setSchema
    getTableSchema

The first call on the HCatOutputFormat must be setOutput; any other call will throw an exception saying the output format is not initialized. The schema for the data being written out is specified by the setSchema method. You must call this method, providing the schema of data you are writing. If your data has the same schema as the table schema, you can use HCatOutputFormat.getTableSchema() to get the table schema and then pass that along to setSchema().

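A corresponding write-side sketch, again with placeholder names; it follows the call order described above (setOutput first, then setSchema), and OutputJobInfo.create(databaseName, tableName, partitionValues) is an assumption about this release's API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hcatalog.mapreduce.OutputJobInfo;

public class WriteSetupSketch {
    public static Job configureWrite(Configuration conf) throws Exception {
        Job job = new Job(conf, "hcatalog write sketch");
        // setOutput must be the first HCatOutputFormat call. A null partition-values
        // map is used here; the Write Filter section below shows supplying explicit
        // partition keys and values.
        HCatOutputFormat.setOutput(job, OutputJobInfo.create("mydb", "web_logs_summary", null));
        // The data being written has the same schema as the table, so reuse the table schema.
        HCatSchema schema = HCatOutputFormat.getTableSchema(job);
        HCatOutputFormat.setSchema(job, schema);
        job.setOutputFormatClass(HCatOutputFormat.class);
        return job;
    }
}
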
Running MapReduce with HCatalog

Your MapReduce program will need to know where the Thrift server it should connect to is located. The easiest way to do this is to pass the location as an argument to your Java program. You will need to pass the Hive and HCatalog jars to MapReduce as well, via the -libjars argument.

This works but Hadoop will ship libjars every time you run the MapReduce program, treating the files as different cache entries, which is not efficient and may deplete the Hadoop distributed cache.

Instead, you can optimize by shipping the libjars from HDFS locations. By doing this, Hadoop will reuse the entries in the distributed cache.

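For example, a sketch of this optimization. The paths, jar names, and Thrift URI below are placeholders; it assumes a Hadoop version whose -libjars option accepts HDFS URIs (which is what the reuse described above relies on) and a main class that parses generic options via ToolRunner.

# Copy the Hive and HCatalog dependency jars to HDFS once.
hadoop fs -mkdir /apps/libjars
hadoop fs -put $HCAT_HOME/share/hcatalog/hcatalog-core.jar $HIVE_HOME/lib/hive-metastore.jar /apps/libjars/

# Reference the HDFS copies so the distributed cache entries are reused across runs.
export LIBJARS=hdfs:///apps/libjars/hcatalog-core.jar,hdfs:///apps/libjars/hive-metastore.jar
hadoop jar myjob.jar com.example.MyJob -libjars $LIBJARS thrift://metastorehost:9083 mydb web_logs
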
Authentication

Read Example

The following very simple MapReduce program reads data from one table, which it assumes to have an integer in the second column, and counts how many different values it sees. That is, it does the equivalent of "select col1, count(*) from $table group by col1;".

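A minimal sketch of such a program is shown below. The class name, table names, and argument handling are placeholders, the imports assume the org.apache.hcatalog.* packages, and it also writes its counts to a second HCatalog-managed table, as the Write Filter section below assumes.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hcatalog.data.DefaultHCatRecord;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;
import org.apache.hcatalog.mapreduce.OutputJobInfo;

public class GroupByCountSketch extends Configured implements Tool {

    public static class Map extends Mapper<WritableComparable, HCatRecord, IntWritable, IntWritable> {
        @Override
        protected void map(WritableComparable key, HCatRecord value, Context context)
                throws IOException, InterruptedException {
            // The second column (index 1) is assumed to hold an integer.
            int col1 = (Integer) value.get(1);
            context.write(new IntWritable(col1), new IntWritable(1));
        }
    }

    public static class Reduce extends Reducer<IntWritable, IntWritable, WritableComparable, HCatRecord> {
        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable ignored : values) {
                count++;
            }
            // Emit (value, count) as an HCatRecord; the output table is assumed to have two columns.
            HCatRecord record = new DefaultHCatRecord(2);
            record.set(0, key.get());
            record.set(1, count);
            context.write(null, record);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();
        String dbName = null;               // null selects the default database
        String inputTableName = args[0];    // e.g. web_logs
        String outputTableName = args[1];   // e.g. web_logs_summary

        Job job = new Job(conf, "GroupByCountSketch");
        // A null filter reads all partitions; the filter discussion below shows restricting the scan.
        HCatInputFormat.setInput(job, InputJobInfo.create(dbName, inputTableName, null));
        job.setJarByClass(GroupByCountSketch.class);
        job.setInputFormatClass(HCatInputFormat.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(WritableComparable.class);
        job.setOutputValueClass(DefaultHCatRecord.class);

        // setOutput must precede the other HCatOutputFormat calls.
        HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTableName, null));
        HCatSchema schema = HCatOutputFormat.getTableSchema(job);
        HCatOutputFormat.setSchema(job, schema);
        job.setOutputFormatClass(HCatOutputFormat.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new GroupByCountSketch(), args));
    }
}
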
To scan just selected partitions of a table, a filter describing the desired partitions can be passed to InputJobInfo.create. To scan a single partition, the filter string should look like: "ds=20120401" where the datestamp "ds" is the partition column name and "20120401" is the value you want to read (year, month, and day).

Filter Operators

A filter can contain the operators 'and', 'or', 'like', '()', '=', '<>' (not equal), '<', '>', '<=' and '>='.
For example:

    ds > "20110924"
    ds < "20110925"
    ds <= "20110925" and ds >= "20110924"

Scan Filter

Assume for example you have a web_logs table that is partitioned by the column "ds". You could select one partition of the table by changing the filter passed to InputJobInfo.create, as in the sketch below.

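A sketch of that change, reusing the job and dbName variables from the read example above; the partition value is quoted in the style of the operator examples.

// Before: a null filter scans every partition of web_logs.
HCatInputFormat.setInput(job, InputJobInfo.create(dbName, "web_logs", null));

// After: the filter limits the scan to the single ds=20110924 partition.
HCatInputFormat.setInput(job, InputJobInfo.create(dbName, "web_logs", "ds=\"20110924\""));
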
This filter must reference only partition columns. Values from other columns will cause the job to fail.

Write Filter

To write to a single partition you can change the above example to have a Map of key value pairs that describe all of the partition keys and values for that partition. In our example web_logs table, there is only one partition column (ds), so our Map will have only one entry. Change the partition values passed to setOutput from null to this Map, as in the sketch below.

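A sketch of that change, again reusing names from the read example above and assuming the OutputJobInfo.create(databaseName, tableName, partitionValues) form of the API.

// Before: no explicit partition values are given to setOutput.
HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTableName, null));

// After: a one-entry map names the single partition column and the partition to write to.
java.util.Map<String, String> partitionValues = new java.util.HashMap<String, String>();
partitionValues.put("ds", "20110924");
HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTableName, partitionValues));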