diff --git a/src/main/asciidoc/_chapters/spark.adoc b/src/main/asciidoc/_chapters/spark.adoc
index 774d137..ab464fa 100644
--- a/src/main/asciidoc/_chapters/spark.adoc
+++ b/src/main/asciidoc/_chapters/spark.adoc
@@ -62,21 +62,66 @@ on different nodes there is no dependence of co-location. Think of every Spark
 Executor as a multi-threaded client application. This allows any Spark Tasks
 running on the executors to access the shared Connection object.
 
+Be sure the HBase jars can be seen by the Spark shell when it runs. To add
+the HBase jars, do either
+
+[source, bash]
+----
+$ spark-shell --conf "spark.driver.extraClassPath=$(hbase classpath)" --conf "spark.executor.extraClassPath=$(hbase classpath)"
+...
+----
+
+(In the above, 'hbase classpath' runs the $HBASE_HOME/bin/hbase command passing the classpath argument; you
+may need to supply the full path to 'hbase'.)
+
+or
+
+[source, bash]
+----
+$ SPARK_CLASSPATH=`hbase classpath` spark-shell
+...
+----
+
+Also, be aware that the examples below omit imports of referenced classes to
+keep the examples short. Don't forget to add the imports yourself. For example,
+to reference the HBaseConfiguration class in your Spark shell, you first
+must do
+
+[source, scala]
+----
+scala> import org.apache.hadoop.hbase.HBaseConfiguration
+...
+----
+
+To import the HBaseContext, do the following:
+
+[source, scala]
+----
+scala> import org.apache.hadoop.hbase.spark.HBaseContext
+...
+----
+
 .HBaseContext Usage Example
 ====
 This example shows how HBaseContext can be used to do a `foreachPartition` on a RDD
-in Scala:
+in Scala.
 
 [source, scala]
 ----
+// The line below may not be necessary: the Spark shell may already have set up
+// a SparkContext for you. Type 'sc' to see whether it is already defined.
 val sc = new SparkContext("local", "test")
+import org.apache.hadoop.hbase.HBaseConfiguration
 val config = new HBaseConfiguration()
 ...
+import org.apache.hadoop.hbase.spark.HBaseContext
 val hbaseContext = new HBaseContext(sc, config)
 
+// Presume there is an hbase-site.xml on the CLASSPATH with configuration
+// that points at a running cluster. If you need your own cluster connection,
+// get one from ConnectionFactory; note that hbaseForeachPartition below hands
+// its own Connection to the closure, so this one is not required for it.
+val conn = org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(config)
+
+// 'rdd' is an RDD of records created earlier, e.g. with sc.parallelize (not
+// shown). In the closure, 'it' iterates the records of a partition and 'conn'
+// is the Connection that hbaseForeachPartition supplies.
 rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
  val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
  it.foreach((putRecord) => {
@@ -687,4 +732,4 @@ The date frame `df` returned by `withCatalog` function could be used to access t
 After loading df DataFrame, users can query data. registerTempTable registers df DataFrame
 as a temporary table using the table name avrotable. `sqlContext.sql` function allows the
 user to execute SQL queries.
-====
\ No newline at end of file
+====
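
Putting the pieces from the first hunk together, a complete spark-shell session for the `hbaseForeachPartition` example might look like the following. This is a minimal sketch rather than the chapter's exact code: the table name 't1' comes from the fragment above, while the column family 'cf', the qualifier 'col', and the sample rows are hypothetical, and it assumes the HBaseRDDFunctions implicits from the hbase-spark module are in scope to add `hbaseForeachPartition` to ordinary RDDs.

[source, scala]
----
// A minimal sketch, not the chapter's exact code. Table 't1' with column
// family 'cf' is assumed to exist; the sample rows are made up.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
// Implicits that add hbaseForeachPartition (and friends) to ordinary RDDs.
import org.apache.hadoop.hbase.spark.HBaseRDDFunctions._
import org.apache.hadoop.hbase.util.Bytes

// 'sc' is the SparkContext the shell already provides (or one created as above).
val config = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(sc, config)

// Hypothetical sample data: (rowkey, value) pairs.
val rdd = sc.parallelize(Seq(("row1", "value1"), ("row2", "value2")))

rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
  // 'conn' is the Connection the HBaseContext hands to the closure.
  val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
  it.foreach { case (rowKey, value) =>
    val put = new Put(Bytes.toBytes(rowKey))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
    bufferedMutator.mutate(put)
  }
  // Flush any buffered Puts before the partition finishes.
  bufferedMutator.flush()
  bufferedMutator.close()
})
----

The BufferedMutator batches Puts client-side, so flushing and closing it inside the closure ensures nothing remains buffered when the partition's work is done.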