Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: connector-1.0.0
- Fix Version/s: None
Description
In some cases a Spark DataFrame cannot be read back with the same column mapping it was written with. For example:
val sql = spark.sqlContext
val persons = """[
    |{"name": "alice", "age": 20, "height": 5, "email": "alice@alice.com"},
    |{"name": "bob", "age": 23, "height": 6, "email": "bob@bob.com"},
    |{"name": "carol", "age": 12, "email": "carol@carol.com", "height": 4.11}
    |]
  """.stripMargin
val df = spark.read.json(Seq(persons).toDS)
df.write
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.columns.mapping",
    "name STRING :key, age SHORT p:age, email STRING c:email, height FLOAT p:height")
  .option("hbase.table", "person")
  .option("hbase.spark.use.hbasecontext", false)
  .save()
It cannot be read back with the same mapping:
val df2 = sql.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.columns.mapping",
    "name STRING :key, age SHORT p:age, email STRING c:email, height FLOAT p:height")
  .option("hbase.table", "person")
  .option("hbase.spark.use.hbasecontext", false)
  .load()
df2.createOrReplaceTempView("tableView")
val results = sql.sql("SELECT * FROM tableView")
results.show()
The results:
+---+-----+---------+---------------+
|age| name|   height|          email|
+---+-----+---------+---------------+
|  0|alice|   2.3125|alice@alice.com|
|  0|  bob|    2.375|    bob@bob.com|
|  0|carol|2.2568748|carol@carol.com|
+---+-----+---------+---------------+
Spark's JSON reader infers integer values as longs and floating-point values as doubles, so on the write path the SHORT columns are stored as 8-byte longs and the FLOAT columns as 8-byte doubles in HBase:
shell> scan 'person'
 alice  column=p:age, timestamp=1563450714829, value=\x00\x00\x00\x00\x00\x00\x00\x14
 alice  column=p:height, timestamp=1563450714829, value=@\x14\x00\x00\x00\x00\x00\x00
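The corrupted values in the result table follow directly from this width mismatch: the read path decodes only the leading 2 bytes of the stored long as a SHORT, and the leading 4 bytes of the stored double as a FLOAT. This can be reproduced outside Spark with plain java.nio (a minimal sketch assuming the big-endian layout shown in the scan output; the object and variable names are illustrative, not the connector's own):

```scala
import java.nio.ByteBuffer

object ByteWidthMismatch extends App {
  // What the connector writes: alice's age (20) and height (5) as
  // 8-byte big-endian long/double, matching the scan output above.
  val ageBytes    = ByteBuffer.allocate(8).putLong(20L).array()   // \x00...\x00\x14
  val heightBytes = ByteBuffer.allocate(8).putDouble(5.0).array() // @\x14\x00...\x00

  // What the SHORT/FLOAT mapping reads back: only the first 2 and 4 bytes.
  val ageAsShort    = ByteBuffer.wrap(ageBytes).getShort     // 0
  val heightAsFloat = ByteBuffer.wrap(heightBytes).getFloat  // 2.3125

  println(s"age=$ageAsShort height=$heightAsFloat")
}
```

Interpreting the first two zero bytes of the long yields age 0, and the first four bytes of the double 5.0 (0x40140000) decode as the float 2.3125, exactly the values in the result table.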