HBASE-22711

Spark connector doesn't use the given mapping when inserting data


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: connector-1.0.0
    • Fix Version/s: connector-1.0.1
    • Component/s: hbase-connectors
    • Labels:
      None

      Description

      In some cases a Spark DataFrame cannot be read back with the same mapping it was written with. For example:

      val sql = spark.sqlContext
      
      val persons =
          """[
            |{"name": "alice", "age": 20, "height": 5, "email": "alice@alice.com"},
            |{"name": "bob", "age": 23, "height": 6, "email": "bob@bob.com"},
            |{"name": "carol", "age": 12, "email": "carol@carol.com", "height": 4.11}
            |]
          """.stripMargin
      
      val df = spark.read.json(Seq(persons).toDS)
      
      df.write
        .format("org.apache.hadoop.hbase.spark")
        .option("hbase.columns.mapping", "name STRING :key, age SHORT p:age, email STRING c:email, height FLOAT p:height")
        .option("hbase.table", "person")
        .option("hbase.spark.use.hbasecontext", false)
        .save()
      

      It cannot be read back with the same mapping:

      val df2 = sql.read
        .format("org.apache.hadoop.hbase.spark")
        .option("hbase.columns.mapping", "name STRING :key, age SHORT p:age, email STRING c:email, height FLOAT p:height")
        .option("hbase.table", "person")
        .option("hbase.spark.use.hbasecontext", false)
        .load()
      
      df2.createOrReplaceTempView("tableView")
      
      val results = sql.sql("SELECT * FROM tableView")
      results.show()
      

      The results:

      +---+-----+---------+---------------+
      |age| name|   height|          email|
      +---+-----+---------+---------------+
      |  0|alice|   2.3125|alice@alice.com|
      |  0|  bob|    2.375|    bob@bob.com|
      |  0|carol|2.2568748|carol@carol.com|
      +---+-----+---------+---------------+
      

      Spark stores integer values as longs and floating-point values as doubles, so the SHORT column and the FLOAT column are both written as 8-byte values in HBase:

      shell> scan 'person'
       alice                column=p:age, timestamp=1563450714829, value=\x00\x00\x00\x00\x00\x00\x00\x14
       alice                column=p:height, timestamp=1563450714829, value=@\x14\x00\x00\x00\x00\x00\x00
      
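The misread values can be reproduced without HBase at all: the cells hold 8-byte big-endian longs/doubles, while decoding with the SHORT/FLOAT mapping types consumes only the leading 2 or 4 bytes. A minimal sketch using plain java.nio.ByteBuffer (no connector classes; the decoding step is an assumption about what the read path does with the mapped types):

```scala
import java.nio.ByteBuffer

// Spark holds age as a long and height as a double, so the connector
// writes 8-byte big-endian cells (see the scan output above).
val ageCell    = ByteBuffer.allocate(8).putLong(20L).array()    // \x00..\x00\x14
val heightCell = ByteBuffer.allocate(8).putDouble(5.0).array()  // @\x14\x00..\x00

// Decoding with the mapping types reads only the leading bytes of each cell:
val ageAsShort    = ByteBuffer.wrap(ageCell).getShort    // first 2 bytes -> 0
val heightAsFloat = ByteBuffer.wrap(heightCell).getFloat // first 4 bytes -> 2.3125

println(s"age=$ageAsShort height=$heightAsFloat")
```

The same arithmetic explains bob's 2.375 (the first four bytes of the double 6.0 reinterpreted as a float) and the all-zero age column in the results above.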

      People

      • Assignee: meszibalu Balazs Meszaros
      • Reporter: meszibalu Balazs Meszaros
      • Votes: 0
      • Watchers: 2
