CarbonData / CARBONDATA-1700

Failed to load data to existing table after Spark session restarted


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.3.0
    • Component/s: data-load
    • Labels: None

    Description

      1. Scenario

      I encountered a failure when loading data into an existing CarbonData table, after querying that table following a Spark session restart. I hit this failure in Spark local mode (found during a local test) and have not tested other scenarios.

      The problem can be reproduced with the following steps:

      0. START: start a session;
      1. CREATE: create table `t1`;
      2. LOAD: create a dataframe and write-append it to `t1`;
      3. STOP: stop the current session;

      4. START: start a session;
      5. QUERY: query table `t1`; -- this step is essential to reproduce the problem.
      6. LOAD: create a dataframe and write-append it to `t1`; -- this step will fail.

      An error is thrown in step 6. The error message in the console looks like:

      ```
      java.lang.NullPointerException was thrown.
      java.lang.NullPointerException
      at org.apache.spark.sql.execution.command.management.LoadTableCommand.processData(LoadTableCommand.scala:92)
      at org.apache.spark.sql.execution.command.management.LoadTableCommand.run(LoadTableCommand.scala:60)
      at org.apache.spark.sql.CarbonDataFrameWriter.loadDataFrame(CarbonDataFrameWriter.scala:141)
      at org.apache.spark.sql.CarbonDataFrameWriter.writeToCarbonFile(CarbonDataFrameWriter.scala:50)
      at org.apache.spark.sql.CarbonDataFrameWriter.appendToCarbonFile(CarbonDataFrameWriter.scala:42)
      at org.apache.spark.sql.CarbonSource.createRelation(CarbonSource.scala:110)
      at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
      at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
      ```

      The following code can be pasted into `TestLoadDataFrame.scala` to reproduce the problem. Keep
      in mind that you must run the first test and then the second in separate runs, so that the SparkSession is actually restarted between them.

      ```
      test("prepare") {
        sql("drop table if exists carbon_stand_alone")
        sql(
          "create table if not exists carbon_stand_alone (c1 string, c2 string, c3 int)" +
          " stored by 'carbondata'").collect()
        sql("select * from carbon_stand_alone").show()
        // `df` is the DataFrame prepared elsewhere in TestLoadDataFrame.scala
        df.write
          .format("carbondata")
          .option("tableName", "carbon_stand_alone")
          .option("tempCSV", "false")
          .mode(SaveMode.Append)
          .save()
      }

      test("test load dataframe after query") {
        sql("select * from carbon_stand_alone").show()
        // the following write will cause the failure
        df.write
          .format("carbondata")
          .option("tableName", "carbon_stand_alone")
          .option("tempCSV", "false")
          .mode(SaveMode.Append)
          .save()
        // if it works fine, the check should pass
        checkAnswer(
          sql("select count(*) from carbon_stand_alone where c3 > 500"),
          Row(31500 * 2)
        )
      }
      ```

      2. Analysis

      I went through the code and found that the problem is caused by `tableMeta.carbonTable.getTableInfo.getFactTable.getTableProperties` (we will call it `propertyInTableInfo` for short) being NULL at line 89 of `LoadTableCommand.scala`.
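
      For illustration, a hedged sketch of the failing access pattern; the property key below is hypothetical, and only the accessor chain is taken from the code above:

      ```
      // Hypothetical sketch, not the actual CarbonData source: once
      // tableProperties comes back null from deserialization, any use of it
      // throws a NullPointerException.
      val tableProperties = tableMeta.carbonTable.getTableInfo
        .getFactTable.getTableProperties
      val sortScope = tableProperties.get("sort_scope") // NPE thrown here
      ```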

      After debugging, I found that the `propertyInTableInfo` set via `CarbonTableInputFormat.setTableInfo(...)` had the correct value, but the one returned by `CarbonTableInputFormat.getTableInfo(...)` did not. The setter serializes the TableInfo, while the getter deserializes it, which means something goes wrong in the serialization-deserialization round trip.

      Diving further into the code, I found that the serialization and deserialization of `TableSchema`, a member of `TableInfo`, ignore the `tableProperties` member, so the value is lost on deserialization. Since the member is never initialized in the constructor, it remains `NULL` after deserialization and causes the NPE.
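
      To make the mechanism concrete, here is a minimal, self-contained model of the bug. The class and field names mirror the description above, but this is an illustrative sketch, not the actual CarbonData code:

      ```
      import java.io._

      // A schema whose hand-written serialization forgets one field
      // (illustrative stand-in for TableSchema, not the real class).
      class MiniTableSchema extends Serializable {
        var tableName: String = _
        var tableProperties: java.util.Map[String, String] = _ // never initialized

        def write(out: DataOutput): Unit = {
          out.writeUTF(tableName)
          // BUG: tableProperties is never written
        }

        def readFields(in: DataInput): Unit = {
          tableName = in.readUTF()
          // BUG: tableProperties is never read, so it stays null
        }
      }

      object SerDeRoundTrip extends App {
        val schema = new MiniTableSchema
        schema.tableName = "t1"
        schema.tableProperties = new java.util.HashMap[String, String]()
        schema.tableProperties.put("sort_columns", "c1")

        // serialize (conceptually what setTableInfo does)
        val buffer = new ByteArrayOutputStream()
        schema.write(new DataOutputStream(buffer))

        // deserialize (conceptually what getTableInfo does)
        val restored = new MiniTableSchema
        restored.readFields(
          new DataInputStream(new ByteArrayInputStream(buffer.toByteArray)))

        println(restored.tableName)                   // "t1" survives the round trip
        restored.tableProperties.get("sort_columns")  // NullPointerException
      }
      ```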

      3. Resolution

      1. Initialize `tableProperties` in `TableSchema`.
      2. Include `tableProperties` in the serialization-deserialization of `TableSchema`.
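
      Continuing the minimal model above, a sketch of what the two fixes look like together; this is an assumed shape, not the actual CarbonData patch:

      ```
      import java.io._

      class FixedMiniTableSchema extends Serializable {
        var tableName: String = _
        // Fix 1: initialize at declaration so the field is never null
        var tableProperties: java.util.Map[String, String] =
          new java.util.HashMap[String, String]()

        def write(out: DataOutput): Unit = {
          out.writeUTF(tableName)
          // Fix 2: include tableProperties when serializing ...
          out.writeInt(tableProperties.size())
          val it = tableProperties.entrySet().iterator()
          while (it.hasNext) {
            val e = it.next()
            out.writeUTF(e.getKey)
            out.writeUTF(e.getValue)
          }
        }

        def readFields(in: DataInput): Unit = {
          tableName = in.readUTF()
          // Fix 2: ... and when deserializing
          var remaining = in.readInt()
          while (remaining > 0) {
            tableProperties.put(in.readUTF(), in.readUTF())
            remaining -= 1
          }
        }
      }
      ```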

      4. Notes

      Although the bug has been fixed, I still cannot explain why the problem is triggered in exactly the way described above.

      A test would require the SparkSession to be restarted, which is currently not possible in the test framework, so no tests will be added.

            People

              Assignee: xuchuanyin (Chuanyin Xu)
              Reporter: xuchuanyin (Chuanyin Xu)


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2.5h