Spark / SPARK-16996

Hive ACID delta files not seen

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.5.2, 1.6.3, 2.1.2, 2.2.0
    • Fix Version/s: None
    • Component/s: SQL
    • Environment: Hive 1.2.1, Spark 1.5.2

    Description

      spark-sql does not seem to see data stored as delta files in an ACID Hive table.

      I encountered the same problem as described here: http://stackoverflow.com/questions/35955666/spark-sql-is-not-returning-records-for-hive-transactional-tables-on-hdp

      For example, create an ACID table with the Hive CLI and insert a row:

      -- Session settings that enable Hive transactions (ACID)
      set hive.support.concurrency=true;
      set hive.enforce.bucketing=true;
      set hive.exec.dynamic.partition.mode=nonstrict;
      set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
      -- Compaction settings
      set hive.compactor.initiator.on=true;
      set hive.compactor.worker.threads=1;
      -- A transactional table: bucketed and stored as ORC
      CREATE TABLE deltas(cle string, valeur string) CLUSTERED BY (cle) INTO 1 BUCKETS
        ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
        STORED AS
          INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
        TBLPROPERTIES ('transactional'='true');
      
      INSERT INTO deltas VALUES("a","a");
      

      Then run a query from the spark-sql CLI:

      SELECT * FROM deltas;
      

      That query returns no rows, and there are no errors in the logs.
      If you inspect the table files on HDFS, you find only a delta directory:

      ~>hdfs dfs -ls /apps/hive/warehouse/deltas
      Found 1 items
      drwxr-x---   - me hdfs          0 2016-08-10 14:03 /apps/hive/warehouse/deltas/delta_0020943_0020943
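
To make the directory layout easier to reason about, here is a minimal Python sketch that classifies table subdirectories by name. The naming convention it relies on (`base_<txn>` and `delta_<min>_<max>`) is inferred from the listings in this report, and `classify_acid_dir` is a hypothetical helper written for illustration; it is not part of Hive or Spark.

```python
import re

# Assumed naming convention, inferred from the HDFS listings in this report:
#   base_<txn>          compacted data up to transaction <txn>
#   delta_<min>_<max>   rows written by transactions <min>..<max>
ACID_DIR = re.compile(r"^(?:base_(\d+)|delta_(\d+)_(\d+))$")


def classify_acid_dir(path):
    """Return ('base', txn), ('delta', min_txn, max_txn), or None."""
    # Only the last path component carries the ACID information.
    name = path.rstrip("/").rsplit("/", 1)[-1]
    match = ACID_DIR.match(name)
    if match is None:
        return None
    if match.group(1) is not None:
        return ("base", int(match.group(1)))
    return ("delta", int(match.group(2)), int(match.group(3)))


print(classify_acid_dir("/apps/hive/warehouse/deltas/delta_0020943_0020943"))
```

For the listing above this reports a single delta covering transaction 20943 and no base, which is the state in which spark-sql returned no rows.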
      

      Then run a major compaction on that table (in the Hive CLI):

      ALTER TABLE deltas COMPACT 'MAJOR';
      

      As a result, the delta is compacted into a base directory:

      ~>hdfs dfs -ls /apps/hive/warehouse/deltas
      Found 1 items
      drwxrwxrwx   - me hdfs          0 2016-08-10 15:25 /apps/hive/warehouse/deltas/base_0020943
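
The relationship between the delta and the resulting base name can be sketched as follows. This assumes that a major compaction names the base after the highest transaction id among the compacted deltas, zero-padded to seven digits, which matches the two listings above; `expected_base_name` is a hypothetical helper for illustration only.

```python
def expected_base_name(delta_names):
    """Guess the base directory name a major compaction would produce.

    Assumption: the base is named after the highest transaction id found
    in the delta directory names (delta_<min>_<max>), padded to 7 digits.
    """
    max_txn = max(int(name.split("_")[2]) for name in delta_names)
    return "base_%07d" % max_txn


print(expected_base_name(["delta_0020943_0020943"]))  # base_0020943
```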
      

      Go back to spark-sql, and the same query now returns a result:

      SELECT * FROM deltas;
      a       a
      Time taken: 0.477 seconds, Fetched 1 row(s)
      

      But the next time you insert into the Hive table:

      INSERT INTO deltas VALUES("b","b");
      

      spark-sql sees the change immediately:

      SELECT * FROM deltas;
      a       a
      b       b
      Time taken: 0.122 seconds, Fetched 2 row(s)
      

      No further compaction has run, yet spark-sql now "sees" both the base and the delta directory:

      ~> hdfs dfs -ls /apps/hive/warehouse/deltas
      Found 2 items
      drwxrwxrwx   - valdata hdfs          0 2016-08-10 15:25 /apps/hive/warehouse/deltas/base_0020943
      drwxr-x---   - valdata hdfs          0 2016-08-10 15:31 /apps/hive/warehouse/deltas/delta_0020956_0020956
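
For comparison, here is a simplified Python sketch of which directories one would expect an ACID-aware reader to merge: the newest base, if any, plus every delta whose transactions are newer than that base. This is an illustration under stated assumptions, not Hive's actual selection logic (it ignores open or aborted transactions and overlapping deltas), and `dirs_to_read` is a hypothetical name.

```python
def dirs_to_read(names):
    """Pick the newest base (if any) plus all deltas newer than it."""
    bases = []
    deltas = []
    for name in names:
        parts = name.split("_")
        if parts[0] == "base" and len(parts) == 2:
            bases.append((int(parts[1]), name))
        elif parts[0] == "delta" and len(parts) == 3:
            deltas.append((int(parts[1]), int(parts[2]), name))

    # With no base at all, every delta is needed.
    best_base = max(bases) if bases else None
    base_txn = best_base[0] if best_base else -1

    selected = [best_base[1]] if best_base else []
    # A delta is still needed when it ends after the base's transaction id.
    selected += [name for _, hi, name in sorted(deltas) if hi > base_txn]
    return selected


# State right after the first INSERT: only a delta exists.
print(dirs_to_read(["delta_0020943_0020943"]))
# State after compaction and the second INSERT: base plus a newer delta.
print(dirs_to_read(["base_0020943", "delta_0020956_0020956"]))
```

Under this reading, the delta-only state should already return row "a", which is the case where spark-sql returned nothing.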
      

            People

              Assignee: Unassigned
              Reporter: Benjamin BONNET (bbonnet)
              Votes: 2
              Watchers: 19