HIVE-15189: No base file for ACID table


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 1.2.1
    • Fix Version/s: None
    • Component/s: Transactions
    • Labels: None
    • Environment: HDP 2.4, HDP 2.5

    Description

      Hi,
      When one creates a new ACID table and inserts data into it using INSERT INTO, Hive does not write a 'base' file: it only creates a delta (a minimal reproduction is sketched after the list below). That may lead to at least two issues:

      1. when you try to read it, you might get a 'serious problem' like this:
        java.lang.RuntimeException: serious problem
                at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
                at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
                at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
                at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
                at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
                at scala.Option.getOrElse(Option.scala:120)
                at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
                at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
                at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
                at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
                at scala.Option.getOrElse(Option.scala:120)
                at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
                at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
                at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
                at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
                at scala.Option.getOrElse(Option.scala:120)
                at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
                at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
                at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
                at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
                at scala.Option.getOrElse(Option.scala:120)
                at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
                at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
                at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
                at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
                at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
                at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
                at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
                at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:166)
                at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
                at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:635)
                at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:64)
                at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:311)
                at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
                at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
                at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.lang.reflect.Method.invoke(Method.java:498)
                at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
                at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
                at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
                at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
                at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
        Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: delta_0000000_0000000 does not start with base_
                at java.util.concurrent.FutureTask.report(FutureTask.java:122)
                at java.util.concurrent.FutureTask.get(FutureTask.java:192)
                at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
                ... 44 more
        

        I do not always get this error, but when I do, I need to drop and recreate the table.

      2. Spark SQL does not see the data as long as there is no base file. So until compaction has occurred, no data can be read through Spark SQL. See https://issues.apache.org/jira/browse/SPARK-16996
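
      For reference, a minimal sketch of the behaviour described above. The table name, columns and warehouse path are illustrative, and ACID also assumes hive.support.concurrency=true and hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager are set:

        -- Illustrative only: table name, columns and paths are made up.
        -- In Hive 1.2 an ACID table must be bucketed and stored as ORC.
        CREATE TABLE acid_demo (id INT, name STRING)
        CLUSTERED BY (id) INTO 2 BUCKETS
        STORED AS ORC
        TBLPROPERTIES ('transactional'='true');

        -- A plain INSERT INTO only writes a delta directory, for example
        --   .../warehouse/acid_demo/delta_0000001_0000001
        -- and no base_N directory exists until a major compaction has run.
        INSERT INTO TABLE acid_demo VALUES (1, 'a'), (2, 'b');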

      I know Hive always creates a base file when you INSERT OVERWRITE, but you cannot always use OVERWRITE instead of INTO: in my use case, I use a single statement that writes data both into partitions that already exist and into partitions that do not exist yet (sketched below).
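
      A sketch of the kind of statement I mean, together with the compaction that eventually produces a base file. The events table, its partition column ds and the staging table are hypothetical:

        -- Hypothetical dynamic-partition load: existing partitions receive new rows
        -- and missing partitions are created on the fly;
        -- assumes hive.exec.dynamic.partition.mode=nonstrict.
        INSERT INTO TABLE events PARTITION (ds)
        SELECT id, payload, ds FROM staging_events;

        -- INSERT OVERWRITE with the same statement would wipe the rows already stored
        -- in every partition it touches, so it is not a drop-in replacement here.

        -- A base file only appears once a major compaction runs, e.g.:
        ALTER TABLE events PARTITION (ds = '2016-11-10') COMPACT 'major';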


            People

              Assignee: Unassigned
              Reporter: Benjamin BONNET (bbonnet)
              Votes: 0
              Watchers: 3
