Uploaded image for project: 'Apache Hudi'
  1. Apache Hudi
  2. HUDI-313

Unable to SELECT COUNT(*) from a MOR realtime table

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • None
    • None

    Description

      While I query like this in Hive:

      SELECT COUNT(*) FROM hudi_test_rt;
      OR:
      SELECT COUNT(1) FROM hudi_test_rt;

      It returns:

      2019-10-21 17:38:27,895 [ERROR] [TezChild] |tez.TezProcessor|: java.lang.RuntimeException: java.io.IOException: java.lang.NumberFormatException: For input string: ""
          at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
          at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152)
          at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
          at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:62)
          at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419)
          at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
          at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
          at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
          at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
          at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:422)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
          at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
          at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
          at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)
      Caused by: java.io.IOException: java.lang.NumberFormatException: For input string: ""
          at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
          at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
          at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:379)
          at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
          ... 19 more
      Caused by: java.lang.NumberFormatException: For input string: ""
          at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
          at java.lang.Integer.parseInt(Integer.java:592)
          at java.lang.Integer.parseInt(Integer.java:615)
          at org.apache.hadoop.hive.serde2.ColumnProjectionUtils.getReadColumnIDs(ColumnProjectionUtils.java:186)
          at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:377)
          at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84)
          at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
          at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
          at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
          at org.apache.hudi.hadoop.HoodieParquetInputFormat.getRecordReader(HoodieParquetInputFormat.java:197)
          at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:222)
          at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:376)
          ... 20 more
      

      Some investigations:

      Basically, Hive try to update projection column ids during each getRecordReader stage. But for COUNT(*) or COUNT(1), they don't need any projection column id which is an empty string.
      And for Hudi, to support compaction in MOR table, Hudi manually adds three Hudi required columns in the projection column ids and make the column ids like "2,0,3".
      Therefore, when Hive trying to update projection column ids, it combines an empty string with Hudi required columns ids and finally get the column ids like ",2,0,3". This first comma will cause an error during the parsing stage.

       

      One possible solution is to add a method to check if the projection column ids start with comma. If it is start with comma, then remove first comma.

      Attachments

        Activity

          People

            Unassigned Unassigned
            wenningd Wenning Ding
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 20m
                20m