IMPALA-10310: Couldn't skip rows in parquet file

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 3.4.0
    • Fix Version/s: Impala 4.0.0, Impala 3.4.2
    • Component/s: Backend
    • Environment: None
    • Labels: ghx-label-4

    Description

      When an hdfs-parquet-scanner thread is assigned a ScanRange that contains multiple
      row groups, the skip-rows logic driven by the Parquet PageIndex processes rows incorrectly.

      Below is the error log:

      I1028 17:59:16.694046 1414911 status.cc:68] 1447f227b73a4d78:92d9a82600000fd1] Could not read definition level, even though metadata states there are 0 values remaining in data page. file=hdfs://path/to/file
          @           0xbf4286
          @          0x17bc0eb
          @          0x17737f7
          @          0x1773a0e
          @          0x1773d8a
          @          0x1774028
          @          0x17b9517
          @          0x174a22b
          @          0x17526fe
          @          0x140a78a
          @          0x1525908
          @          0x1526a03
          @          0x10e6169
          @          0x10e84c9
          @          0x10c7a86
          @          0x13750ba
          @          0x1375f89
          @          0x1b49679
          @     0x7ffb2eee1e24
          @     0x7ffb2bad935c
      I1028 17:59:16.694074 1414911 status.cc:126] 1447f227b73a4d78:92d9a82600000fd1] Couldn't skip rows in file hdfs://path/to/file
          @           0xbf5259
          @          0x1773a8a
          @          0x1773d8a
          @          0x1774028
          @          0x17b9517
          @          0x174a22b
          @          0x17526fe
          @          0x140a78a
          @          0x1525908
          @          0x1526a03
          @          0x10e6169
          @          0x10e84c9
          @          0x10c7a86
          @          0x13750ba
          @          0x1375f89
          @          0x1b49679
          @     0x7ffb2eee1e24
          @     0x7ffb2bad935c
      I1028 17:59:16.694101 1414911 runtime-state.cc:207] 1447f227b73a4d78:92d9a82600000fd1] Error from query 1447f227b73a4d78:92d9a82600000000: Couldn't skip rows in file hdfs://path/to/file.
      

      On a debug build, the corresponding failure is a DCHECK:

      F1030 14:06:38.700459 3148733 parquet-column-readers.cc:1258] 994968c01171b0bc:eab92b3f0000000a] Check failed: num_buffered_values_ >= num_rows (20000 vs. 40000) 
      *** Check failure stack trace: ***
          @          0x4e9322c  google::LogMessage::Fail()
          @          0x4e94ad1  google::LogMessage::SendToLog()
          @          0x4e92c06  google::LogMessage::Flush()
          @          0x4e961cd  google::LogMessageFatal::~LogMessageFatal()
          @          0x2bfa2c3  impala::BaseScalarColumnReader::SkipTopLevelRows()
          @          0x2bf9fcc  impala::BaseScalarColumnReader::StartPageFiltering()
          @          0x2bf99b4  impala::BaseScalarColumnReader::ReadDataPage()
          @          0x2bfbad8  impala::BaseScalarColumnReader::NextPage()
          @          0x2c5bc8c  impala::ScalarColumnReader<>::ReadValueBatch<>()
          @          0x2c1a67a  impala::ScalarColumnReader<>::ReadNonRepeatedValueBatch()
          @          0x2bae010  impala::HdfsParquetScanner::AssembleRows()
          @          0x2ba8934  impala::HdfsParquetScanner::GetNextInternal()
          @          0x2ba68ac  impala::HdfsParquetScanner::ProcessSplit()
          @          0x27d8d0b  impala::HdfsScanNode::ProcessSplit()
          @          0x27d7ee0  impala::HdfsScanNode::ScannerThread()
          @          0x27d723d  _ZZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS_18ThreadResourcePoolEENKUlvE_clEv
          @          0x27d9831  _ZN5boost6detail8function26void_function_obj_invoker0IZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS3_18ThreadResourcePoolEEUlvE_vE6invokeERNS1_15function_bufferE
          @          0x1fc4d9b  boost::function0<>::operator()()
          @          0x258590e  impala::Thread::SuperviseThread()
          @          0x258db92  boost::_bi::list5<>::operator()<>()
          @          0x258dab6  boost::_bi::bind_t<>::operator()()
          @          0x258da79  boost::detail::thread_data<>::run()
          @          0x3db95c9  thread_proxy
          @     0x7febc66e6e24  start_thread
          @     0x7febc313135c  __clone
      Picked up JAVA_TOOL_OPTIONS: -Xms34359738368 -Xmx34359738368 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/28ecfee554b03954bac9e77a73f4ce0c_pid2802027.hprof
      Wrote minidump to /path/to/minidumps/74dae046-c19d-4ad5-ea2603ae-ff139f7e.dmp
      


      All Parquet files were generated by Spark, with the default row group size of 128 MB.
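      The DCHECK in the debug log (num_buffered_values_ >= num_rows, 20000 vs. 40000)
      indicates the scanner asked a single data page to skip more rows than that page had
      buffered. Below is a minimal sketch of the invariant that must hold: a skip request
      spanning multiple pages has to be clamped to each page's buffered value count and
      split across pages. All names here are hypothetical illustrations, not Impala's
      actual implementation.

      ```cpp
      #include <algorithm>
      #include <cstdio>
      #include <vector>

      // Skip 'rows_to_skip' rows across a sequence of data pages, where
      // page_sizes[i] is the number of values buffered in page i.
      // Each page can only skip as many rows as it has buffered, which is
      // exactly the condition the DCHECK (num_buffered_values_ >= num_rows)
      // enforces for a single-page skip.
      long SkipRows(const std::vector<long>& page_sizes, long rows_to_skip) {
        long skipped = 0;
        for (long num_buffered_values : page_sizes) {
          if (rows_to_skip == 0) break;
          // Clamp the per-page skip so it never exceeds the buffered values.
          long n = std::min(num_buffered_values, rows_to_skip);
          rows_to_skip -= n;
          skipped += n;
        }
        return skipped;
      }

      int main() {
        // Two pages of 20000 values each, as in the DCHECK message: a
        // 40000-row skip must be split across both pages rather than
        // demanded from the first page alone.
        std::vector<long> pages = {20000, 20000};
        std::printf("%ld\n", SkipRows(pages, 40000));  // prints 40000
        return 0;
      }
      ```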

          People

            Assignee: guojingfeng
            Reporter: guojingfeng