IMPALA-2829

SEGV in AnalyticEvalNode touching NULL input_stream_



    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.3.0, Impala 2.5.0
    • Fix Version/s: Impala 2.5.0, Impala 2.3.2
    • Component/s: Backend


      A crash was reported in the following stack:

      Stack: [0x00007fe1c7c8b000,0x00007fe1c848c000],  sp=0x00007fe1c8489bd0,  free space=8186k
      Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
      C  [impalad+0x128fb26]  impala::BufferedTupleStream::rows_returned() const+0xc
      C  [impalad+0x12bd3b3]  impala::AnalyticEvalNode::GetNext(impala::RuntimeState*, impala::RowBatch*, bool*)+0x821
      C  [impalad+0x115f478]  impala::PlanFragmentExecutor::GetNextInternal(impala::RowBatch**)+0xec
      C  [impalad+0x115dc92]  impala::PlanFragmentExecutor::OpenInternal()+0x272
      C  [impalad+0x115d958]  impala::PlanFragmentExecutor::Open()+0x39e
      C  [impalad+0xf30d88]  impala::FragmentMgr::FragmentExecState::Exec()+0x26
      C  [impalad+0xf293a8]  impala::FragmentMgr::FragmentExecThread(impala::FragmentMgr::FragmentExecState*)+0x4c

      The issue may have been introduced by a recent fix for:
      IMPALA-2378: Part 2, IMPALA-2481: delete BufferedTupleStreams attached to batches
      commit 916f3b29

      I can reproduce this with the following query:

      select max(t3.c1), max(t3.c2)
      from (
        select
        avg( t1.timestamp_col )
          over (order by t1.id, t2.id rows between 5000 following and 50000 following) c1,
        avg( t2.timestamp_col )
          over (order by t1.id, t2.id rows between 5000 following and 50000 following) c2
        from alltypesagg t1 join alltypesagg t2 where t1.int_col = t2.int_col
      ) t3;

      The issue has to do with allocated memory that gets passed to the output row batch. Normally, memory is allocated from a mem pool and transferred to the output row batch once the pool reaches 8 MB. This can happen many times during execution of the analytic node and works fine in the general case. However, when the transfer is supposed to happen at eos, we end up attempting it twice, and the second attempt dereferences input_stream_, which is by then NULL.
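      The double-transfer failure mode can be sketched with a hypothetical minimal model. The names TupleStream, RowBatch, AnalyticNode, and TransferAtEos below are illustrative stand-ins, not Impala's actual classes; the point is that moving ownership of the stream into the batch leaves the member pointer null, so a second, unguarded transfer attempt dereferences NULL:

      ```cpp
      #include <cassert>
      #include <memory>
      #include <vector>

      // Illustrative stand-in for a buffered stream of rows.
      struct TupleStream {
          int rows_returned = 0;
      };

      // Illustrative output batch that takes ownership of attached streams.
      struct RowBatch {
          std::vector<std::unique_ptr<TupleStream>> streams;
      };

      struct AnalyticNode {
          std::unique_ptr<TupleStream> input_stream_ =
              std::make_unique<TupleStream>();

          // Transfers the stream to the output batch at eos. Without the
          // null check, calling this a second time would dereference a NULL
          // input_stream_, matching the reported SEGV pattern.
          bool TransferAtEos(RowBatch* batch) {
              if (input_stream_ == nullptr) return false;  // guard avoids the crash
              int rows = input_stream_->rows_returned;     // would SEGV if null
              (void)rows;
              // Moving the unique_ptr leaves input_stream_ null afterwards.
              batch->streams.push_back(std::move(input_stream_));
              return true;
          }
      };

      int main() {
          AnalyticNode node;
          RowBatch batch;
          assert(node.TransferAtEos(&batch) == true);   // first transfer succeeds
          assert(node.TransferAtEos(&batch) == false);  // second call is a no-op
          assert(batch.streams.size() == 1);            // ownership moved exactly once
          return 0;
      }
      ```

      The fix direction this suggests is to make the eos transfer idempotent (or to ensure it can only be reached once), rather than relying on callers never to re-enter GetNext after eos.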

      It shouldn't happen too frequently (none of our existing tests hit it), but it is unfortunately hard to predict when it will occur because it depends on both the query and the data.

      There isn't an easy general workaround, but small changes that affect the cardinality of the data or the output tuple size of the analytic eval node may shift when the data transfer happens and thus avoid the crash.


              • Assignee:
                mjacobs Matthew Jacobs