Kudu / KUDU-3582

Incomplete sidecar data returned by RpcContext::GetInboundSidecar()

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: rpc
    • Labels: None

    Description

      The Impala executor calls the KRPC sidecar API RpcContext::GetInboundSidecar() to read a serialized Thrift object from KRPC and then deserializes it. (See GetSidecar() at
      https://github.com/apache/impala/blob/master/be/src/rpc/sidecar-util.h#L60-L67)

      In a customer-reported case, extra workloads were added to an Impala cluster, which caused long delays for KRPCs between Impala daemons. The long delays caused the KRPCs to be cancelled, which in turn caused Impala queries to fail.

      impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.383988 1182322 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.243.38.160:27000 (fragment_instance_id=9940332ce09828fd:b751966300000632): took 59m57s. Error: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
      impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.384006 1182322 kudu-status-util.h:55] EndDataStream() to 10.243.38.160:27000 failed: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
      impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.384631 1182314 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.34.163.32:27000 (fragment_instance_id=9940332ce09828fd:b751966300000735): took 59m57s. Error: Aborted: EndDataStream RPC to 10.34.163.32:27000 is cancelled in state ON_OUTBOUND_QUEUE
      impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.384668 1182314 kudu-status-util.h:55] EndDataStream() to 10.34.163.32:27000 failed: Aborted: EndDataStream RPC to 10.34.163.32:27000 is cancelled in state ON_OUTBOUND_QUEUE
      impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420662 1182313 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.243.36.21:27000 (fragment_instance_id=9940332ce09828fd:b75196630000033a): took 1h. Error: Aborted: EndDataStream RPC to 10.243.36.21:27000 is cancelled in state ON_OUTBOUND_QUEUE
      impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420683 1182313 kudu-status-util.h:55] EndDataStream() to 10.243.36.21:27000 failed: Aborted: EndDataStream RPC to 10.243.36.21:27000 is cancelled in state ON_OUTBOUND_QUEUE
      impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420779 1182322 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.243.38.160:27000 (fragment_instance_id=9940332ce09828fd:b75196630000033b): took 1h. Error: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
      impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420799 1182322 kudu-status-util.h:55] EndDataStream() to 10.243.38.160:27000 failed: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
      impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.421937 1182314 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.34.163.32:27000 (fragment_instance_id=9940332ce09828fd:b75196630000043e): took 1h. Error: Aborted:

      The extra workloads were then removed and the Impala cluster was restarted. During the restart, many Impala daemons crashed. The stack traces from the core files and the log messages show that the Impala daemons received incomplete data from the KRPC sidecar. The incomplete data did not cause a Thrift deserialization failure, so the data, valid but incomplete, was not detected and handled properly.
      See the Impala Jira IMPALA-13107. The issue could not be reproduced locally.
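
      To illustrate how incomplete data can pass deserialization without an error, here is a minimal sketch in Python. It uses a toy length-prefixed row format, not Thrift or the KRPC wire format, and every name in it (encode_rows, decode_rows, wrap, unwrap) is hypothetical. It is not the fix merged for IMPALA-13107; it only shows that a buffer cut at a record boundary decodes cleanly, and that a declared size plus a checksum is one possible way for a receiver to notice the loss.

```python
import struct
import zlib


def encode_rows(rows):
    """Toy format: each row is a 4-byte big-endian length followed by UTF-8 bytes."""
    out = bytearray()
    for row in rows:
        data = row.encode("utf-8")
        out += struct.pack(">I", len(data)) + data
    return bytes(out)


def decode_rows(buf):
    """Decode rows until the buffer is exhausted; raise only on a partial row."""
    rows, pos = [], 0
    while pos < len(buf):
        if pos + 4 > len(buf):
            raise ValueError("truncated length prefix")
        (n,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        if pos + n > len(buf):
            raise ValueError("truncated row body")
        rows.append(buf[pos:pos + n].decode("utf-8"))
        pos += n
    return rows


def wrap(payload):
    """Prepend the payload size and CRC32 so a receiver can verify completeness."""
    return struct.pack(">II", len(payload), zlib.crc32(payload)) + payload


def unwrap(framed):
    """Return the payload, or raise if it is shorter than declared or corrupted."""
    if len(framed) < 8:
        raise ValueError("missing frame header")
    size, crc = struct.unpack_from(">II", framed, 0)
    payload = framed[8:]
    if len(payload) != size:
        raise ValueError(f"expected {size} bytes, got {len(payload)}")
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return payload


payload = encode_rows(["alpha", "beta", "gamma"])
# Cut the buffer at a row boundary: "alpha" occupies 4 + 5 = 9 bytes.
truncated = payload[:9]
# The decoder raises no error, yet two rows are silently missing.
partial = decode_rows(truncated)
```

      In this sketch, decode_rows(truncated) succeeds and yields only the first row, while unwrap() on a framed buffer truncated in the same way raises ValueError before any decoding takes place.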

      A quick fix on the Impala side was merged to mitigate the crashes. The issue needs further investigation in the KRPC internals.

            People

              Assignee: Unassigned
              Reporter: Wenzhe Zhou (wzhou)
              Votes: 0
              Watchers: 5
