  1. IMPALA
  2. IMPALA-10342

Flood of UDF warnings crashes the coordinator


Details

    • Bug
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • None
    • None
    • Backend
    • None

    Description

      Hi, when they encounter an error, both `get_json_object()` and `DecimalOperators::IntToDecimalVal` raise a warning.

      Due to their stateless nature, the resulting flood of warnings can easily overwhelm the cluster's processing capacity.

      To be specific, we have observed these bottlenecks:

      Exchange Receiver: the default value for `rpc_max_message_size` is 50MB. The flood of warning messages carried by ReportExecStatusPB may exceed that limit, causing status reports to be sent without a profile. Even if the report message stays under the limit, the bandwidth consumption is non-trivial.

      Storage: as in IMPALA-5256, the flood of warnings produces huge log files, since `stdout/stderr` are not redirected when glog rolls the logs. Because of this, we repeatedly had to clear log files and restart executors.

      Coordinator: runtime profiles are serialized to thrift and stored in the Coordinator's memory. The warning flood makes `Untracked Memory` rise rapidly. I took a heap profile (with pprof) and found that most of the memory was used by RuntimeProfile and strings.
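
To put the `rpc_max_message_size` bottleneck in rough numbers, the following is a back-of-the-envelope sketch, not Impala code. It assumes the 50MB figure above means 50 * 1024 * 1024 bytes and that warning text dominates the status report; `WarningsUntilLimit` and `ExceedsLimit` are hypothetical helper names introduced only for illustration.

```cpp
#include <cstdint>

// Limit quoted in the description, read here as 50 MiB (an assumption).
constexpr std::int64_t kMaxMessageBytes = 50LL * 1024 * 1024;

// Number of warnings of a given average size that fit under the limit.
// Ignores message framing and every other field of the status report.
// Precondition: avg_warning_bytes > 0.
constexpr std::int64_t WarningsUntilLimit(std::int64_t avg_warning_bytes) {
  return kMaxMessageBytes / avg_warning_bytes;
}

// True if `num_warnings` warnings of the given average size would exceed
// the limit on their own.
constexpr bool ExceedsLimit(std::int64_t num_warnings,
                            std::int64_t avg_warning_bytes) {
  return num_warnings * avg_warning_bytes > kMaxMessageBytes;
}
```

Under these assumptions, `WarningsUntilLimit(100)` evaluates to 524288, so roughly half a million warnings of 100 bytes each (an illustrative size, not a measured one) would fill the limit by themselves. A query that raises one warning per failing row can plausibly reach that on a large table.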

      Preliminary solution:

      We suffered a lot from this problem and have come up with a preliminary solution:

      1. Mute AddWarning(), which is the most straightforward fix.
      2. Introduce a query option to re-enable the warnings when needed.
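
The two steps above could look roughly like the sketch below. This is not the actual patch and does not use Impala's real classes: `QueryOptions`, `enable_udf_warnings`, and `WarningSink` are hypothetical names, and the dropped-warning counter is an illustrative extra that the description does not mention.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for per-query options. `enable_udf_warnings` is an
// illustrative name, not an existing Impala query option.
struct QueryOptions {
  bool enable_udf_warnings = false;  // warnings are muted unless re-enabled
};

// Hypothetical sink that a UDF-facing AddWarning() call could forward to.
class WarningSink {
 public:
  explicit WarningSink(const QueryOptions& opts)
      : enabled_(opts.enable_udf_warnings) {}

  // Stores the warning only when the query option re-enabled warnings.
  // Returns true if the message was stored, false if it was dropped.
  bool AddWarning(std::string msg) {
    if (!enabled_) {
      ++dropped_;  // keep a cheap counter instead of the message text
      return false;
    }
    warnings_.push_back(std::move(msg));
    return true;
  }

  const std::vector<std::string>& warnings() const { return warnings_; }
  std::size_t dropped() const { return dropped_; }

 private:
  bool enabled_;
  std::size_t dropped_ = 0;
  std::vector<std::string> warnings_;
};
```

With the default options every call to `AddWarning()` is dropped, so no warning strings accumulate in the profile or the status report. Setting the option to `true` for a single query restores the old behaviour when the warnings are needed for debugging.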

      Testing:

      With warning messages muted, the load on the coordinator nodes is greatly reduced and heap profiles are no longer dominated by RuntimeProfile.

       

      Update

      We encountered a similar crash with a `get_json_object()` query: each time the query is submitted, the Coordinator crashes.

      Log:

      # A fatal error has been detected by the Java Runtime Environment:
      #
      #  SIGSEGV (0xb) at pc=0x0000000002c64dca, pid=3633220, tid=0x00007eff73308700
      #
      # JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 1.8.0_181-b13)
      # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode linux-amd64 )
      # Problematic frame:
      # C  [impalad+0x2864dca]  tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)+0x13a
      #
      # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
      #
      # An error report file with more information is saved as:
      # /run/cloudera-scm-agent/process/10376-impala-IMPALAD/hs_err_pid3633220.log
      #
      # If you would like to submit a bug report, please visit:
      #   http://bugreport.java.com/bugreport/crash.jsp
      # The crash happened outside the Java Virtual Machine in native code.
      # See problematic frame for where to report the bug.
      #
      d. The connection had 2 associated session(s).
      
      I0427 13:43:03.907536 3853145 status.cc:126] Couldn't serialize thrift object:
      std::bad_alloc
          @           0xbf4ef9
          @          0x1352d5f
          @          0x1352eaf
          @          0x11986de
          @          0x122516c
          @          0x1225515
          @          0x137ee36
          @          0x13801a0
          @          0x139682f
          @          0x139915a
          @          0x1399784
          @     0x7f34791e0e24
          @     0x7f3475dd835c
      
      
      

      Stack trace:

      Stack: [0x00007eff72b08000,0x00007eff73309000],  sp=0x00007eff733006b0,  free space=8161k
      Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
      C  [impalad+0x2864dca]  tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)+0x13a
      C  [impalad+0x286519f]  tcmalloc::ThreadCache::Scavenge()+0x3f
      C  [impalad+0x29a211a]  operator delete(void*)+0x32a
      C  [impalad+0xae94d9]  impala::TRuntimeProfileNode::~TRuntimeProfileNode()+0x289
      C  [impalad+0xae4987]  impala::TRuntimeProfileTree::~TRuntimeProfileTree()+0x47
      C  [impalad+0xf5280a]  impala::RuntimeProfile::Compress(std::vector<unsigned char, std::allocator<unsigned char> >*) const+0x3aa
      C  [impalad+0xf52eb0]  impala::RuntimeProfile::SerializeToArchiveString(std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*) const+0x40
      C  [impalad+0xd986df]  impala::ImpalaServer::GetRuntimeProfileOutput(impala::TUniqueId const&, std::string const&, impala::TRuntimeProfileFormat::type, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, impala::TRuntimeProfileTree*, rapidjson::GenericDocument<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator>, rapidjson::CrtAllocator>*)+0x5bf
      C  [impalad+0xe2516d]  impala::ImpalaHttpHandler::QueryProfileHelper(kudu::WebCallbackRegistry::WebRequest const&, rapidjson::GenericDocument<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator>, rapidjson::CrtAllocator>*, impala::TRuntimeProfileFormat::type)+0x4ed
      C  [impalad+0xe25516]  impala::ImpalaHttpHandler::QueryProfileEncodedHandler(kudu::WebCallbackRegistry::WebRequest const&, rapidjson::GenericDocument<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator>, rapidjson::CrtAllocator>*)+0x16
      C  [impalad+0xf7ee37]  impala::Webserver::RenderUrlWithTemplate(sq_connection const*, kudu::WebCallbackRegistry::WebRequest const&, impala::Webserver::UrlHandler const&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, impala::ContentType*)+0x177
      C  [impalad+0xf801a1]  impala::Webserver::BeginRequestCallback(sq_connection*, sq_request_info*)+0x951
      C  [impalad+0xf96830]  kudu::StringGauge::~StringGauge()+0x100
      
      

      Attachments

        1. image-2021-04-28-20-20-45-798.png
          391 kB
          Fifteen
        2. image-2020-11-23-09-57-49-840.png
          51 kB
          Fifteen
        3. impalad-ram-profile.pdf
          22 kB
          Fifteen
        4. image-2020-11-19-17-30-22-918.png
          396 kB
          Fifteen

        Activity

          People

            fifteencai Fifteen
            fifteencai Fifteen
