Details
- Improvement
- Status: Resolved
- Major
- Resolution: Later
- Impala 2.6.0
- None
Description
There was a crash on the EC2 stress test cluster. In the hs_err_pid file we find:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f61f241dad0, pid=18718, tid=140039557764864
#
# JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build 1.7.0_67-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libstdc++.so.6+0x5cad0] __dynamic_cast+0x50
#
# Core dump written. Default location: /var/log/impalad/core or core.18718
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
and in the impalad.ERROR log:
Log file created at: 2016/03/15 00:04:31
Running on machine: impala-stress-cdh5-trunk2-2.vpc.cloudera.com
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0315 00:04:31.310118 18718 logging.cc:120] stderr will be logged to this file.
E0315 07:50:01.608139 6880 data-stream-sender.cc:259] channel send status: Couldn't open transport for impala-stress-cdh5-trunk2-2.vpc.cloudera.com:22000 (connect() failed: Connection timed out)
E0315 07:50:01.747846 6884 data-stream-sender.cc:259] channel send status: Couldn't open transport for impala-stress-cdh5-trunk2-2.vpc.cloudera.com:22000 (connect() failed: Connection timed out)
E0315 07:50:02.445477 8224 data-stream-sender.cc:259] channel send status: Couldn't open transport for impala-stress-cdh5-trunk2-2.vpc.cloudera.com:22000 (connect() failed: Connection timed out)
E0315 07:50:10.240705 8572 data-stream-sender.cc:259] channel send status: Couldn't open transport for impala-stress-cdh5-trunk2-2.vpc.cloudera.com:22000 (connect() failed: Connection timed out)
E0315 07:50:10.397904 7070 data-stream-sender.cc:259] channel send status: Couldn't open transport for impala-stress-cdh5-trunk2-2.vpc.cloudera.com:22000 (connect() failed: Connection timed out)
E0315 07:50:12.076678 7198 data-stream-sender.cc:259] channel send status: Couldn't open transport for impala-stress-cdh5-trunk2-2.vpc.cloudera.com:22000 (connect() failed: Connection timed out)
E0315 07:50:19.973160 8696 data-stream-sender.cc:259] channel send status: Couldn't open transport for impala-stress-cdh5-trunk2-2.vpc.cloudera.com:22000 (connect() failed: Connection timed out)
E0315 07:50:20.319783 8233 data-stream-sender.cc:259] channel send status: Couldn't open transport for impala-stress-cdh5-trunk2-2.vpc.cloudera.com:22000 (connect() failed: Connection timed out)
E0315 07:50:29.652158 7819 data-stream-sender.cc:259] channel send status: Couldn't open transport for impala-stress-cdh5-trunk2-2.vpc.cloudera.com:22000 (connect() failed: Connection timed out)
E0315 07:50:30.150673 7084 data-stream-sender.cc:259] channel send status: Couldn't open transport for impala-stress-cdh5-trunk2-2.vpc.cloudera.com:22000 (connect() failed: Connection timed out)
E0315 07:50:35.144541 9215 data-stream-sender.cc:259] channel send status: Couldn't open transport for impala-stress-cdh5-trunk2-2.vpc.cloudera.com:22000 (connect() failed: Connection timed out)
E0315 07:50:35.868957 8489 data-stream-sender.cc:259] channel send status: Couldn't open transport for impala-stress-cdh5-trunk2-2.vpc.cloudera.com:22000 (connect() failed: Connection timed out)
E0315 07:50:39.297054 7494 data-stream-sender.cc:259] channel send status: Couldn't open transport for impala-stress-cdh5-trunk2-3.vpc.cloudera.com:22000 (connect() failed: Connection timed out)
Picked up JAVA_TOOL_OPTIONS:
hdfsOpenFile(hdfs://impala-stress-cdh5-trunk2-1.vpc.cloudera.com:8020/user/hive/warehouse/tpch_3_nested_parquet.db/customer/000001_0): FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream;) error:
FSDataInputStream#close error:
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::lock_error> >'
what(): boost: mutex lock failed in pthread_mutex_lock: Invalid argument
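To make a log like the one above easier to summarize, here is a minimal Python sketch that parses lines following the "Log line format" header and counts entries per source location. The regular expression is my reading of that header, and the names GLOG_LINE, parse_glog_line and count_by_source are hypothetical helpers, not part of Impala or glog.

```python
import re
from collections import Counter

# Assumed layout, taken from the "Log line format" header in the log:
# [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<mmdd>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def parse_glog_line(line):
    """Return the fields of a glog-style line as a dict, or None if it does not match."""
    match = GLOG_LINE.match(line)
    return match.groupdict() if match else None


def count_by_source(lines):
    """Count matching log entries per (severity, file, line) triple."""
    counts = Counter()
    for line in lines:
        entry = parse_glog_line(line)
        if entry is not None:
            counts[(entry["severity"], entry["file"], int(entry["line"]))] += 1
    return counts


sample = [
    "E0315 00:04:31.310118 18718 logging.cc:120] stderr will be logged to this file.",
    "E0315 07:50:01.608139 6880 data-stream-sender.cc:259] channel send status: "
    "Couldn't open transport for impala-stress-cdh5-trunk2-2.vpc.cloudera.com:22000 "
    "(connect() failed: Connection timed out)",
    "Picked up JAVA_TOOL_OPTIONS:",  # not a glog line, so it is skipped
]

for key, count in count_by_source(sample).items():
    print(key, count)
```

Lines that do not carry the glog prefix (the JVM and boost messages at the end of the log) are skipped rather than guessed at.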
Issue Links
- is blocked by IMPALA-3299 Stress test failures: Couldn't open transport for impala-stress-cdh5-trunk2-5.vpc.cloudera.com:22000 (Resolved)
Activity
How about we make this JIRA about the stress test not killing impalad and/or starting the next test until impalad is quiesced? It'd be good to avoid this kind of crash in the future, and who knows what the effect is on the next test round when it doesn't crash.
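A minimal sketch of the "wait until impalad is quiesced" idea, in Python. Everything here is an assumption for illustration: wait_until_quiesced is a hypothetical helper, and count_in_flight is a caller-supplied callback that reports how many queries are still running; how the stress test would actually obtain that number is not specified in this JIRA.

```python
import time


def wait_until_quiesced(count_in_flight, timeout_s=300.0, poll_interval_s=5.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll count_in_flight() until it returns 0 or timeout_s elapses.

    Returns True if the cluster went idle, False if the timeout was hit.
    count_in_flight is a hypothetical callback supplied by the caller.
    """
    deadline = clock() + timeout_s
    while True:
        if count_in_flight() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


# Usage with a fake counter that drains over three polls.
remaining = iter([3, 1, 0])
if wait_until_quiesced(lambda: next(remaining), timeout_s=10, poll_interval_s=0):
    print("idle: safe to restart or start the next test")
else:
    print("still busy: skip the restart and report the overload instead")
```

Returning False on timeout rather than raising leaves the decision to the caller, which matters here because the overloaded state is itself the failure worth reporting.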
I think we should close this JIRA as "Won't Fix", but we could also close it when IMPALA-3299 is resolved (which should make us stop seeing this crash, if not actually address the root problem).
What is this JIRA tracking now? That we fix the stress test to wait for impalad to be idle before trying to restart it?
I'm marking this as dependent on a new JIRA, IMPALA-3299, which is to investigate the job failures. Given that these crashes occur while shutting down Impala, I don't think it's worth fixing the crashes directly.
The private stress job I previously mentioned (http://sandbox.jenkins.cloudera.com/view/Impala/view/Stress/job/impala-stress-test-skye/9) crashed because I loaded the wrong branch, so ignore that failure.
I think I figured out roughly what causes the crash. Quite often, the stress job will fail because the impalads become unresponsive (but still alive), presumably due to admitting too many queries at once. http://sandbox.jenkins.cloudera.com/job/Impala-Stress-Test-EC2-CDH5-trunk/621/console is an example of this. I verified the impalads did not crash by noting that no core dumps are produced, and manually checking on the cluster machines that the impalad processes are still running.
http://sandbox.jenkins.cloudera.com/view/Impala/view/Stress/job/Impala-Stress-Test-CDH5-trunk-nightly/ runs several stress jobs one after the other. The fourth job often fails as described above, and then the fifth job switches from a debug to a release build, meaning it will try to restart the cluster. I think restarting the cluster when it's in the overloaded state causes the crash. I've verified this by manually running the two jobs one after the other, and noting that the debug impalad process produces a core dump when the second job runs. Sometimes the second job will fail trying to restart the cluster (presumably because it's overloaded), and sometimes it will successfully start the release build and continue running the stress test.
Given the types of errors we've seen, I suspect we're just hitting various edge cases that can happen when you tear down the process while it's running queries. Also notable is that the jobs aren't failing because of the crash; they're failing because the impalads become unresponsive.
I have a repro on a private stress cluster. Interestingly enough, trunk no longer seems to exhibit this crash, although it still fails, I think because it gets too overloaded and starts timing out (see http://sandbox.jenkins.cloudera.com/view/Impala/view/Stress/job/impala-stress-test-skye/8/console). However, when I load cf8c982, I see the crash (http://sandbox.jenkins.cloudera.com/view/Impala/view/Stress/job/impala-stress-test-skye/9/console).
I'll start binary searching the commits to see what introduced the crash. However, given that it went away, it might be some kind of timing issue unrelated to the specific commit. Either way we should understand what's causing it.
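For reference, the binary search over commits can be sketched in Python as below. This is an illustration, not the procedure actually used: first_bad_commit and crashes_in_any_run are hypothetical helpers, and the search assumes the commits are ordered oldest to newest, that the first is known good and the last known bad. Because the crash may be timing-dependent, the wrapper retries each commit a few times before calling it good.

```python
def first_bad_commit(commits, is_bad):
    """Return the earliest commit for which is_bad(commit) is True.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one is bad.
    """
    if len(commits) < 2:
        raise ValueError("need at least one known-good and one known-bad commit")
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


def crashes_in_any_run(run_once, attempts=3):
    """Wrap a flaky check: a commit counts as bad if any of `attempts` runs crashes."""
    def is_bad(commit):
        return any(run_once(commit) for _ in range(attempts))
    return is_bad


# Usage with placeholder commit names; "c5" is the pretend culprit.
history = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
bad_from = history.index("c5")
check = crashes_in_any_run(lambda c: history.index(c) >= bad_from)
print(first_bad_commit(history, check))
```

Retrying only reduces the chance of mislabeling a bad commit as good; if the crash is a pure timing issue, the monotonicity assumption itself may not hold and the result should be treated with suspicion.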
Just an update on this: I'm trying to spin up a private stress cluster so I can easily repro and binary search which commit introduced this. It was taking a while due to cloudcat issues but it looks like it's working now (http://sandbox.jenkins.cloudera.com/view/Impala/view/Stress/job/impala-stress-test-skye/6/). If this is successful, I'll be able to start narrowing down the cause when I get back on Monday.
We never saw the bug in Impala 2.5.0, which we did plenty of stress testing on. So assuming that the problematic commit is in cf8c982 but not in the 2.5.0 branch, this gives us:
I00a43a69512c07eb351cf13bd1353226719744ca
commit cf8c982c9afe0a3d1ccec3e0ae9b89860b44e3d2
Author: Michael Ho <kwho@cloudera.com>
Date: Mon Feb 29 18:00:01 2016 -0800

    IMPALA-3068: Don't call exit(1) on fatal errors.

    When hitting certain fatal errors, Impalad may call exit(1) directly or indirectly via macro like EXIT_WITH_ERROR() to exit Impalad. Usually, there are error messages logged before exit(1) is called but they may not have been flushed to the log file before exit occurs. In addition, exit(1) may trigger destruction of some global variables (e.g. static_mem_trackers_lock_) which can cause other threads to crash and core dump, potentially masking the original error.

    This change converts places in the code which call exit(1) to use LOG(FATAL) instead. LOG(FATAL) will call abort() instead of exit(1), so destructors won't be called on global variables. In addition, it will flush all logs before calling abort(). This is essentially the path the code takes when a DCHECK is triggered.

    This change also installs a "fatal error handler" for LLVM. It's invoked right before LLVM calls exit(1) when it hits a "fatal error". This hopefully helps capture some useful trace in the log in the rare case LLVM hits any "fatal error".

    Change-Id: I00a43a69512c07eb351cf13bd1353226719744ca
    Reviewed-on: http://gerrit.cloudera.org:8080/2364
    Reviewed-by: Michael Ho <kwho@cloudera.com>
    Tested-by: Internal Jenkins

I04846b45738be37d372927992ff666801a3b29ad
commit 427d5227aeda804add4ba499532db01e214bf335
Author: Tim Armstrong <tarmstrong@cloudera.com>
Date: Wed Mar 2 15:34:00 2016 -0800

    IMPALA-2949: enable -Werror for debug builds

    Also clean up some headers a little by removing unneeded imports and moving a function that didn't need to be in header into .cc file.

    Change-Id: I04846b45738be37d372927992ff666801a3b29ad
    Reviewed-on: http://gerrit.cloudera.org:8080/2420
    Reviewed-by: Alex Behm <alex.behm@cloudera.com>
    Reviewed-by: Dan Hecht <dhecht@cloudera.com>
    Tested-by: Internal Jenkins

I0966a409f61ab5d54c09b71e9ed149d561fa43ae
commit 6fbbbab0049faf1beb6c21e0c6f677999b3e848d
Author: casey <casey@cloudera.com>
Date: Fri Mar 4 00:45:54 2016 -0800

    Fix typo in assigning the default IMPALA_BUILD_THREADS

    A '$' was missing so IMPALA_BUILD_THREADS would not be set. Also change to use the ':=' version since IMPALA_BUILD_THREADS is defined but null in people's environments (due to the export in the line below).

    Change-Id: I0966a409f61ab5d54c09b71e9ed149d561fa43ae
    Reviewed-on: http://gerrit.cloudera.org:8080/2454
    Reviewed-by: Alex Behm <alex.behm@cloudera.com>
    Tested-by: Internal Jenkins

I186174db136694d3df04d9159362c6eeaa70b5b8
commit e81dd9f3bb68cf4e9d46fd989f6996cc5f8601f5
Author: Michael Brown <mikeb@cloudera.com>
Date: Fri Feb 12 12:59:27 2016 -0800

    Leopard front end: protect cases where report and schedule paths are absent

    If the leopard front end is run on a pristine system, the front end will raise exceptions, because the directories containing reports and schedules do not yet exist. Since it's the controller's responsibility to create and populate these filesystem artifacts, in the frontend, let's just protect against the exception and log warnings.

    Testing: Before the fix, the reload_reports thread would immediately raise an exception and the thread would die; loading the Web app in a browser would cause an internal server error. After the fix, the reload_reports thread logs a warning but continues to run, and loading the Web app in a browser presents an appropriately empty listing of run names and schedules.

    Change-Id: I186174db136694d3df04d9159362c6eeaa70b5b8
    Reviewed-on: http://gerrit.cloudera.org:8080/2164
    Reviewed-by: Michael Brown <mikeb@cloudera.com>
    Tested-by: Internal Jenkins

I3639dcf86c14778967c0079b8dbc222a4516cf05
commit 528ed796822b895fc7e429909d49d9c62713e892
Author: Dan Hecht <dhecht@cloudera.com>
Date: Wed Mar 9 14:34:10 2016 -0800

    Remove AMD Opteron Rev E workaround from atomicops

    Impala doesn't run on Opteron Rev E because those CPUs don't support SSSE3. So let's not pay the price of this cmp/branch on the atomics path (which is used by our SpinLock, and I plan to make our Atomic class use these functions in a future change). Note that gperfutil also removed this RevE workaround since these CPUs are getting old and apparently many kernels don't have the workaround anyway.

    Let's call the init function for atomicops, even though it's basically a no-op for 64-bit mode (since these features are always available). But this future proofs the code a bit better.

    Change-Id: I3639dcf86c14778967c0079b8dbc222a4516cf05
    Reviewed-on: http://gerrit.cloudera.org:8080/2516
    Reviewed-by: Dan Hecht <dhecht@cloudera.com>
    Tested-by: Internal Jenkins

I3b02931a97f695a3002fede950ce24be1c20da40
commit d35b32c8ad8ca02f6ebaf2f12c4f08746a8fb9ad
Author: Casey Ching <casey@cloudera.com>
Date: Tue Mar 8 23:04:18 2016 -0800

    Update gitignore for test resources

    Change-Id: I3b02931a97f695a3002fede950ce24be1c20da40
    Reviewed-on: http://gerrit.cloudera.org:8080/2496
    Reviewed-by: Casey Ching <casey@cloudera.com>
    Tested-by: Casey Ching <casey@cloudera.com>

I3bafc6aaf80d1d8d2a634d120d9dbdb954d3f0c4
commit 158b9d88b795c4dc3d100d4895c6f858f17976fd
Author: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Date: Thu Feb 11 14:44:17 2016 -0800

    IMPALA-1772: Add additional date/time functions.

    Implemented the 'millisecond' built-in function, which takes a timestamp and returns an integer representing its millisecond portion. Other functions pending.

    Change-Id: I3bafc6aaf80d1d8d2a634d120d9dbdb954d3f0c4
    Reviewed-on: http://gerrit.cloudera.org:8080/2148
    Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
    Tested-by: Internal Jenkins

I468355133fcceec12a8c7e8b4f46bf9c932aaf2c
commit 8ba984795ce153bc5e6b08f50b911542cd496f9c
Author: Casey Ching <casey@cloudera.com>
Date: Mon Mar 7 12:12:08 2016 -0800

    Reduce the impalad mem limit for the mini-cluster

    The default 80% process limit basically never gets used because there are 3 impalads in the default mini-cluster which means 240% of the system memory. Either impalad gets OOM killed or the system will have a huge swap. Reducing the mini-cluster mem limit seems like a better alternative.

    Change-Id: I468355133fcceec12a8c7e8b4f46bf9c932aaf2c
    Reviewed-on: http://gerrit.cloudera.org:8080/2472
    Reviewed-by: Casey Ching <casey@cloudera.com>
    Tested-by: Internal Jenkins

I4f5c3d1df7ac979f6a98d8d320d717d30779bae6
commit 73b63d2c72ce4fd4a32e8f6beee012911ec99916
Author: Tim Armstrong <tarmstrong@cloudera.com>
Date: Thu Mar 10 13:25:23 2016 -0800

    Refer to local rather than member in PAGG::ProcessBatchStreaming()

    The code referred to the correct HashTableCtx via the wrong path. LLVM is able to generate slightly better code after the fix.

    Change-Id: I4f5c3d1df7ac979f6a98d8d320d717d30779bae6
    Reviewed-on: http://gerrit.cloudera.org:8080/2524
    Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
    Tested-by: Internal Jenkins

I5bfeee04e17cc6bb95c11669dc9f7c650e48bd32
commit 77aab8e166fc972b787689b9c25a5d75e0def93b
Author: Tim Armstrong <tarmstrong@cloudera.com>
Date: Fri Mar 4 14:58:41 2016 -0800

    IMPALA-2892: buffered-tuple-stream-ir.cc is not cross-compiled

    The file was never added to impala-ir.cc so was never cross-compiled. I did a quick benchmark and saw no benefit from cross-compiling the code, so this patch just moves the contents of the -ir.cc to the .cc file.

    Change-Id: I5bfeee04e17cc6bb95c11669dc9f7c650e48bd32
    Reviewed-on: http://gerrit.cloudera.org:8080/2465
    Reviewed-by: Dan Hecht <dhecht@cloudera.com>
    Tested-by: Internal Jenkins

I619c9e708794aaf1138bdc0087fbe82539c0817e
commit 72b79b93a01096806f61aa6d5377965105e117d8
Author: Skye Wanderman-Milne <skye@cloudera.com>
Date: Tue Mar 8 10:11:39 2016 -0800

    IMPALA-3132: link ImpalaUdf first in UDF tests

    ImpalaUdf provides test implementations of MemTracker, FreePool, and RuntimeState for use with UDF tests. We attempted to link these test definitions in udf-test, uda-test, and aggregate-functions-test by replacing the Udf lib dependency with ImpalaUdf. While this caused us to use the correct compiled version of udf.cc, the linker would still use the Runtime implementations of the class methods rather than the ImpalaUdf definitions, since Runtime case first in the link order.

    This patch addresses this by making ImpalaUdf the first linked lib for UDF tests (and as a result we no longer need to remove the Udf lib). It also adds calls to InitCommonRuntime() in the affected tests so their output isn't redirected by default.

    Ideally the UDF tests wouldn't need to depend on Runtime at all (or specifically the parts redefined by ImpalaUdf), but it will require major refactoring to make this possible.

    Change-Id: I619c9e708794aaf1138bdc0087fbe82539c0817e
    Reviewed-on: http://gerrit.cloudera.org:8080/2471
    Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
    Tested-by: Internal Jenkins

I75a7e6a1d96e5d92368f73c1e5e6a6b288932497
commit 0c7bc7eef6600ebb696c316d443c317fd7090d9a
Author: casey <casey@cloudera.com>
Date: Thu Mar 3 16:52:31 2016 -0800

    Misc query generator improvements

    Changes:
    1) Improve generation of boolean expressions to be more realistic. Previously functions like And and Or were very unlikely to be chosen because they only use Booleans and the generator prefers functions that use Ints. This is especially bad now that ON expressions may be arbitrary (previously they were hard coded to use Ands and Equals). Now And/Or are special cased when generating a Boolean expressions so they'll be more likely to be used.
    2) Randomly choose query options. This is very basic. With this change Impala gets OOM killed because there are 3 nodes running locally and each of them is configured to use 80% of the total memory. Another update will be needed to start the mini-cluster with less memory.
    3) Fix bug where a generated function tree would not be a tree at all, it would just be one function.

    Change-Id: I75a7e6a1d96e5d92368f73c1e5e6a6b288932497
    Reviewed-on: http://gerrit.cloudera.org:8080/2445
    Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
    Tested-by: Internal Jenkins

I87b9f0e765f8fdbc5286edcb7a1edf7e082b54d0
commit 470b259aa62164182aaa9be74fa2fda22bd97a06
Author: Casey Ching <casey@cloudera.com>
Date: Tue Mar 8 14:18:38 2016 -0800

    Query Gen: Set randomization seed during data generation

    The data generation was written to use a randomization seed but the seed was never actually passed to it. Now the data is the same when the generator is run multiple times.

    Change-Id: I87b9f0e765f8fdbc5286edcb7a1edf7e082b54d0
    Reviewed-on: http://gerrit.cloudera.org:8080/2493
    Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
    Tested-by: Casey Ching <casey@cloudera.com>

I8b351b7f467e8bef0c256dc43cea325d7f177edf
commit 0d3a0902c7c2b23cc3ccd35899b836004f67ebe0
Author: Alex Behm <alex.behm@cloudera.com>
Date: Wed Mar 2 14:19:55 2016 -0800

    Improve the SQL for nested TPCH-Q18.

    Marcel spotted that nested TPCH-Q18 can be expressed with more efficient SQL.

    Results on nested TPCH-300:
    Before 160s
    After 100s

    Change-Id: I8b351b7f467e8bef0c256dc43cea325d7f177edf
    Reviewed-on: http://gerrit.cloudera.org:8080/2418
    Reviewed-by: Alex Behm <alex.behm@cloudera.com>
    Tested-by: Internal Jenkins

I8eabb5c062ee223c5de9df40aacfdc9dcda5ba63
commit ac6c3de0d3cebd72c3825faa161001e12ba82394
Author: Skye Wanderman-Milne <skye@cloudera.com>
Date: Wed Feb 10 18:21:09 2016 -0800

    Specify whether to clone IR function in GetFunction() instead of ReplaceCallSites()

    This patch changes the signature of LlvmCodegen::GetFunction() to include whether a clone of the requested function should be returned, and ReplaceCallSites() always modifies the input function in-place. Before, ReplaceCallSites() could optionally clone the input function. Cloning the function in GetFunction() and occasionally via explicit LlvmCodegen::CloneFunction() calls makes the logic easier to follow and get right (for example, before we had cases where we were unnecessarily cloning the function multiple times across multiple ReplaceCallSites() calls).

    Change-Id: I8eabb5c062ee223c5de9df40aacfdc9dcda5ba63
    Reviewed-on: http://gerrit.cloudera.org:8080/2112
    Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
    Tested-by: Internal Jenkins

Ia1b8b6de080f7bdc59a5e350dadd36502aaecc4e
commit be6eb666d5d2c8296236599c1f30fcb992450d2a
Author: Juan Yu <jyu@cloudera.com>
Date: Wed Mar 2 10:19:25 2016 -0800

    Bump Impala version to 2.6.0

    Change-Id: Ia1b8b6de080f7bdc59a5e350dadd36502aaecc4e
    Reviewed-on: http://gerrit.cloudera.org:8080/2409
    Reviewed-by: Harrison Sheinblatt <hs7@hotmail.com>
    Tested-by: Juan Yu <jyu@cloudera.com>

Ia7bd39861162e85b9091a0ef41ed1c317737e66c
commit 13e3818a4577437a3e197f4a5f482dac350bf120
Author: casey <casey@cloudera.com>
Date: Wed Mar 9 12:57:55 2016 -0800

    Misc stress test updates

    Changes:
    1) IMPALA-3121: Don't ignore backend <-> backend timeout errors.
    2) Only collect query profiles if they are really needed. A v-tune profile showed 10% cpu time collecting query profiles. Previously the stress test would unconditionally collect a query profile even if the profile wouldn't be used.
    3) Add CLI options to control runtime info reliability. The work was already done but there were no CLI params (they had to be hard coded). Now --samples and --max-conflicting-samples are available.

    Change-Id: Ia7bd39861162e85b9091a0ef41ed1c317737e66c
    Reviewed-on: http://gerrit.cloudera.org:8080/2504
    Reviewed-by: Casey Ching <casey@cloudera.com>
    Tested-by: Casey Ching <casey@cloudera.com>

Ib1e5322df1d5936ee9679cdc0031772b5f44b135
commit e2f12249802dc8374c0a49b6f49ad76acca4ef82
Author: Dan Hecht <dhecht@cloudera.com>
Date: Tue Mar 8 17:54:14 2016 -0800

    IMPALA-3136: Use gutil SpinLock

    Impala's SpinLock implementation does a busy wait (it yields, but won't block). This is bad because if the lock holder gets delayed (say, due to extra work in malloc), then a bunch of threads can start spinning (stealing cycles from the lock holder) and we can fall off a cliff.

    Instead, let's use gutil's SpinLock implementation. That implementation is adaptive -- it'll spin for a little while but then fall back to waiting on a futex.

    Modify the lock-benchmark to include the no-contention single threaded case, which is an important case (locks should usually be uncontended). (On my machine, the previous SpinLock implementation was actually slower than boost::mutex in this case). The new lock is a bit slower in the contended case, but that's expected since the slow path will be a bit more complicated in order to deal with waiters.

    The added and updated gutil files come from the kudu master branch (044b2c1) with modifications only to the include file paths (remove 'kudu' directory), with one exception: dynamic_annotations.h has symbol definitions that conflict with some llvm headers (magic symbols used by valgrind). Avoid these conflicts with a slight change to dynamic_annotations.h to prefer the llvm way of doing things (weak symbols) when available.

    Change-Id: Ib1e5322df1d5936ee9679cdc0031772b5f44b135
    Reviewed-on: http://gerrit.cloudera.org:8080/2507
    Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
    Reviewed-by: Dan Hecht <dhecht@cloudera.com>
    Tested-by: Internal Jenkins

Ic6603624cb2abc5c8de6f15c9ced73dc3b25a893
commit caf2a72c7f14c9961e55a6384885d51dea95f9d1
Author: Tim Armstrong <tarmstrong@cloudera.com>
Date: Tue Mar 1 08:53:18 2016 -0800

    Remove redundant row copy in DataStreamSender

    RowBatch::CopyRow() does a shallow copy of tuple pointers, which is redundant with the tuple pointer copies in the loop immediately after it.

    Change-Id: Ic6603624cb2abc5c8de6f15c9ced73dc3b25a893
    Reviewed-on: http://gerrit.cloudera.org:8080/2371
    Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
    Tested-by: Internal Jenkins

Ic70400407b7662999332448f4d1bce2cc344ca89
commit 961b3a3a9ad2bc423d61f0ad354d5698111c0434
Author: Michael Ho <kwho@cloudera.com>
Date: Thu Feb 11 15:32:04 2016 -0800

    IMPALA-2399: Check for mem limit in allocations in parquet scanner and decompressor

    This change replaces all calls to MemPool::Allocate() with MemPool::TryAllocate() in the parquet scanner and the decompressor. Also streamline CheckQueryState() to avoid unnecessary spinlock acquisition for the common case when there is no error. Also removes some dead code in the text converter.

    MemPool::Allocate() is also updated to return a valid pointer instead of NULL when the allocation size is zero. NULL is only returned during allocation failure.

    This change also updates CollectionValueBuilder::GetFreeMemory() to return Status in case it exceeds memory limit. As part of the change, the max allocation limit (2 GB) is also removed from it as 64-bit allocations are supported in MemPool with this change.

    Change-Id: Ic70400407b7662999332448f4d1bce2cc344ca89
    Reviewed-on: http://gerrit.cloudera.org:8080/2203
    Reviewed-by: Dan Hecht <dhecht@cloudera.com>
    Tested-by: Internal Jenkins

Ic87f8e6eeae97791c9e3d69355aac45d366a1882
commit fc349985a89cbaad12a81b489caa8d93f4190bc7
Author: Tim Armstrong <tarmstrong@cloudera.com>
Date: Tue Feb 16 13:56:10 2016 -0800

    Changes to allow running stress test against MiniCluster

    Miscellaneous fixes to allow running the binary mem_limit search against a local mini cluster of varying size.

    Change-Id: Ic87f8e6eeae97791c9e3d69355aac45d366a1882
    Reviewed-on: http://gerrit.cloudera.org:8080/2209
    Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
    Tested-by: Internal Jenkins

Icb823dc47a51874b0f8a0b20f966a556752f7796
commit ac430afed4c136e81ff9b5cb698a63766b7db8ee
Author: Casey Ching <casey@cloudera.com>
Date: Wed Feb 3 14:57:25 2016 -0800

    Stress test: Fix stack trace collection

    A lot of stuff got messed up during the switch to the cluster model...

    Changes:
    1) find_crashed_impalads() returned a list but the caller expected a dict.
    2) for_each_impalad() ignored the parameter 'impalads' and instead used all impalads in the cluster.
    3) find_last_backtrace() returned the oldest core dump instead of the newest.
    4) num_successive_errors_needed_to_abort was effectively hard-coded to 2. I'm not sure how that happened.
    5) Catch EOFError when getting a query from the work queue. This happens when the work queue is shutdown but there are workers waiting for an item.
    6) Ignore connection errors due to an unresponsive impalad. When the load on an impalad get very high it randomly stops responding to client requests. Reducing the load seems to help.
    7) Added various log messages.

    Change-Id: Icb823dc47a51874b0f8a0b20f966a556752f7796
    Reviewed-on: http://gerrit.cloudera.org:8080/2176
    Reviewed-by: Casey Ching <casey@cloudera.com>
    Tested-by: Casey Ching <casey@cloudera.com>

Iddfff1feef0b59d407994ad3bc560166acbfa623
commit 9e7d737ba6051d5aa178f81dec0a048af6bee13d
Author: Michael Ho <kwho@cloudera.com>
Date: Fri Feb 26 16:03:54 2016 -0800

    IMPALA-561: Allow multiple callbacks in a thread resource pool.

    Previously, thread resource manager only supports a single callback for each resource pool. The callback is invoked when a thread token is available. This mostly works as scan node is the only consumer and there is usually one scan node in a plan fragment. As shown in IMPALA-3064 and IMPALA-561, it's possible to generate a plan fragment with more than one scan nodes. In which case, one of the scan nodes may be running with single thread and in debug builds, a DCHECK will be hit.

    This change fixes the problem by allowing more than one callbacks in a given resource pool. The thread resource manager will go through all the registered callbacks in round robin fashion. This change also adds a missing thread token release call in HdfsScanNode::ThreadTokenAvailableCb().

    Change-Id: Iddfff1feef0b59d407994ad3bc560166acbfa623
    Reviewed-on: http://gerrit.cloudera.org:8080/2430
    Reviewed-by: Dan Hecht <dhecht@cloudera.com>
    Tested-by: Internal Jenkins

Idff06791e9d880cc8ddf54c0c977a556d3701bea
commit 72971881b69dc71440b12f454b523a76990cd841
Author: Tim Armstrong <tarmstrong@cloudera.com>
Date: Thu Mar 3 14:12:32 2016 -0800

    IMPALA-2728: reenable mem limit test now that it is stable

    The test has not xfailed in a long time, so we believe that various memory usage fixes have fixed the flakiness.

    Change-Id: Idff06791e9d880cc8ddf54c0c977a556d3701bea
    Reviewed-on: http://gerrit.cloudera.org:8080/2442
    Reviewed-by: Dan Hecht <dhecht@cloudera.com>
    Tested-by: Internal Jenkins

Iea04ad2e4b365564ee71082657262485d3a85446
commit d25939759c9a4a3c5cdebaa81990a419c65d6951
Author: Tim Armstrong <tarmstrong@cloudera.com>
Date: Sun Feb 14 16:30:23 2016 -0800

    Reduce size of cross-compiled IR

    Disable the instnamer pass to reduce bitcode file size. Enable inlining with a low threshold to inline many trivial functions in the cross-compiled IR (e.g. scoped_ptr accessor functions). Add additional IR_ALWAYS_INLINE annotations to trivial functions that are called on hot paths. This means that the functions will be inlined at compile time and codegen has to do less work at runtime.

    The IR module size is reduced from ~1.9MB to ~1.4MB with this change and codegen time is slightly reduced. The number of functions is reduced from 5698 to 3883.

    Change-Id: Iea04ad2e4b365564ee71082657262485d3a85446
    Reviewed-on: http://gerrit.cloudera.org:8080/2464
    Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
    Tested-by: Internal Jenkins
On a different node I found:
F0315 07:50:52.022089 21725 llvm-codegen.cc:263] Check failed: codegen->loaded_functions_[FN_MAPPINGS[j].fn] == NULL Duplicate definition found for function : _ZN6impala20ConditionalFunctions10ZeroIfNullEPN10impala_udf15FunctionContextERKNS1_9DoubleValE
F0315 07:50:52.022593 21719 llvm-codegen.cc:263] Check failed: codegen->loaded_functions_[FN_MAPPINGS[j].fn] == NULL Duplicate definition found for function : _ZN6impala13CastFunctions15CastToStringValEPN10impala_udf15FunctionContextERKNS1_9BigIntValE
F0315 07:50:52.029279 21765 llvm-codegen.cc:263] Check failed: codegen->loaded_functions_[FN_MAPPINGS[j].fn] == NULL Duplicate definition found for function : _ZN6impala13SampleValLessIN10impala_udf9BigIntValEEEbRKNS_15ReservoirSampleIT_EES7_
F0315 07:50:52.030951 21624 llvm-codegen.cc:263] Check failed: codegen->loaded_functions_[FN_MAPPINGS[j].fn] == NULL Duplicate definition found for function : _ZN5boost24enable_current_exceptionINS_16exception_detail19error_info_injectorINS_16bad_lexical_castEEEEENS1_10clone_implIT_EERKS6_
F0315 07:50:52.037485 21764 llvm-codegen.cc:263] Check failed: codegen->loaded_functions_[FN_MAPPINGS[j].fn] == NULL Duplicate definition found for function : _ZN6impala13MathFunctions5Log10EPN10impala_udf15FunctionContextERKNS1_9DoubleValE
F0315 07:50:52.042593 21779 llvm-codegen.cc:263] Check failed: codegen->loaded_functions_[FN_MAPPINGS[j].fn] == NULL Duplicate definition found for function : _ZN6impala18AggregateFunctions21FirstValRewriteUpdateIN10impala_udf12TimestampValEEEvPNS2_15FunctionContextERKT_RKNS2_9BigIntValEPS6_
F0315 07:50:52.046499 21776 llvm-codegen.cc:263] Check failed: codegen->loaded_functions_[FN_MAPPINGS[j].fn] == NULL Duplicate definition found for function : _ZStlsIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_St8_SetfillIS3_E
F0315 07:50:52.049159 21726 llvm-codegen.cc:263] Check failed: codegen->loaded_functions_[FN_MAPPINGS[j].fn] == NULL Duplicate definition found for function : _ZN6impala11InPredicate24SetLookupPrepare_booleanEPN10impala_udf15FunctionContextENS2_18FunctionStateScopeE
F0315 07:50:52.107755 21812 llvm-codegen.cc:263] Check failed: codegen->loaded_functions_[FN_MAPPINGS[j].fn] == NULL Duplicate definition found for function : _ZNK10impala_udf9StringValeqERKS0_
F0315 07:50:52.164721 21721 llvm-codegen.cc:263] Check failed: codegen->loaded_functions_[FN_MAPPINGS[j].fn] == NULL Duplicate definition found for function : _ZN5boost6detail17sp_counted_impl_pINS_10local_time21custom_time_zone_baseIcEEE7disposeEv
F0315 07:50:52.174499 21715 llvm-codegen.cc:263] Check failed: codegen->loaded_functions_[FN_MAPPINGS[j].fn] == NULL Duplicate definition found for function : _ZN5boost9date_time17nth_kday_of_monthINS_9gregorian4dateEED1Ev
The crashes occurred on both debug and release builds.
I'm copying the cores and logs over to impala-desktop.
dhecht, if this issue is not tracking a specific crash, can you please remove the "crash" label?