Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.2.0
Fix Version/s: None
Environment:
Physical lab configuration: 8 bare-metal servers, each with 56 cores, 384 GB RAM, RHEL 7.4.
Kernel: 3.10.0-862.9.1.el7.x86_64
redhat-release-server.x86_64 7.4-18.el7
Kubernetes info:
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:10:24Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Description
Launched the Spark Thrift server in a Kubernetes cluster with dynamic allocation enabled.
Configurations set:
spark.executor.memory=35g
spark.executor.cores=8
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.executorIdleTimeout=10
spark.dynamicAllocation.cachedExecutorIdleTimeout=15
spark.driver.memory=10g
spark.driver.cores=4
spark.sql.crossJoin.enabled=true
spark.sql.starJoinOptimization=true
spark.sql.codegen=true
spark.rpc.numRetries=5
spark.rpc.retry.wait=5
spark.sql.broadcastTimeout=1200
spark.network.timeout=1800
spark.dynamicAllocation.maxExecutors=15
spark.kubernetes.allocation.batch.size=2
spark.kubernetes.allocation.batch.delay=9
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kubernetes.node.selector.is_control=false
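For reference, a minimal sketch of how a configuration like this might be passed when launching the Thrift server; the master URL, image, and namespace below are placeholders and not taken from this report:

```shell
# Hypothetical launch command; the k8s API server address, container image,
# and namespace are placeholders. Only a subset of the properties above is shown.
./sbin/start-thriftserver.sh \
  --master k8s://https://<api-server>:6443 \
  --conf spark.kubernetes.namespace=<namespace> \
  --conf spark.kubernetes.container.image=<spark-image> \
  --conf spark.executor.memory=35g \
  --conf spark.executor.cores=8 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.executorIdleTimeout=10 \
  --conf spark.dynamicAllocation.maxExecutors=15 \
  --conf spark.kubernetes.allocation.batch.size=2 \
  --conf spark.kubernetes.allocation.batch.delay=9
```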
Ran TPC-DS queries against 1 TB of Snappy-compressed Parquet data.
Found that as the execution progressed, all tasks were handled by a single executor (executor 53) and no new executors were spawned, even though there were enough resources available to spawn more.
Manually deleted executor pod 53 and observed that no new executor was spawned to replace it.
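The symptom above can be reproduced and observed with standard kubectl commands; the namespace and pod name below are illustrative, not taken from this report:

```shell
# List executor pods in the namespace the Thrift server runs in
# (namespace "spark" is a placeholder)
kubectl get pods -n spark

# Delete one executor pod to see whether the driver requests a replacement
kubectl delete pod <executor-pod-name> -n spark

# Watch whether any new executor pod is subsequently created
kubectl get pods -n spark --watch
```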
Attached the