Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Invalid
Affects Version/s: 3.3.1
Fix Version/s: None
Component/s: None
From top to bottom:
Kubernetes version: 1.18.20
Spark version: 2.4.4
Kubernetes setup: pod running under a serviceAccountName that is bound to an IAM role via IRSA (EKS feature).
apiVersion: v1
automountServiceAccountToken: true
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::999999999999:role/EKSDefaultPolicyFor-Spark
  name: spark
  namespace: spark
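The report does not include the actual submission command; for completeness, a minimal sketch of how the driver pod is typically bound to this service account when submitting with spark-submit on Kubernetes (the master URL and container image below are placeholders, not taken from this report):

spark-submit \
  --master k8s://https://<EKS_API_SERVER> \
  --deploy-mode cluster \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=<spark-py-image> \
  ...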
AWS Setup:
IAM role with permissions on the S3 bucket (see the trust-policy sketch below).
S3 bucket with a bucket policy granting access to that IAM role.
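For context, IRSA also requires the IAM role's trust policy to allow the Kubernetes service account to assume it via sts:AssumeRoleWithWebIdentity through the cluster's OIDC provider. A minimal sketch of what that trust policy usually looks like (the OIDC provider ID is a placeholder; only the account ID, role, region, namespace and service account name come from this report):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::999999999999:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/<OIDC_ID>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/<OIDC_ID>:sub": "system:serviceaccount:spark:spark"
        }
      }
    }
  ]
}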
Code:
import sys
import json

import boto3
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import lit


def run_etl():
    sc = SparkSession.builder.appName("TXD-PYSPARK-ORACLE-SIEBEL-CASOS").getOrCreate()
    sqlContext = SQLContext(sc)

    args = sys.argv
    load_date = args[1]    # e.g. "2019-05-21"
    output_path = args[2]  # e.g. s3://mybucket/myfolder
    print(args, "load_date", load_date, "output_path", output_path)

    sc._jsc.hadoopConfiguration().set(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.DefaultAWSCredentialsProviderChain"
    )
    sc._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
    sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    # sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")

    session = boto3.session.Session()
    client = session.client(service_name='secretsmanager', region_name="us-east-1")
    get_secret_value_response = client.get_secret_value(SecretId="Siebel_Connection_Info")
    secret = get_secret_value_response["SecretString"]
    secret = json.loads(secret)

    db_username = secret.get("db_username")
    db_password = secret.get("db_password")
    db_host = secret.get("db_host")
    db_port = secret.get("db_port")
    db_name = secret.get("db_name")
    db_url = "jdbc:oracle:thin:@{}:{}/{}".format(db_host, db_port, db_name)
    jdbc_driver_name = "oracle.jdbc.OracleDriver"

    dbtable = """(SELECT * FROM SIEBEL.REPORTE_DE_CASOS
                  WHERE JOB_ID IN (SELECT JOB_ID FROM SIEBEL.SERVICE_CONSUMED_STATUS
                                   WHERE PUBLISH_INFORMATION_DT
                                         BETWEEN TO_DATE('{} 00:00:00', 'YYYY-MM-DD HH24:MI:SS')
                                             AND TO_DATE('{} 23:59:59', 'YYYY-MM-DD HH24:MI:SS')))""".format(load_date, load_date)

    df = sqlContext.read\
        .format("jdbc")\
        .option("charset", "utf8")\
        .option("driver", jdbc_driver_name)\
        .option("url", db_url)\
        .option("dbtable", dbtable)\
        .option("user", db_username)\
        .option("password", db_password)\
        .option("oracle.jdbc.timezoneAsRegion", "false")\
        .load()

    # Partitioning
    a_load_date = load_date.split('-')
    df = df.withColumn("year", lit(a_load_date[0]))
    df = df.withColumn("month", lit(a_load_date[1]))
    df = df.withColumn("day", lit(a_load_date[2]))
    df.write.mode("append").partitionBy(["year", "month", "day"]).csv(output_path, header=True)

    # It is important to close the connection to avoid problems like the one reported at
    # https://stackoverflow.com/questions/40830638/cannot-load-main-class-from-jar-file
    sc.stop()


if __name__ == '__main__':
    run_etl()
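One observation on the credentials path: the boto3 call to Secrets Manager resolves the IRSA web-identity credentials on its own (the traceback below shows the job only fails later, at the S3 write), while the S3A write goes through the Java SDK provider configured above. Purely as an illustration, and not something attempted in this report, the S3A connector could also be pointed explicitly at the SDK's web-identity provider, which should be present in aws-java-sdk-bundle 1.11.901:

    # Hypothetical variant (not from the original job): use the web-identity (IRSA)
    # credentials provider directly instead of the default provider chain.
    sc._jsc.hadoopConfiguration().set(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
    )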
Logs:
+ '[' -z s3://mybucket.spark.jobs/siebel-casos-actividades ']'
+ aws s3 cp s3://mybucket.spark.jobs/siebel-casos-actividades /opt/ --recursive --include '*'
download: s3://mybucket.spark.jobs/siebel-casos-actividades/txd-pyspark-siebel-casos.py to ../../txd-pyspark-siebel-casos.py
download: s3://mybucket.spark.jobs/siebel-casos-actividades/txd-pyspark-siebel-actividades.py to ../../txd-pyspark-siebel-actividades.py
download: s3://mybucket.jobs/siebel-casos-actividades/hadoop-aws-3.3.1.jar to ../../hadoop-aws-3.3.1.jar
download: s3://mybucket.spark.jobs/siebel-casos-actividades/ojdbc8.jar to ../../ojdbc8.jar
download: s3://mybucket.spark.jobs/siebel-casos-actividades/aws-java-sdk-bundle-1.11.901.jar to ../../aws-java-sdk-bundle-1.11.901.jar
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ SPARK_K8S_CMD=driver-py
+ case "$SPARK_K8S_CMD" in
+ shift 1
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ PYSPARK_ARGS=
+ '[' -n '2021-12-18 s3a://mybucket.raw/siebel/casos/' ']'
+ PYSPARK_ARGS='2021-12-18 s3a://mybucket.raw/siebel/casos/'
+ R_ARGS=
+ '[' -n '' ']'
+ '[' 3 == 2 ']'
+ '[' 3 == 3 ']'
++ python3 -V
+ pyv3='Python 3.6.9'
+ export PYTHON_VERSION=3.6.9
+ PYTHON_VERSION=3.6.9
+ export PYSPARK_PYTHON=python3
+ PYSPARK_PYTHON=python3
+ export PYSPARK_DRIVER_PYTHON=python3
+ PYSPARK_DRIVER_PYTHON=python3
+ case "$SPARK_K8S_CMD" in
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@" $PYSPARK_PRIMARY $PYSPARK_ARGS)
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=0.0.0.0/0 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner file:/opt/txd-pyspark-siebel-casos.py 2021-12-18 s3a://mybucket.raw/siebel/casos/
21/12/21 18:37:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/12/21 18:37:45 INFO SparkContext: Running Spark version 2.4.4
21/12/21 18:37:45 INFO SparkContext: Submitted application: TXD-PYSPARK-ORACLE-SIEBEL-CASOS
21/12/21 18:37:45 INFO SecurityManager: Changing view acls to: root
21/12/21 18:37:45 INFO SecurityManager: Changing modify acls to: root
21/12/21 18:37:45 INFO SecurityManager: Changing view acls groups to:
21/12/21 18:37:45 INFO SecurityManager: Changing modify acls groups to:
21/12/21 18:37:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
21/12/21 18:37:45 INFO Utils: Successfully started service 'sparkDriver' on port 7078.
21/12/21 18:37:45 INFO SparkEnv: Registering MapOutputTracker
21/12/21 18:37:45 INFO SparkEnv: Registering BlockManagerMaster
21/12/21 18:37:45 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/12/21 18:37:45 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/12/21 18:37:45 INFO DiskBlockManager: Created local directory at /var/data/spark-458585a1-50f9-45c6-a4cf-d552c04a97dc/blockmgr-6c240735-3731-487a-a592-5c9a4d687020
21/12/21 18:37:45 INFO MemoryStore: MemoryStore started with capacity 413.9 MB
21/12/21 18:37:45 INFO SparkEnv: Registering OutputCommitCoordinator
21/12/21 18:37:46 INFO Utils: Successfully started service 'SparkUI' on port 4040.
21/12/21 18:37:46 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://spark-siebel-casos-1640111855179-driver-svc.spark.svc:4040
21/12/21 18:37:46 INFO SparkContext: Added JAR file:///opt/ojdbc8.jar at spark://spark-siebel-casos-1640111855179-driver-svc.spark.svc:7078/jars/ojdbc8.jar with timestamp 1640111866249
21/12/21 18:37:46 INFO SparkContext: Added JAR file:///opt/aws-java-sdk-bundle-1.11.901.jar at spark://spark-siebel-casos-1640111855179-driver-svc.spark.svc:7078/jars/aws-java-sdk-bundle-1.11.901.jar with timestamp 1640111866249
21/12/21 18:37:46 INFO SparkContext: Added JAR file:///opt/hadoop-aws-3.3.1.jar at spark://spark-siebel-casos-1640111855179-driver-svc.spark.svc:7078/jars/hadoop-aws-3.3.1.jar with timestamp 1640111866249
21/12/21 18:37:46 INFO SparkContext: Added file file:///opt/txd-pyspark-siebel-casos.py at spark://spark-siebel-casos-1640111855179-driver-svc.spark.svc:7078/files/txd-pyspark-siebel-casos.py with timestamp 1640111866266
21/12/21 18:37:46 INFO Utils: Copying /opt/txd-pyspark-siebel-casos.py to /var/data/spark-458585a1-50f9-45c6-a4cf-d552c04a97dc/spark-f99cee68-d203-4a2a-8335-9743eeac5350/userFiles-32cbe539-22db-4547-8d22-d98f85354418/txd-pyspark-siebel-casos.py
21/12/21 18:37:48 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
21/12/21 18:37:48 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
21/12/21 18:37:48 INFO NettyBlockTransferService: Server created on spark-siebel-casos-1640111855179-driver-svc.spark.svc:7079
21/12/21 18:37:48 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/12/21 18:37:48 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, spark-siebel-casos-1640111855179-driver-svc.spark.svc, 7079, None)
21/12/21 18:37:48 INFO BlockManagerMasterEndpoint: Registering block manager spark-siebel-casos-1640111855179-driver-svc.spark.svc:7079 with 413.9 MB RAM, BlockManagerId(driver, spark-siebel-casos-1640111855179-driver-svc.spark.svc, 7079, None)
21/12/21 18:37:48 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, spark-siebel-casos-1640111855179-driver-svc.spark.svc, 7079, None)
21/12/21 18:37:48 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, spark-siebel-casos-1640111855179-driver-svc.spark.svc, 7079, None)
21/12/21 18:37:53 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.3.170.156:58300) with ID 1
21/12/21 18:37:53 INFO BlockManagerMasterEndpoint: Registering block manager 10.3.170.156:34671 with 413.9 MB RAM, BlockManagerId(1, 10.3.170.156, 34671, None)
21/12/21 18:37:54 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.3.170.184:52960) with ID 2
21/12/21 18:37:54 INFO KubernetesClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
21/12/21 18:37:54 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark/work-dir/spark-warehouse').
21/12/21 18:37:54 INFO SharedState: Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.
21/12/21 18:37:54 INFO BlockManagerMasterEndpoint: Registering block manager 10.3.170.184:46293 with 413.9 MB RAM, BlockManagerId(2, 10.3.170.184, 46293, None)
21/12/21 18:37:54 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
['/opt/txd-pyspark-siebel-casos.py', '2021-12-18', 's3a://mybucket.raw/siebel/casos/'] load_date 2021-12-18 output_path s3a://mybucket.raw/siebel/casos/
Traceback (most recent call last):
  File "/opt/txd-pyspark-siebel-casos.py", line 68, in <module>
    run_etl()
  File "/opt/txd-pyspark-siebel-casos.py", line 60, in run_etl
    df.write.mode("append").partitionBy(["year", "month", "day"]).csv(output_path, header=True)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 931, in csv
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o83.csv.
: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
  at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
  at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
  at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
  at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
  at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
  at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:664)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  at py4j.Gateway.invoke(Gateway.java:282)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.GatewayConnection.run(GatewayConnection.java:238)
  at java.lang.Thread.run(Thread.java:748)
21/12/21 18:38:00 INFO SparkContext: Invoking stop() from shutdown hook
21/12/21 18:38:00 INFO SparkUI: Stopped Spark web UI at http://spark-siebel-casos-1640111855179-driver-svc.spark.svc:4040
21/12/21 18:38:00 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
21/12/21 18:38:00 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
21/12/21 18:38:00 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
21/12/21 18:38:01 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/12/21 18:38:01 INFO MemoryStore: MemoryStore cleared
21/12/21 18:38:01 INFO BlockManager: BlockManager stopped
21/12/21 18:38:01 INFO BlockManagerMaster: BlockManagerMaster stopped
21/12/21 18:38:01 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/12/21 18:38:01 INFO SparkContext: Successfully stopped SparkContext
21/12/21 18:38:01 INFO ShutdownHookManager: Shutdown hook called
21/12/21 18:38:01 INFO ShutdownHookManager: Deleting directory /var/data/spark-458585a1-50f9-45c6-a4cf-d552c04a97dc/spark-f99cee68-d203-4a2a-8335-9743eeac5350
21/12/21 18:38:01 INFO ShutdownHookManager: Deleting directory /tmp/spark-245a2532-04c0-4309-9f43-00fbca06d435
21/12/21 18:38:01 INFO ShutdownHookManager: Deleting directory /var/data/spark-458585a1-50f9-45c6-a4cf-d552c04a97dc/spark-f99cee68-d203-4a2a-8335-9743eeac5350/pyspark-83a9a9c2-0b44-4bcf-8d86-6fb388ba275e
Pod describe:
Containers:
  spark-kubernetes-driver:
    Container ID:  docker://3606841142e7dc76f3a5b29f7df87da8159a0d0c53897f96444670a04134a2ff
    Image:         registry.example.com/myorg/ata/spark/spark-k8s/spark-py:2.4.4
    Image ID:      docker-pullable://registry.example.com/myorg/ata/spark/spark-k8s/spark-py@sha256:744eae637693e0c6f2195ed1e4e2bab9def5b9c7507518c5d4b61b7933c63e10
    Ports:         7078/TCP, 7079/TCP, 4040/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      driver-py
      --properties-file
      /opt/spark/conf/spark.properties
      --class
      org.apache.spark.deploy.PythonRunner
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 21 Dec 2021 12:44:59 -0300
      Finished:     Tue, 21 Dec 2021 12:45:29 -0300
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1433Mi
    Requests:
      cpu:     1
      memory:  1433Mi
    Environment:
      JOB_PATH:                      s3://mybucket.spark.jobs/siebel-casos-actividades
      SPARK_DRIVER_BIND_ADDRESS:     (v1:status.podIP)
      SPARK_LOCAL_DIRS:              /var/data/spark-9ff6233c-1660-4be4-94b7-2e961412f958
      PYSPARK_PRIMARY:               file:/opt/txd-pyspark-siebel-casos.py
      PYSPARK_MAJOR_PYTHON_VERSION:  3
      PYSPARK_APP_ARGS:              2021-12-18 s3a://mybucket.raw/siebel/casos/
      PYSPARK_FILES:
      SPARK_CONF_DIR:                /opt/spark/conf
      AWS_DEFAULT_REGION:            us-east-1
      AWS_REGION:                    us-east-1
      AWS_ROLE_ARN:                  arn:aws:iam::999999999999:role/EKSDefaultPolicyFor-Spark
      AWS_WEB_IDENTITY_TOKEN_FILE:   /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /opt/spark/conf from spark-conf-volume (rw)
      /var/data/spark-9ff6233c-1660-4be4-94b7-2e961412f958 from spark-local-dir-1 (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from spark-token-r6p46 (ro)
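The AWS_ROLE_ARN / AWS_WEB_IDENTITY_TOKEN_FILE variables and the aws-iam-token mount confirm that the IRSA webhook injected the web-identity configuration into the driver pod. A quick sanity check (suggested here, not part of the original report) is to verify that those credentials actually resolve inside a running driver pod; the image already ships the AWS CLI, as the entrypoint log shows:

kubectl -n spark exec -it <driver-pod-name> -- aws sts get-caller-identity

If this returns the EKSDefaultPolicyFor-Spark role, the IRSA side works for the CLI and the problem is confined to how the Java SDK / S3A path picks up the credentials.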
Classpath:
21/12/21 18:37:46 INFO SparkContext: Added JAR file:///opt/ojdbc8.jar at spark://spark-siebel-casos-1640111855179-driver-svc.spark.svc:7078/jars/ojdbc8.jar with timestamp 1640111866249
21/12/21 18:37:46 INFO SparkContext: Added JAR file:///opt/aws-java-sdk-bundle-1.11.901.jar at spark://spark-siebel-casos-1640111855179-driver-svc.spark.svc:7078/jars/aws-java-sdk-bundle-1.11.901.jar with timestamp 1640111866249
21/12/21 18:37:46 INFO SparkContext: Added JAR file:///opt/hadoop-aws-3.3.1.jar at spark://spark-siebel-casos-1640111855179-driver-svc.spark.svc:7078/jars/hadoop-aws-3.3.1.jar with timestamp 1640111866249
Labels: aws
Description
Hello everybody, please help with this issue. I have a job running with Spark on Kubernetes (AWS EKS) and I get this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o83.csv.
: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)