Details
Type: Sub-task
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.3.4
Fix Version/s: None
Component/s: None
Description
I tried to connect from PySpark to Minio running in Docker.
Installing PySpark and starting Minio:
pip install pyspark==3.4.1

docker run --rm -d --hostname minio --name minio \
    -p 9000:9000 -p 9001:9001 \
    -e MINIO_ACCESS_KEY=access \
    -e MINIO_SECRET_KEY=Eevoh2wo0ui6ech0wu8oy3feiR3eicha \
    -e MINIO_ROOT_USER=admin \
    -e MINIO_ROOT_PASSWORD=iepaegaigi3ofa9TaephieSo1iecaesh \
    bitnami/minio:latest

docker exec minio mc mb test-bucket
Then create a Spark session:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
    .config("spark.hadoop.fs.s3a.endpoint", "localhost:9000") \
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "true") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.access.key", "access") \
    .config("spark.hadoop.fs.s3a.secret.key", "Eevoh2wo0ui6ech0wu8oy3feiR3eicha") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .getOrCreate()

spark.sparkContext.setLogLevel("debug")
And try to access some object in a bucket:
import time

begin = time.perf_counter()
spark.read.format("csv").load("s3a://test-bucket/fake")
end = time.perf_counter()

py4j.protocol.Py4JJavaError: An error occurred while calling o40.load.
: org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on s3a://test-bucket/fake: com.amazonaws.SdkClientException: Unable to execute HTTP request: Unsupported or unrecognized SSL message: Unable to execute HTTP request: Unsupported or unrecognized SSL message
...

>>> print((end-begin)/60, "min")
14.72387898775002 min
I waited almost 15 minutes to get the exception from Spark. The reason was that I tried to connect to the endpoint with fs.s3a.connection.ssl.enabled=true, while Minio is configured to listen for HTTP only.
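For context, a client configuration that matches this setup should presumably only need fs.s3a.connection.ssl.enabled switched to "false", since the Minio container above listens on plain HTTP. A minimal sketch, reusing the same settings as the session above:

from pyspark.sql import SparkSession

# Same builder as before, but with SSL disabled to match Minio's HTTP-only listener
spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
    .config("spark.hadoop.fs.s3a.endpoint", "localhost:9000") \
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.access.key", "access") \
    .config("spark.hadoop.fs.s3a.secret.key", "Eevoh2wo0ui6ech0wu8oy3feiR3eicha") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .getOrCreate()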
Is there any way to raise an exception immediately if the SSL connection cannot be established?
If I pass a wrong endpoint, like localhos:9000, I get an exception like this in just 5 seconds:
: org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on s3a://test-bucket/fake: com.amazonaws.SdkClientException: Unable to execute HTTP request: test-bucket.localhos: Unable to execute HTTP request: test-bucket.localhos
...
>>> print(end-begin, "sec")
5.700424307000503 sec
I know about options like fs.s3a.attempts.maximum and fs.s3a.retry.limit; setting them to 1 makes the exception show up almost immediately, but that does not look like the right solution.
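For reference, this is the kind of workaround I mean; a sketch with the retry options forced down to 1 so the failure surfaces right away, at the cost of also disabling retries for genuinely transient errors:

from pyspark.sql import SparkSession

# Workaround sketch: fail fast by minimizing S3A retries.
# The SSL mismatch then surfaces immediately, but transient network
# errors are no longer retried either, so this is not a real fix.
spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
    .config("spark.hadoop.fs.s3a.endpoint", "localhost:9000") \
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "true") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.access.key", "access") \
    .config("spark.hadoop.fs.s3a.secret.key", "Eevoh2wo0ui6ech0wu8oy3feiR3eicha") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.attempts.maximum", "1") \
    .config("spark.hadoop.fs.s3a.retry.limit", "1") \
    .getOrCreate()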