Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Works for Me
- Affects Version/s: 3.2.0
- Fix Version/s: None
- Component/s: None
Description
I am trying to read parquet data stored in S3 via Spark on EKS, using hadoop-aws 3.2.0. There are 112 partitions (each around 130 MB) for a particular month.
The data is being read, but very, very slowly. I just keep seeing the log lines below, while only a very small amount of data actually gets fetched.
21/08/09 05:07:05 DEBUG Executor task launch worker for task 60.0 in stage 3.0 (TID 63) Invoker: Values passed - text: read on s3a://uat1-prp-rftu-25-045552507264-us-east-1/xxxx/yyyy/zzzz/table_fact_mtd_c/ptn_val_txt=20200229/part-00012-32dbfb10-b43c-4066-a70e-d3575ea530d5-c000.snappy.parquet, idempotent: true, Retried: org.apache.hadoop.fs.s3a.S3AFileSystem$$Lambda$1199/2130521693@5259f9d0, Operation:org.apache.hadoop.fs.s3a.Invoker$$Lambda$1239/37396157@454de3d3
21/08/09 05:07:05 DEBUG Executor task launch worker for task 60.0 in stage 3.0 (TID 63) Invoker: retryUntranslated begin
21/08/09 05:07:05 DEBUG Executor task launch worker for task 60.0 in stage 3.0 (TID 63) Invoker: Values passed - text: lazySeek on s3a://uat1-prp-rftu-25-045552507264-us-east-1/xxxx/yyyy/zzzz/table_fact_mtd_c/ptn_val_txt=20200229/part-00012-32dbfb10-b43c-4066-a70e-d3575ea530d5-c000.snappy.parquet, idempotent: true, Retried: org.apache.hadoop.fs.s3a.S3AFileSystem$$Lambda$1199/2130521693@5259f9d0, Operation:org.apache.hadoop.fs.s3a.Invoker$$Lambda$1239/37396157@3776ef6c
21/08/09 05:07:05 DEBUG Executor task launch worker for task 60.0 in stage 3.0 (TID 63) Invoker: retryUntranslated begin
21/08/09 05:07:05 DEBUG Executor task launch worker for task 60.0 in stage 3.0 (TID 63) Invoker: Values passed - text: read on s3a://uat1-prp-rftu-25-045552507264-us-east-1/xxxx/yyyy/zzzz/table_fact_mtd_c/ptn_val_txt=20200229/part-00012-32dbfb10-b43c-4066-a70e-d3575ea530d5-c000.snappy.parquet, idempotent: true, Retried: org.apache.hadoop.fs.s3a.S3AFileSystem$$Lambda$1199/2130521693@5259f9d0, Operation:org.apache.hadoop.fs.s3a.Invoker$$Lambda$1239/37396157@3602676a
21/08/09 05:07:05 DEBUG Executor task launch worker for task 60.0 in stage 3.0 (TID 63) Invoker: retryUntranslated begin
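For context, the read is essentially the following (bucket and table path copied from the DEBUG log above; the partition filter value is illustrative, and `spark` is an existing SparkSession):
import org.apache.spark.sql.DataFrame
// Read the partitioned parquet table from S3 through the s3a:// connector.
val df: DataFrame = spark.read.parquet(
  "s3a://uat1-prp-rftu-25-045552507264-us-east-1/xxxx/yyyy/zzzz/table_fact_mtd_c/")
// Restrict to one month; this resolves to the 112 partition files (~130 MB each) mentioned above.
val monthDf = df.filter(df("ptn_val_txt") === "20200229")
monthDf.count()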
Here is the Spark configuration for hadoop-aws:
spark.hadoop.fs.s3a.assumed.role.sts.endpoint: https://sts.amazonaws.com |
spark.hadoop.fs.s3a.assumed.role.sts.endpoint.region: us-east-1 |
spark.hadoop.fs.s3a.attempts.maximum: 20 |
spark.hadoop.fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider |
spark.hadoop.fs.s3a.block.size: 128M |
spark.hadoop.fs.s3a.connection.establish.timeout: 50000 |
spark.hadoop.fs.s3a.connection.maximum: 50 |
spark.hadoop.fs.s3a.connection.ssl.enabled: true |
spark.hadoop.fs.s3a.connection.timeout: 2000000 |
spark.hadoop.fs.s3a.endpoint: s3.us-east-1.amazonaws.com |
spark.hadoop.fs.s3a.etag.checksum.enabled: false |
spark.hadoop.fs.s3a.experimental.input.fadvise: normal |
spark.hadoop.fs.s3a.fast.buffer.size: 1048576 |
spark.hadoop.fs.s3a.fast.upload: true |
spark.hadoop.fs.s3a.fast.upload.active.blocks: 8 |
spark.hadoop.fs.s3a.fast.upload.buffer: bytebuffer |
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem |
spark.hadoop.fs.s3a.list.version: 2 |
spark.hadoop.fs.s3a.max.total.tasks: 30 |
spark.hadoop.fs.s3a.metadatastore.authoritative: false |
spark.hadoop.fs.s3a.metadatastore.impl: org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore |
spark.hadoop.fs.s3a.multiobjectdelete.enable: true |
spark.hadoop.fs.s3a.multipart.purge: true |
spark.hadoop.fs.s3a.multipart.purge.age: 86400 |
spark.hadoop.fs.s3a.multipart.size: 32M |
spark.hadoop.fs.s3a.multipart.threshold: 64M |
spark.hadoop.fs.s3a.paging.maximum: 5000 |
spark.hadoop.fs.s3a.readahead.range: 65536 |
spark.hadoop.fs.s3a.retry.interval: 500ms |
spark.hadoop.fs.s3a.retry.limit: 20 |
spark.hadoop.fs.s3a.retry.throttle.interval: 500ms |
spark.hadoop.fs.s3a.retry.throttle.limit: 20 |
spark.hadoop.fs.s3a.s3.client.factory.impl: org.apache.hadoop.fs.s3a.DefaultS3ClientFactory |
spark.hadoop.fs.s3a.s3guard.ddb.background.sleep: 25 |
spark.hadoop.fs.s3a.s3guard.ddb.max.retries: 20 |
spark.hadoop.fs.s3a.s3guard.ddb.region: us-east-1 |
spark.hadoop.fs.s3a.s3guard.ddb.table: s3-data-guard-master |
spark.hadoop.fs.s3a.s3guard.ddb.table.capacity.read: 500 |
spark.hadoop.fs.s3a.s3guard.ddb.table.capacity.write: 100 |
spark.hadoop.fs.s3a.s3guard.ddb.table.create: true |
spark.hadoop.fs.s3a.s3guard.ddb.throttle.retry.interval: 1s |
spark.hadoop.fs.s3a.socket.recv.buffer: 8388608 |
spark.hadoop.fs.s3a.socket.send.buffer: 8388608 |
spark.hadoop.fs.s3a.threads.keepalivetime: 60 |
spark.hadoop.fs.s3a.threads.max: 50 |
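For reference, all of these spark.hadoop.fs.s3a.* properties are passed straight through to the S3A filesystem. In this job they are supplied via the Spark configuration shown above; a minimal sketch of setting a few of them programmatically instead (values copied from the list, for illustration only):
import org.apache.spark.sql.SparkSession
// Illustrative only: the same fs.s3a.* settings applied on the SparkSession builder.
val spark = SparkSession.builder()
  .appName("s3a-read-example")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.endpoint", "s3.us-east-1.amazonaws.com")
  .config("spark.hadoop.fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "normal") // read policy in effect for these reads
  .config("spark.hadoop.fs.s3a.readahead.range", "65536")
  .getOrCreate()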
Not sure if you need it, but here is the rest of the Spark configuration as well:
spark.app.id: spark-b97cb651f3f14c6cb3197079376a74c7 |
spark.app.startTime: 1628476986471 |
spark.blockManager.port: 0 |
spark.broadcast.compress: true |
spark.checkpoint.compress: true |
spark.cleaner.periodicGC.interval: 2min |
spark.cleaner.referenceTracking: true |
spark.cleaner.referenceTracking.blocking: true |
spark.cleaner.referenceTracking.blocking.shuffle: true |
spark.cleaner.referenceTracking.cleanCheckpoints: true |
spark.cores.max: 5 |
spark.driver.bindAddress: 28.132.124.86 |
spark.driver.blockManager.port: 0 |
spark.driver.cores: 5 |
spark.driver.extraJavaOptions: -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p' |
spark.driver.host: xxx-xxxx-xxx-8be6777b28caacc7-driver-svc.default.svc |
spark.driver.maxResultSize: 10008m |
spark.driver.memory: 10008m |
spark.driver.memoryOverhead: 384m |
spark.driver.port: 7078 |
spark.driver.rpc.io.clientThreads: 5 |
spark.driver.rpc.io.serverThreads: 5 |
spark.driver.rpc.netty.dispatcher.numThreads: 5 |
spark.driver.shuffle.io.clientThreads: 5 |
spark.driver.shuffle.io.serverThreads: 5 |
spark.dynamicAllocation.cachedExecutorIdleTimeout: 600s |
spark.dynamicAllocation.enabled: false |
spark.dynamicAllocation.executorAllocationRatio: 1.0 |
spark.dynamicAllocation.executorIdleTimeout: 60s |
spark.dynamicAllocation.initialExecutors: 1 |
spark.dynamicAllocation.maxExecutors: 2147483647 |
spark.dynamicAllocation.minExecutors: 1 |
spark.dynamicAllocation.schedulerBacklogTimeout: 1s |
spark.dynamicAllocation.shuffleTracking.enabled: true |
spark.dynamicAllocation.shuffleTracking.timeout: 600s |
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout: 1s |
spark.eventLog.dir: /opt/efs/spark |
spark.eventLog.enabled: true |
spark.eventLog.logStageExecutorMetrics: false |
spark.excludeOnFailure.enabled: true |
spark.executor.cores: 5 |
spark.executor.extraJavaOptions: -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p' |
spark.executor.id: driver |
spark.executor.instances: 22 |
spark.executor.logs.rolling.enableCompression: false |
spark.executor.logs.rolling.maxRetainedFiles: 5 |
spark.executor.logs.rolling.maxSize: 10m |
spark.executor.logs.rolling.strategy: size |
spark.executor.memory: 10008m |
spark.executor.memoryOverhead: 384m |
spark.executor.processTreeMetrics.enabled: false |
spark.executor.rpc.io.clientThreads: 5 |
spark.executor.rpc.io.serverThreads: 5 |
spark.executor.rpc.netty.dispatcher.numThreads: 5 |
spark.executor.shuffle.io.clientThreads: 5 |
spark.executor.shuffle.io.serverThreads: 5 |
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: 2 |
spark.history.fs.driverlog.cleaner.enabled: true |
spark.history.fs.driverlog.cleaner.maxAge: 2d |
spark.history.fs.logDirectory: /opt/efs/spark |
spark.history.ui.port: 4040 |
spark.io.compression.codec: org.apache.spark.io.SnappyCompressionCodec |
spark.io.compression.snappy.blockSize: 32k |
spark.jars: local:///opt/spark/examples/xxx.jar,local:///opt/spark/examples/yyy.jar |
spark.kryo.referenceTracking: false |
spark.kryo.registrationRequired: false |
spark.kryo.unsafe: true |
spark.kryoserializer.buffer: 8m |
spark.kryoserializer.buffer.max: 1024m |
spark.kubernetes.allocation.batch.delay: 1s |
spark.kubernetes.allocation.batch.size: 5 |
spark.kubernetes.allocation.executor.timeout: 600s |
spark.kubernetes.appKillPodDeletionGracePeriod: 5s |
spark.kubernetes.authenticate.driver.serviceAccountName: spark |
spark.kubernetes.configMap.maxSize: 1572864 |
spark.kubernetes.container.image: xxx/xxx:latest |
spark.kubernetes.container.image.pullPolicy: Always |
spark.kubernetes.driver.connectionTimeout: 10000 |
spark.kubernetes.driver.limit.cores: 8 |
spark.kubernetes.driver.master: https://asdkadalksjdas.gr7.us-east-1.eks.amazonaws.com:443 |
spark.kubernetes.driver.pod.name: xxx-ddd-rrrr-8be6777b28caacc7-driver |
spark.kubernetes.driver.request.cores: 5 |
spark.kubernetes.driver.requestTimeout: 10000 |
spark.kubernetes.driver.volumes.persistentVolumeClaim.efs-pvc-mount-d.mount.path: /opt/efs/spark |
spark.kubernetes.driver.volumes.persistentVolumeClaim.efs-pvc-mount-d.mount.readOnly: false |
spark.kubernetes.driver.volumes.persistentVolumeClaim.efs-pvc-mount-d.mount.subPath: spark |
spark.kubernetes.driver.volumes.persistentVolumeClaim.efs-pvc-mount-d.options.claimName: efs-pvc |
spark.kubernetes.driver.volumes.persistentVolumeClaim.efs-pvc-mount-d.options.storageClass: manual |
spark.kubernetes.dynamicAllocation.deleteGracePeriod: 5s |
spark.kubernetes.executor.apiPollingInterval: 60s |
spark.kubernetes.executor.checkAllContainers: true |
spark.kubernetes.executor.deleteOnTermination: false |
spark.kubernetes.executor.eventProcessingInterval: 5s |
spark.kubernetes.executor.limit.cores: 8 |
spark.kubernetes.executor.missingPodDetectDelta: 30s |
spark.kubernetes.executor.podNamePrefix: uscb-exec |
spark.kubernetes.executor.request.cores: 5 |
spark.kubernetes.executor.volumes.persistentVolumeClaim.efs-pvc-mount-e.mount.path: /opt/efs/spark |
spark.kubernetes.executor.volumes.persistentVolumeClaim.efs-pvc-mount-e.mount.readOnly: false |
spark.kubernetes.executor.volumes.persistentVolumeClaim.efs-pvc-mount-e.mount.subPath: spark |
spark.kubernetes.executor.volumes.persistentVolumeClaim.efs-pvc-mount-e.options.claimName: efs-pvc |
spark.kubernetes.executor.volumes.persistentVolumeClaim.efs-pvc-mount-e.options.storageClass: manual |
spark.kubernetes.local.dirs.tmpfs: false |
spark.kubernetes.memoryOverheadFactor: 0.1 |
spark.kubernetes.namespace: default |
spark.kubernetes.report.interval: 5s |
spark.kubernetes.resource.type: java |
spark.kubernetes.submission.connectionTimeout: 10000 |
spark.kubernetes.submission.requestTimeout: 10000 |
spark.kubernetes.submission.waitAppCompletion: true |
spark.kubernetes.submitInDriver: true |
spark.local.dir: /tmp |
spark.locality.wait: 3s |
spark.locality.wait.node: 3s |
spark.locality.wait.process: 3s |
spark.locality.wait.rack: 3s |
spark.master: k8s://https://NKSLODISNJSKSJSKKLS.gr7.us-east-1.eks.amazonaws.com:443 |
spark.memory.fraction: 0.6 |
spark.memory.offHeap.enabled: false |
spark.memory.storageFraction: 0.5 |
spark.network.io.preferDirectBufs: true |
spark.network.maxRemoteBlockSizeFetchToMem: 200m |
spark.network.timeout: 120s |
spark.port.maxRetries: 16 |
spark.rdd.compress: false |
spark.reducer.maxBlocksInFlightPerAddress: 2147483647 |
spark.reducer.maxReqsInFlight: 2147483647 |
spark.reducer.maxSizeInFlight: 48m |
spark.repl.local.jars: local:///opt/spark/examples/asdasdasd.jar |
spark.rpc.askTimeout: 120s |
spark.rpc.io.backLog: 256 |
spark.rpc.io.clientThreads: 5 |
spark.rpc.io.serverThreads: 5 |
spark.rpc.lookupTimeout: 120s |
spark.rpc.message.maxSize: 128 |
spark.rpc.netty.dispatcher.numThreads: 5 |
spark.rpc.numRetries: 3 |
spark.rpc.retry.wait: 3s |
spark.scheduler.excludeOnFailure.unschedulableTaskSetTimeout: 120s |
spark.scheduler.listenerbus.eventqueue.appStatus.capacity: 10000 |
spark.scheduler.listenerbus.eventqueue.capacity: 10000 |
spark.scheduler.listenerbus.eventqueue.eventLog.capacity: 10000 |
spark.scheduler.listenerbus.eventqueue.executorManagement.capacity: 10000 |
spark.scheduler.listenerbus.eventqueue.shared.capacity: 10000 |
spark.scheduler.maxRegisteredResourcesWaitingTime: 30s |
spark.scheduler.minRegisteredResourcesRatio: 0.8 |
spark.scheduler.mode: FIFO |
spark.scheduler.resource.profileMergeConflicts: false |
spark.scheduler.revive.interval: 1s |
spark.serializer: org.apache.spark.serializer.KryoSerializer |
spark.serializer.objectStreamReset: 100 |
spark.shuffle.accurateBlockThreshold: 104857600 |
spark.shuffle.compress: true |
spark.shuffle.file.buffer: 128m |
spark.shuffle.io.backLog: -1 |
spark.shuffle.io.maxRetries: 3 |
spark.shuffle.io.numConnectionsPerPeer: 4 |
spark.shuffle.io.preferDirectBufs: true |
spark.shuffle.io.retryWait: 5s |
spark.shuffle.maxChunksBeingTransferred: 9223372036854775807 |
spark.shuffle.registration.maxAttempts: 3 |
spark.shuffle.registration.timeout: 200 |
spark.shuffle.service.enabled: false |
spark.shuffle.service.index.cache.size: 100m |
spark.shuffle.service.port: 7737 |
spark.shuffle.sort.bypassMergeThreshold: 200 |
spark.shuffle.spill.compress: true |
spark.speculation: false |
spark.speculation.interval: 5s |
spark.speculation.multiplier: 1.5 |
spark.speculation.quantile: 0.75 |
spark.speculation.task.duration.threshold: 10s |
spark.sql.adaptive.coalescePartitions.enabled: true |
spark.sql.adaptive.enabled: true |
spark.sql.adaptive.fetchShuffleBlocksInBatch: true |
spark.sql.adaptive.forceApply: false |
spark.sql.adaptive.localShuffleReader.enabled: true |
spark.sql.adaptive.logLevel: debug |
spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin: 0 |
spark.sql.adaptive.skewJoin.enabled: true |
spark.sql.adaptive.skewJoin.skewedPartitionFactor: 5 |
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInByte: 256MB |
spark.sql.addPartitionInBatch.size: 100 |
spark.sql.analyzer.failAmbiguousSelfJoin: true |
spark.sql.analyzer.maxIterations: 100 |
spark.sql.ansi.enabled: false |
spark.sql.autoBroadcastJoinThreshold: 10MB |
spark.sql.avro.filterPushdown.enabled: true |
spark.sql.broadcastExchange.maxThreadThreshold: 128 |
spark.sql.bucketing.coalesceBucketsInJoin.enabled: false |
spark.sql.bucketing.coalesceBucketsInJoin.maxBucketRatio: 4 |
spark.sql.cache.serializer: org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer |
spark.sql.cartesianProductExec.buffer.in.memory.threshold: 4096 |
spark.sql.caseSensitive: false |
spark.sql.catalogImplementation: in-memory |
spark.sql.cbo.enabled: false |
spark.sql.cbo.joinReorder.card.weight: 0 |
spark.sql.cbo.joinReorder.dp.star.filter: false |
spark.sql.cbo.joinReorder.dp.threshold: 12 |
spark.sql.cbo.joinReorder.enabled: false |
spark.sql.cbo.planStats.enabled: false |
spark.sql.cbo.starJoinFTRatio: 0 |
spark.sql.cbo.starSchemaDetection: false |
spark.sql.codegen.aggregate.fastHashMap.capacityBit: 16 |
spark.sql.codegen.aggregate.map.twolevel.enabled: true |
spark.sql.codegen.aggregate.map.vectorized.enable: false |
spark.sql.codegen.aggregate.splitAggregateFunc.enabled: true |
spark.sql.codegen.cache.maxEntries: 100 |
spark.sql.codegen.comments: false |
spark.sql.codegen.fallback: true |
spark.sql.codegen.hugeMethodLimit: 65535 |
spark.sql.codegen.logging.maxLines: 1000 |
spark.sql.codegen.maxFields: 100 |
spark.sql.codegen.methodSplitThreshold: 1024 |
spark.sql.codegen.splitConsumeFuncByOperator: true |
spark.sql.codegen.useIdInClassName: true |
spark.sql.codegen.wholeStage: true |
spark.sql.columnVector.offheap.enabled: false |
spark.sql.constraintPropagation.enabled: true |
spark.sql.crossJoin.enabled: true |
spark.sql.csv.filterPushdown.enabled: true |
spark.sql.csv.parser.columnPruning.enabled: true |
spark.sql.datetime.java8API.enabled: false |
spark.sql.debug: false |
spark.sql.debug.maxToStringFields: 25 |
spark.sql.decimalOperations.allowPrecisionLoss: true |
spark.sql.event.truncate.length: 2147483647 |
spark.sql.exchange.reuse: true |
spark.sql.execution.arrow.enabled: false |
spark.sql.execution.arrow.fallback.enabled: true |
spark.sql.execution.arrow.maxRecordsPerBatch: 10000 |
spark.sql.execution.arrow.sparkr.enabled: false |
spark.sql.execution.broadcastHashJoin.outputPartitioningExpandLimit: 8 |
spark.sql.execution.fastFailOnFileFormatOutput: false |
spark.sql.execution.pandas.convertToArrowArraySafely: false |
spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled: false |
spark.sql.execution.rangeExchange.sampleSizePerPartition: 100 |
spark.sql.execution.removeRedundantProjects: true |
spark.sql.execution.removeRedundantSorts: true |
spark.sql.execution.reuseSubquery: true |
spark.sql.execution.sortBeforeRepartition: true |
spark.sql.execution.useObjectHashAggregateExec: true |
spark.sql.files.ignoreCorruptFiles: false |
spark.sql.files.ignoreMissingFiles: false |
spark.sql.files.maxPartitionBytes: 128MB |
spark.sql.files.maxRecordsPerFile: 0 |
spark.sql.filesourceTableRelationCacheSize: 1000 |
spark.sql.function.concatBinaryAsString: false |
spark.sql.function.eltOutputAsString: false |
spark.sql.globalTempDatabase: global_temp |
spark.sql.groupByAliases: true |
spark.sql.groupByOrdinal: true |
spark.sql.hive.advancedPartitionPredicatePushdown.enabled: true |
spark.sql.hive.convertCTAS: false |
spark.sql.hive.gatherFastStats: true |
spark.sql.hive.manageFilesourcePartitions: true |
spark.sql.hive.metastorePartitionPruning: true |
spark.sql.hive.metastorePartitionPruningInSetThreshold: 1000 |
spark.sql.hive.verifyPartitionPath: false |
spark.sql.inMemoryColumnarStorage.batchSize: 10000 |
spark.sql.inMemoryColumnarStorage.compressed: true |
spark.sql.inMemoryColumnarStorage.enableVectorizedReader: true |
spark.sql.inMemoryColumnarStorage.partitionPruning: true |
spark.sql.inMemoryTableScanStatistics.enable: false |
spark.sql.join.preferSortMergeJoin: true |
spark.sql.json.filterPushdown.enabled: true |
spark.sql.jsonGenerator.ignoreNullFields: true |
spark.sql.legacy.addSingleFileInAddFile: false |
spark.sql.legacy.allowHashOnMapType: false |
spark.sql.legacy.allowNegativeScaleOfDecimal: false |
spark.sql.legacy.allowParameterlessCount: false |
spark.sql.legacy.allowUntypedScalaUDF: false |
spark.sql.legacy.bucketedTableScan.outputOrdering: false |
spark.sql.legacy.castComplexTypesToString.enabled: false |
spark.sql.legacy.charVarcharAsString: false |
spark.sql.legacy.createEmptyCollectionUsingStringType: false |
spark.sql.legacy.createHiveTableByDefault: true |
spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue: false |
spark.sql.legacy.doLooseUpcast: false |
spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName: true |
spark.sql.legacy.exponentLiteralAsDecimal.enabled: false |
spark.sql.legacy.extraOptionsBehavior.enabled: false |
spark.sql.legacy.followThreeValuedLogicInArrayExists: true |
spark.sql.legacy.fromDayTimeString.enabled: false |
spark.sql.legacy.integerGroupingId: false |
spark.sql.legacy.json.allowEmptyString.enabled: false |
spark.sql.legacy.keepCommandOutputSchema: false |
spark.sql.legacy.literal.pickMinimumPrecision: true |
spark.sql.legacy.notReserveProperties: false |
spark.sql.legacy.parseNullPartitionSpecAsStringLiteral: false |
spark.sql.legacy.parser.havingWithoutGroupByAsWhere: false |
spark.sql.legacy.pathOptionBehavior.enabled: false |
spark.sql.legacy.sessionInitWithConfigDefaults: false |
spark.sql.legacy.setCommandRejectsSparkCoreConfs: true |
spark.sql.legacy.setopsPrecedence.enabled: false |
spark.sql.legacy.sizeOfNull: true |
spark.sql.legacy.statisticalAggregate: false |
spark.sql.legacy.storeAnalyzedPlanForView: false |
spark.sql.legacy.typeCoercion.datetimeToString.enabled: false |
spark.sql.legacy.useCurrentConfigsForView: false |
spark.sql.limit.scaleUpFactor: 4 |
spark.sql.maxMetadataStringLength: 100 |
spark.sql.metadataCacheTTLSeconds: -1 |
spark.sql.objectHashAggregate.sortBased.fallbackThreshold: 128 |
spark.sql.optimizeNullAwareAntiJoin: true |
spark.sql.optimizer.disableHints: false |
spark.sql.optimizer.dynamicPartitionPruning.enabled: true |
spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio: 0 |
spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly: true |
spark.sql.optimizer.dynamicPartitionPruning.useStats: true |
spark.sql.optimizer.enableJsonExpressionOptimization: true |
spark.sql.optimizer.expression.nestedPruning.enabled: true |
spark.sql.optimizer.inSetConversionThreshold: 10 |
spark.sql.optimizer.inSetSwitchThreshold: 400 |
spark.sql.optimizer.maxIterations: 100 |
spark.sql.optimizer.metadataOnly: false |
spark.sql.optimizer.nestedPredicatePushdown.supportedFileSources: parquet,orc |
spark.sql.optimizer.nestedSchemaPruning.enabled: true |
spark.sql.optimizer.replaceExceptWithFilter: true |
spark.sql.optimizer.serializer.nestedSchemaPruning.enabled: true |
spark.sql.orderByOrdinal: true |
spark.sql.parquet.binaryAsString: false |
spark.sql.parquet.columnarReaderBatchSize: 4096 |
spark.sql.parquet.compression.codec: snappy |
spark.sql.parquet.enableVectorizedReader: true |
spark.sql.parquet.filterPushdown: true |
spark.sql.parquet.filterPushdown.date: true |
spark.sql.parquet.filterPushdown.decimal: true |
spark.sql.parquet.filterPushdown.string.startsWith: true |
spark.sql.parquet.filterPushdown.timestamp: true |
spark.sql.parquet.int96AsTimestamp: true |
spark.sql.parquet.int96TimestampConversion: false |
spark.sql.parquet.mergeSchema: false |
spark.sql.parquet.output.committer.class: org.apache.parquet.hadoop.ParquetOutputCommitter |
spark.sql.parquet.pushdown.inFilterThreshold: 10 |
spark.sql.parquet.recordLevelFilter.enabled: false |
spark.sql.parquet.respectSummaryFiles: false |
spark.sql.parquet.writeLegacyFormat: false |
spark.sql.parser.escapedStringLiterals: false |
spark.sql.parser.quotedRegexColumnNames: false |
spark.sql.pivotMaxValues: 10000 |
spark.sql.planChangeLog.level: trace |
spark.sql.pyspark.jvmStacktrace.enabled: false |
spark.sql.repl.eagerEval.enabled: false |
spark.sql.repl.eagerEval.maxNumRows: 20 |
spark.sql.repl.eagerEval.truncate: 20 |
spark.sql.retainGroupColumns: true |
spark.sql.runSQLOnFiles: true |
spark.sql.scriptTransformation.exitTimeoutInSeconds: 5s |
spark.sql.selfJoinAutoResolveAmbiguity: true |
spark.sql.shuffle.partitions: 200 |
spark.sql.sort.enableRadixSort: true |
spark.sql.sources.binaryFile.maxLength: 2147483647 |
spark.sql.sources.bucketing.autoBucketedScan.enabled: true |
spark.sql.sources.bucketing.enabled: true |
spark.sql.sources.bucketing.maxBuckets: 100000 |
spark.sql.sources.commitProtocolClass: org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol |
spark.sql.sources.default: parquet |
spark.sql.sources.fileCompressionFactor: 1 |
spark.sql.sources.ignoreDataLocality: false |
spark.sql.sources.parallelPartitionDiscovery.parallelism: 10000 |
spark.sql.sources.parallelPartitionDiscovery.threshold: 32 |
spark.sql.sources.partitionColumnTypeInference.enabled: true |
spark.sql.sources.validatePartitionColumns: true |
spark.sql.statistics.fallBackToHdfs: false |
spark.sql.statistics.histogram.enabled: false |
spark.sql.statistics.histogram.numBins: 254 |
spark.sql.statistics.ndv.maxError: 0 |
spark.sql.statistics.parallelFileListingInStatsComputation.enabled: true |
spark.sql.statistics.percentile.accuracy: 10000 |
spark.sql.statistics.size.autoUpdate.enabled: false |
spark.sql.streaming.continuous.epochBacklogQueueSize: 10000 |
spark.sql.streaming.continuous.executorPollIntervalMs: 100 |
spark.sql.streaming.continuous.executorQueueSize: 1024 |
spark.sql.streaming.metricsEnabled: true |
spark.sql.subexpressionElimination.cache.maxEntries: 100 |
spark.sql.subexpressionElimination.enabled: true |
spark.sql.subquery.maxThreadThreshold: 16 |
spark.sql.thriftServer.incrementalCollect: false |
spark.sql.thriftServer.queryTimeout: 20s |
spark.sql.thriftserver.ui.retainedSessions: 200 |
spark.sql.thriftserver.ui.retainedStatements: 200 |
spark.sql.truncateTable.ignorePermissionAcl.enabled: false |
spark.sql.ui.explainMode: formatted |
spark.sql.ui.retainedExecutions: 500 |
spark.sql.variable.substitute: true |
spark.sql.view.maxNestedViewDepth: 100 |
spark.sql.warehouse.dir: file:/opt/spark/work-dir/spark-warehouse |
spark.sql.windowExec.buffer.in.memory.threshold: 4096 |
spark.stage.maxConsecutiveAttempts: 4 |
spark.storage.replication.proactive: true |
spark.submit.deployMode: cluster |
spark.submit.pyFiles: |
spark.task.cpus: 1 |
spark.task.maxFailures: 4 |
spark.task.reaper.enabled: true |
spark.task.reaper.killTimeout: -1 |
spark.task.reaper.pollingInterval: 20s |
spark.task.reaper.threadDump: true |
Any quick help would be greatly appreciated.
Attachments
Issue Links
- is related to: HADOOP-18179 Boost S3A Stream Read Performance (Open)