Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:

      Description

      I've written Spark programs that convert plain text files to Parquet and CSV on S3.

      There is around 8 TB of data, and I need to compress it into a smaller format for further processing on Amazon EMR.

      Results:

      1) Text -> CSV took 1.2 hrs to transform 8 TB of data to S3, completing successfully without any problems.

      2) Text -> Parquet: the job completed in the same time (i.e. 1.2 hrs), but even after job completion it keeps writing the data separately to S3, which makes it slower and starves it.

      Input: s3n://<SameBucket>/input
      Output: s3n://<SameBucket>/output/parquet

      Say I have around 10K files; then it writes back to S3 at roughly 1000 files per 20 minutes.

      Note:
      I also found that the program creates a temp folder in the S3 output location, and in the logs I've seen S3ReadDelays.

      Can anyone tell me what I am doing wrong? Or is there anything I need to add so that the Spark app doesn't create a temp folder on S3 and writes from EMR to S3 as fast as saveAsTextFile does? Thanks.
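
      For context, a minimal sketch of the kind of job described above (a sketch only, not the reporter's actual code; it assumes the Spark 1.4-era API, a hypothetical TextToParquet class, and a single-column schema, with <SameBucket> left as the placeholder used in the description):

```scala
// Hypothetical sketch of the reported text -> Parquet job (Spark 1.4-era API).
// <SameBucket> is a placeholder, as in the description above.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TextToParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TextToParquet"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Read the raw text, wrap each line in a one-column DataFrame,
    // and write it back to S3 as Parquet.
    val df = sc.textFile("s3n://<SameBucket>/input").toDF("line")
    df.write.parquet("s3n://<SameBucket>/output/parquet")

    sc.stop()
  }
}
```

      The saveAsTextFile baseline the reporter compares against would be the same read followed by RDD.saveAsTextFile on the output path.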

          Activity

          Sean Owen added a comment -

          Yes, I'm really only suggesting this start at user@. If it turns out to be something that needs to change in Spark, then we make a JIRA with details. There may be no (new, separate) issue here, but something that's already resolved or has an easy answer.
          Murtaza Kanchwala added a comment -

          Yin Huai I've posted the same on the user mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-Writing-data-to-s3-slowly-td23863.html
          Yin Huai added a comment -

          Murtaza Kanchwala I think our user mailing list is a good place for this kind of discussion.
          Yin Huai added a comment -

          Also cc Cheng Lian
          Murtaza Kanchwala added a comment - edited

          Hi Sean Owen,

          I would like to know where I can file this kind of thing and let you know that there are some serious problems which can affect production environments.

          I would be glad if you let me know, for future issues and updates.

          Thanks in advance.
          Thanks Yin Huai.
          Sean Owen added a comment -

          Murtaza Kanchwala Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark There are some problems here (don't set Critical priority; you shouldn't have set a fix version). But this also looks like a question and something you're investigating. It's not suitable as a JIRA since you don't have a clear issue to report.
          Yin Huai added a comment - edited

          Can you try setting spark.sql.parquet.output.committer.class to org.apache.spark.sql.parquet.DirectParquetOutputCommitter in your Hadoop conf once Spark 1.4.1 is out? Also, since you have a large job, it is very possible that the metadata operations on Parquet files are slow, which will be addressed by https://issues.apache.org/jira/browse/SPARK-8125 (PR: https://github.com/apache/spark/pull/7396).
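
          The setting named in this comment can also be passed at submit time; a hypothetical invocation (the application class and jar names are placeholders, not from this thread):

```shell
# Hypothetical spark-submit invocation for Spark 1.4.1+. The committer class
# is the one named in the comment above; --class and the jar are placeholders.
spark-submit \
  --conf spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet.DirectParquetOutputCommitter \
  --class com.example.TextToParquet \
  my-job.jar
```

          The direct committer writes task output straight to the final S3 location instead of committing via a _temporary folder, which is the rename-based step that is slow on S3.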
          Murtaza Kanchwala added a comment -

          I also faced some task failures, but they were harmless.
          Murtaza Kanchwala added a comment -

          This might be the issue here.
          Murtaza Kanchwala added a comment -

          Some of the relevant things I found via Google:
          http://search-hadoop.com/m/q3RTtzU2FI1Mo1QA&subj=Spark+will+process+_temporary+folder+on+S3+is+very+slow+and+always+cause+failure
          http://stackoverflow.com/questions/26291165/spark-sql-unable-to-complete-writing-parquet-data-with-a-large-number-of-shards?lq=1
          http://stackoverflow.com/questions/26332542/saving-a-25t-schemardd-in-parquet-format-on-s3
          https://forums.databricks.com/questions/1097/stall-on-loading-many-parquet-files-on-s3.html

            People

            • Assignee: Unassigned
            • Reporter: Murtaza Kanchwala
            • Votes: 0
            • Watchers: 2
