• Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:


      I've created Spark programs that convert plain text files to Parquet and CSV on S3.

      There is around 8 TB of data, and I need to compress it into a smaller format for further processing on Amazon EMR.
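
      For context, here is a rough sketch of the kind of conversion program I'm running (Spark 1.x-era API). The paths, the tab delimiter, the two-column schema and the object name are simplified placeholders, not the exact code:

          import org.apache.spark.{SparkConf, SparkContext}
          import org.apache.spark.sql.SQLContext

          object TextToParquet {
            def main(args: Array[String]): Unit = {
              val sc = new SparkContext(new SparkConf().setAppName("TextToParquet"))
              val sqlContext = new SQLContext(sc)
              import sqlContext.implicits._

              // Read the raw text files from S3.
              val lines = sc.textFile("s3n://<SameBucket>/input")

              // Placeholder parsing step: tab-delimited lines with two columns.
              val records = lines.map(_.split("\t")).map(a => (a(0), a(1)))

              // Text -> CSV via saveAsTextFile (the fast path in the results below).
              records.map { case (k, v) => s"$k,$v" }
                .saveAsTextFile("s3n://<SameBucket>/output/csv")

              // Text -> Parquet via the DataFrame writer (the slow write-back path).
              records.toDF("key", "value")
                .write.parquet("s3n://<SameBucket>/output/parquet")
            }
          }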

      Results:

      1) Text -> CSV: transformed the 8 TB of data and wrote it to S3 successfully in 1.2 hrs, without any problems.

      2) Text -> Parquet: the job completed in the same time (i.e. 1.2 hrs), but after the job completes it is still spilling/writing the data to S3 separately, which makes it much slower and starves the cluster.

      Input : s3n://<SameBucket>/input
      Output : s3n://<SameBucket>/output/parquet

      Let's say I have around 10K files; the write back to S3 proceeds at roughly 1000 files per 20 minutes (i.e. over three hours for all 10K files).

      Note:
      I also found that the program creates a temp folder in the S3 output location, and in the logs I've seen S3ReadDelays.

      Can anyone tell me what I am doing wrong? Or is there anything I need to add so that the Spark app doesn't create a temp folder on S3 and writes from EMR to S3 as quickly as saveAsTextFile does? Thanks.
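
      For reference, these are the settings I've seen discussed for this rename-from-temp-folder slowdown; this is only a sketch of what might help, not a verified fix, and whether the options are available depends on the Spark and Hadoop versions shipped with the EMR cluster:

          import org.apache.spark.SparkConf

          // Sketch only: settings commonly associated with slow S3 commits.
          val conf = new SparkConf()
            .setAppName("TextToParquet")
            // Hadoop 2.7+ FileOutputCommitter algorithm v2 moves task output
            // into place during task commit instead of in a single final
            // rename pass, which is the expensive step on S3.
            .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
            // Direct / no-rename commit strategies are unsafe when speculative
            // execution can launch duplicate task attempts.
            .set("spark.speculation", "false")

          // Some Spark 1.x releases also ship a direct Parquet output committer
          // selectable via spark.sql.parquet.output.committer.class; the exact
          // class name depends on the Spark version, so check the matching docs.

      The idea is that task output lands directly under the final s3n:// location instead of being renamed out of a temp folder after the job completes.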


              • Assignee: Murtaza Kanchwala (mkanchwala)
              • Votes: 0
              • Watchers: 2
