Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:

      Description

      I've written Spark programs that convert plain text files to Parquet and CSV on S3.

      There is around 8 TB of data that I need to compress into a smaller format for further processing on Amazon EMR.

      Results:

      1) Text -> CSV: transformed the full 8 TB to S3 successfully in 1.2 hrs, without any problems.

      2) Text -> Parquet: the job completed in the same time (i.e. 1.2 hrs), but even after job completion it keeps writing the data separately to S3, which makes it much slower and leaves the cluster starved.

      Input : s3n://<SameBucket>/input
      Output : s3n://<SameBucket>/output/parquet

      Let's say I have around 10K files; then the write-back to S3 proceeds at roughly 1,000 files per 20 minutes.
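At that rate, a rough back-of-the-envelope estimate (assuming the rate stays linear, which the report doesn't confirm) puts the total write-back time for 10K files at over three hours, well beyond the 1.2 hrs the job itself took:

```python
# Hypothetical estimate of total S3 write-back time, assuming the
# observed rate of 1,000 files per 20 minutes holds for all 10K files.
total_files = 10_000
files_per_batch = 1_000
minutes_per_batch = 20

total_minutes = total_files / files_per_batch * minutes_per_batch
print(total_minutes)       # 200.0 minutes
print(total_minutes / 60)  # ~3.3 hours
```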

      Note:
      I also found that the program creates a temp folder in the S3 output location, and in the logs I've seen S3 read delays.

      Can anyone tell me what I am doing wrong? Or is there anything I can add so that the Spark app doesn't create a temp folder on S3 and writes from EMR to S3 as fast as saveAsTextFile does? Thanks.
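      For context: the temp folder comes from Hadoop's FileOutputCommitter, which writes task output to a temporary location and then renames it into place on commit; on S3 a "rename" is a copy-and-delete, which would explain the long tail after the job finishes. A commonly suggested workaround for Spark 1.x era setups (a sketch only, not verified against this exact EMR configuration; the direct Parquet committer was removed in Spark 2.0, and `your-job.jar` is a placeholder) is to use a direct output committer and the v2 commit algorithm:

      ```shell
      # Sketch: spark-submit flags that avoid the rename-based commit on S3.
      # Assumptions: Spark 1.x; direct committers are unsafe with speculative
      # execution, hence spark.speculation=false.
      spark-submit \
        --conf spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet.DirectParquetOutputCommitter \
        --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
        --conf spark.speculation=false \
        your-job.jar
      ```

      Note the trade-off: writing directly to the final location skips the temp-folder rename but gives up the atomicity the committer's rename step provides, so failed tasks can leave partial output behind.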


              People

              • Assignee: Unassigned
              • Reporter: mkanchwala (Murtaza Kanchwala)
              • Votes: 0
              • Watchers: 2
