Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SQL

    Description

      I've created Spark programs that convert plain text files to Parquet and to CSV on S3.

      There is around 8 TB of data, and I need to compress it into a smaller format for further processing on Amazon EMR.
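
      For reference, here is a minimal sketch of what the two conversion jobs described above could look like (assumptions: Spark 1.4-era APIs, the bucket placeholder from this report, a single-column schema, and tab-delimited input; this is an illustration, not the reporter's actual code):

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.sql.SQLContext

        object TextConversionJobs {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("TextConversionJobs"))
            val sqlContext = new SQLContext(sc)
            import sqlContext.implicits._

            // Raw text input on S3 (bucket name is a placeholder, as in the report).
            val lines = sc.textFile("s3n://<SameBucket>/input")

            // 1) Text -> CSV: a plain RDD save to the output path
            //    (tab-delimited input is an assumption here).
            lines.map(_.split('\t').mkString(","))
                 .saveAsTextFile("s3n://<SameBucket>/output/csv")

            // 2) Text -> Parquet: wrap each line in a single-column DataFrame and
            //    write Parquet; this path goes through a Hadoop output committer,
            //    which stages files under a _temporary folder and renames them
            //    when the job commits.
            lines.map(line => Tuple1(line)).toDF("value")
                 .write.parquet("s3n://<SameBucket>/output/parquet")
          }
        }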

      Results :

      1) Text -> CSV took 1.2 hrs to transform the 8 TB of data and wrote it to S3 successfully without any problems.

      2) Text -> Parquet completed in the same time (i.e. 1.2 hrs), but even after the job finishes it keeps writing the data to S3 separately, which makes it slower and leaves it starved.

      Input : s3n://<SameBucket>/input
      Output : s3n://<SameBucket>/output/parquet

      Let's say I have around 10K files; it then writes them back to S3 at about 1,000 files per 20 minutes (roughly 3.3 hours for all of them).

      Note :
      I also found that the program creates a temp folder in the S3 output location, and in the logs I've seen S3ReadDelays.

      Can anyone tell me what I am doing wrong? Or is there anything I need to add so that the Spark app doesn't create a temp folder on S3 and writes from EMR to S3 as fast as saveAsTextFile does? Thanks
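
      For reference only (this issue was resolved as Invalid, so the note below is not its resolution): the temp folder mentioned above is the _temporary staging directory used by the Hadoop output committer behind the Parquet writer, and on S3 the commit-time renames are copies, which is what the post-job write phase looks like. On Spark 1.x one way this was commonly experimented with is the bundled direct Parquet committer; the sketch below assumes Spark 1.4 (the class was moved in later 1.x releases and removed in Spark 2.0), and speculation has to stay off for a direct committer to be safe:

        import org.apache.spark.SparkConf

        // Committer-related settings only; the rest of the job stays unchanged.
        val conf = new SparkConf()
          .setAppName("TextToParquetDirectCommit")
          // A direct committer skips the _temporary staging/rename step, so
          // speculative duplicate tasks could corrupt output; keep speculation off.
          .set("spark.speculation", "false")
          // Spark 1.4 ships a committer that writes Parquet output straight to the
          // final location instead of renaming files after the job commits.
          .set("spark.sql.parquet.output.committer.class",
               "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")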

    People

      Assignee: Unassigned
      Reporter: Murtaza Kanchwala (mkanchwala)
      Votes: 0
      Watchers: 2
