Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:

      Description

      I've written Spark programs that convert plain text files to Parquet and CSV on S3.

      There is around 8 TB of data, and I need to compress it into a smaller format for further processing on Amazon EMR.

      Results:

      1) Text -> CSV took 1.2 hrs to transform 8 TB of data to S3, completing successfully without any problems.

      2) Text -> Parquet: the job completed in the same time (i.e. 1.2 hrs), but even after job completion it keeps writing the data separately to S3, which makes it slower and starves it.

      Input: s3n://<SameBucket>/input
      Output: s3n://<SameBucket>/output/parquet

      Say I have around 10K files; then it writes back to S3 at roughly 1000 files per 20 minutes.

      Note:
      I also found that the program creates a temp folder in the S3 output location, and in the logs I've seen S3ReadDelays.

      Can anyone tell me what I am doing wrong? Or is there anything I need to add so that the Spark app doesn't create a temp folder on S3 and writes from EMR to S3 as fast as saveAsTextFile does? Thanks.
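
      For context, a minimal sketch of the kind of job described above (a sketch only, not the reporter's actual code; it assumes the Spark 1.4-era API, a hypothetical TextToParquet class, and a single-column schema, with <SameBucket> left as the placeholder used in the description):

```scala
// Hypothetical sketch of the reported text -> Parquet job (Spark 1.4-era API).
// <SameBucket> is a placeholder, as in the description above.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TextToParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TextToParquet"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Read the raw text, wrap each line in a one-column DataFrame,
    // and write it back to S3 as Parquet.
    val df = sc.textFile("s3n://<SameBucket>/input").toDF("line")
    df.write.parquet("s3n://<SameBucket>/output/parquet")

    sc.stop()
  }
}
```

      The saveAsTextFile baseline the reporter compares against would be the same read followed by RDD.saveAsTextFile on the output path.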

          Activity

          Sean Owen added a comment -

          Yes, I'm really only suggesting this start at user@. If it turns out to be something that needs to change in Spark, then we make a JIRA with details. There may be no (new, separate) issue here, but something that's already resolved or has an easy answer.
          Murtaza Kanchwala added a comment -

          Yin Huai I've posted the same on the user mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-Writing-data-to-s3-slowly-td23863.html
          Yin Huai added a comment -

          Murtaza Kanchwala I think our user mailing list is a good place for this kind of discussion.
          Yin Huai added a comment -

          Also cc Cheng Lian
          Murtaza Kanchwala added a comment - edited

          Hi Sean Owen,

          I would like to know where I can file this kind of thing and let you know that there are some serious problems which can affect production environments.

          I would be glad if you let me know, for future issues and updates.

          Thanks in advance.
          Thanks Yin Huai.
          Sean Owen added a comment -

          Murtaza Kanchwala Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark There are some problems here (don't set Critical priority; you shouldn't have set a fix version). But this also looks like a question and something you're investigating. It's not suitable as a JIRA since you don't have a clear issue to report.
          Yin Huai added a comment - edited

          Can you try setting spark.sql.parquet.output.committer.class to org.apache.spark.sql.parquet.DirectParquetOutputCommitter in your Hadoop conf once Spark 1.4.1 is out? Also, since you have a large job, it is very possible that the metadata operations on Parquet files are slow, which will be addressed by https://issues.apache.org/jira/browse/SPARK-8125 (PR: https://github.com/apache/spark/pull/7396).
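
          The setting named in this comment can also be passed at submit time; a hypothetical invocation (the application class and jar names are placeholders, not from this thread):

```shell
# Hypothetical spark-submit invocation for Spark 1.4.1+. The committer class
# is the one named in the comment above; --class and the jar are placeholders.
spark-submit \
  --conf spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet.DirectParquetOutputCommitter \
  --class com.example.TextToParquet \
  my-job.jar
```

          The direct committer writes task output straight to the final S3 location instead of committing via a _temporary folder, which is the rename-based step that is slow on S3.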
          Murtaza Kanchwala added a comment -

          I also faced some task failures, but they were harmless.
          Murtaza Kanchwala added a comment -

          This might be the issue here.
          Murtaza Kanchwala added a comment -

          Some of the relevant things I found via Google:
          http://search-hadoop.com/m/q3RTtzU2FI1Mo1QA&subj=Spark+will+process+_temporary+folder+on+S3+is+very+slow+and+always+cause+failure
          http://stackoverflow.com/questions/26291165/spark-sql-unable-to-complete-writing-parquet-data-with-a-large-number-of-shards?lq=1
          http://stackoverflow.com/questions/26332542/saving-a-25t-schemardd-in-parquet-format-on-s3
          https://forums.databricks.com/questions/1097/stall-on-loading-many-parquet-files-on-s3.html

            People

            • Assignee: Unassigned
            • Reporter: Murtaza Kanchwala
            • Votes: 0
            • Watchers: 2
