Pig / PIG-4856 Optimization for pig on spark / PIG-5029

Optimize sort case when data is skewed


Details

    • Type: Sub-task
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • None
    • spark-branch
    • spark
    • None

    Description

      In PigMix L9.pig:

      register $PIGMIX_JAR
      A = load '$HDFS_ROOT/page_views' using org.apache.pig.test.pigmix.udf.PigPerformanceLoader()
          as (user, action, timespent, query_term, ip_addr, timestamp,
              estimated_revenue, page_info, page_links);
      B = order A by query_term parallel $PARALLEL;
      store B into '$PIGMIX_OUTPUT/L9out';
      

      The Pig physical plan is converted to a Spark plan, which produces the following RDD lineage:

      [main] 2016-09-08 01:49:09,844 DEBUG converter.StoreConverter (StoreConverter.java:convert(110)) - RDD lineage: (23) MapPartitionsRDD[8] at map at StoreConverter.java:80 []
       |   MapPartitionsRDD[7] at mapPartitions at SortConverter.java:58 []
       |   ShuffledRDD[6] at sortByKey at SortConverter.java:56 []
       +-(23) MapPartitionsRDD[3] at map at SortConverter.java:49 []
          |   MapPartitionsRDD[2] at mapPartitions at ForEachConverter.java:64 []
          |   MapPartitionsRDD[1] at map at LoadConverter.java:127 []
          |   NewHadoopRDD[0] at newAPIHadoopRDD at LoadConverter.java:102 []
      

      We use sortByKey to implement the sort. RDD.sortByKey uses RangePartitioner, which samples the data and splits the keys into ranges of roughly equal size. Even so, the test result (see the attached document) shows that one partition receives most of the keys and takes a long time to finish.
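
      The skew described above can be reproduced without Spark. The following is a minimal sketch in plain Java, not Spark's RangePartitioner code: it picks split points at equal-count quantiles of the keys and assigns each key to a range by binary search. The class name, the helper methods, the key "hot_term", the 80% hot-key fraction, and the use of the full data set as the sample are all assumptions made for illustration. Because equal keys always fall into the same range, a single dominant key cannot be spread across partitions, however evenly the ranges are chosen.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical demo class; a simplified stand-in for range partitioning,
// not the implementation used by Spark or by Pig on Spark.
class RangePartitionSkewDemo {

    /** Picks numPartitions - 1 split points at equal-count quantiles of the sample. */
    static String[] splitPoints(List<String> sample, int numPartitions) {
        String[] sorted = sample.toArray(new String[0]);
        Arrays.sort(sorted);
        String[] bounds = new String[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            bounds[i - 1] = sorted[(int) ((long) i * sorted.length / numPartitions)];
        }
        return bounds;
    }

    /** Returns the index of the first bound >= key, so equal keys share one partition. */
    static int partitionFor(String key, String[] bounds) {
        int lo = 0;
        int hi = bounds.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (bounds[mid].compareTo(key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Counts how many keys land in each partition (assumption: sample = all keys). */
    static int[] partitionSizes(List<String> keys, int numPartitions) {
        String[] bounds = splitPoints(keys, numPartitions);
        int[] sizes = new int[numPartitions];
        for (String key : keys) {
            sizes[partitionFor(key, bounds)]++;
        }
        return sizes;
    }

    /** Generates n keys where roughly hotFraction of them are the same hot key. */
    static List<String> skewedKeys(int n, double hotFraction, long seed) {
        Random rnd = new Random(seed);
        List<String> keys = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            keys.add(rnd.nextDouble() < hotFraction
                    ? "hot_term"
                    : String.format("term_%06d", rnd.nextInt(1_000_000)));
        }
        return keys;
    }

    public static void main(String[] args) {
        int[] sizes = partitionSizes(skewedKeys(100_000, 0.8, 42L), 4);
        // One partition holds every copy of the hot key; the middle ones stay empty.
        System.out.println(Arrays.toString(sizes));
    }
}
```

      In this sketch all three split points equal the hot key, so the first partition receives every copy of it while the two middle partitions receive nothing. Balancing such a sort would require letting one key span several partitions, which plain range partitioning by key does not do.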

      Attachments

        1. SkewedData_L9.docx
          1.10 MB
          liyunzhang
        2. PIG-5051_5029_5.patch
          11 kB
          liyunzhang
        3. PIG-5029.patch
          8 kB
          liyunzhang
        4. PIG-5029_3.patch
          51 kB
          liyunzhang
        5. PIG-5029_2.patch
          10 kB
          liyunzhang


          People

            Assignee: kellyzly liyunzhang
            Reporter: kellyzly liyunzhang

            Dates

              Created:
              Updated:
