PIG-4059: Pig on Spark


    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: spark-branch, 0.17.0
    • Component/s: spark
    • Labels:
    • Hadoop Flags: Reviewed

      Description

      Setting up your development environment:
      0. Download the Spark release package (currently Pig on Spark only supports Spark 1.6).
      1. Check out the Pig Spark branch.

      2. Build Pig by running "ant jar" (or "ant -Dhadoopversion=23 jar" for hadoop-2.x versions).

      3. Configure these environment variables:
      export HADOOP_USER_CLASSPATH_FIRST="true"
      Currently "local" and "yarn-client" modes are supported; set the SPARK_MASTER environment variable accordingly:
      export SPARK_MASTER=local or export SPARK_MASTER="yarn-client"

      4. In local mode: ./pig -x spark_local xxx.pig
      In yarn-client mode:
      export SPARK_HOME=xx;
      export SPARK_JAR=hdfs://example.com:8020/xxxx (the HDFS location to which you uploaded the spark-assembly*.jar)
      ./pig -x spark xxx.pig
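
      Putting the steps together, a complete yarn-client session might look like the sketch below. The Spark install path, the assembly jar location on HDFS, and wordcount.pig are hypothetical placeholders, not values prescribed by this issue.

      export HADOOP_USER_CLASSPATH_FIRST="true"
      export SPARK_MASTER="yarn-client"
      export SPARK_HOME=/opt/spark-1.6.0
      export SPARK_JAR=hdfs://example.com:8020/user/pig/spark-assembly-1.6.0-hadoop2.6.0.jar

      # wordcount.pig could be as simple as:
      #   lines  = LOAD 'input.txt' AS (line:chararray);
      #   words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
      #   counts = FOREACH (GROUP words BY word) GENERATE group, COUNT(words);
      #   STORE counts INTO 'wordcount_out';

      ./pig -x spark wordcount.pig          # yarn-client mode
      export SPARK_MASTER=local
      ./pig -x spark_local wordcount.pig    # local mode; SPARK_HOME and SPARK_JAR are only needed for yarn-client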

        Attachments

        1. Pig-on-Spark-Design-Doc.pdf (82 kB, Praveen Rachabattuni)
        2. Pig-on-Spark-Scope.pdf (549 kB, Mohit Sabharwal)

        Issue Links

        1. Initial implementation of Pig on Spark (Sub-task, Closed, Praveen Rachabattuni)
        2. Initial implementation of unit tests for Pig on Spark (Sub-task, Closed, liyunzhang)
        3. Move to Spark 1.x (Sub-task, Closed, Richard Ding)
        4. e2e tests for Spark (Sub-task, Closed, Praveen Rachabattuni)
        5. Fix classpath error when using pig command with Spark (Sub-task, Resolved, liyunzhang)
        6. Make collected group work with Spark (Sub-task, Closed, Praveen Rachabattuni)
        7. Make cross join work with Spark (Sub-task, Resolved, Mohit Sabharwal)
        8. Implement replicated join in Spark engine (Sub-task, Closed, Mohit Sabharwal)
        9. Make skewed join work with Spark (Sub-task, Closed, Praveen Rachabattuni)
        10. Make ruby udfs work with Spark (Sub-task, Closed, liyunzhang)
        11. Make merge join work with Spark engine (Sub-task, Resolved, Praveen Rachabattuni)
        12. Make python udfs work with Spark (Sub-task, Closed, liyunzhang)
        13. Make merge-sparse join work with Spark (Sub-task, Closed, Abhishek Agarwal)
        14. Make stream work with Spark (Sub-task, Closed, liyunzhang)
        15. Copy spark dependencies to lib directory (Sub-task, Closed, Praveen Rachabattuni)
        16. Make rank work with Spark (Sub-task, Closed, Carlos Balduz)
        17. UDFContext is not initialized in executors when running on Spark cluster (Sub-task, Closed, liyunzhang)
        18. Package pig along with dependencies into a fat jar while job submission to Spark cluster (Sub-task, Closed, Praveen Rachabattuni)
        19. Avoid packaging spark specific jars into pig fat jar (Sub-task, Closed, Unassigned)
        20. Add SparkPlan in spark package (Sub-task, Closed, liyunzhang)
        21. Add stats and error reporting for Spark (Sub-task, Closed, Mohit Sabharwal)
        22. Move to Spark 1.2 (Sub-task, Closed, Mohit Sabharwal)
        23. Merge from trunk (1) [Spark Branch] (Sub-task, Closed, Praveen Rachabattuni)
        24. Merge from trunk (2) [Spark Branch] (Sub-task, Closed, Praveen Rachabattuni)
        25. Upgrade to Spark 1.3 (Sub-task, Closed, Mohit Sabharwal)
        26. change from "SparkLauncher#physicalToRDD" to "SparkLauncher#sparkPlanToRDD" after using spark plan in SparkLauncher (Sub-task, Closed, liyunzhang)
        27. Implement MergeJoin (as regular join) for Spark engine (Sub-task, Closed, Mohit Sabharwal)
        28. implement visitSkewedJoin in SparkCompiler (Sub-task, Closed, liyunzhang)
        29. Fix the NPE of System.getenv("SPARK_MASTER") in SparkLauncher.java (Sub-task, Closed, liyunzhang)
        30. remove unnessary MR plan code generated in SparkLauncher.java (Sub-task, Resolved, liyunzhang)
        31. Make ship work with spark (Sub-task, Closed, liyunzhang)
        32. PackageConverter hanging in Spark (Sub-task, Patch Available, Carlos Balduz)
        33. StackOverflowError in LIMIT operation on Spark (Sub-task, Patch Available, Carlos Balduz)
        34. Error when there is a bag inside an RDD (Sub-task, Closed, Carlos Balduz)
        35. "pig.output.lazy" not works in spark mode (Sub-task, Closed, liyunzhang)
        36. e2e tests for Spark can not work in hadoop env (Sub-task, Closed, liyunzhang)
        37. SchemaTupleBackend error when working on a Spark 1.1.0 cluster (Sub-task, Open, Unassigned)
        38. Order By error after Group By in Spark (Sub-task, Closed, Unassigned)
        39. Limit after sort does not work in spark mode (Sub-task, Closed, liyunzhang)
        40. Sort the leaves by SparkOperator.operatorKey in SparkLauncher#sparkOperToRDD (Sub-task, Closed, liyunzhang)
        41. Remove redundant code, comments in SparkLauncher (Sub-task, Closed, Praveen Rachabattuni)
        42. Add apache license header to all spark package source files (Sub-task, Closed, Praveen Rachabattuni)
        43. Enable Secondary key sort feature in spark mode (Sub-task, Closed, liyunzhang)
        44. Remove unnecessary store and load when POSplit is encounted (Sub-task, Closed, liyunzhang)
        45. SparkOperator should correspond to complete Spark job (Sub-task, Closed, Mohit Sabharwal)
        46. Enable local mode tests for Spark engine (Sub-task, Closed, Mohit Sabharwal)
        47. Remove repetitive org.apache.pig.test.Util#isSparkExecType (Sub-task, Closed, liyunzhang)
        48. OutputConsumerIterator should flush buffered records (Sub-task, Resolved, Mohit Sabharwal)
        49. Set CROSS operation parallelism for Spark engine (Sub-task, Closed, Mohit Sabharwal)
        50. Fix POGlobalRearrangeSpark copy constructor for Spark engine (Sub-task, Closed, Mohit Sabharwal)
        51. Modify the test.output value from "no" to "yes" to show more error message (Sub-task, Closed, liyunzhang)
        52. Support custom MR partitioners for Spark engine (Sub-task, Closed, Mohit Sabharwal)
        53. Fix unit test failure in TestSecondarySortSpark (Sub-task, Closed, liyunzhang)
        54. Pass value to MR Partitioners in Spark engine (Sub-task, Open, Mohit Sabharwal)
        55. Use "cogroup" spark api to implement "groupby+secondarysort" case in GlobalRearrangeConverter.java (Sub-task, Closed, liyunzhang)
        56. Enable "TestPruneColumn" in spark mode (Sub-task, Closed, Xianda Ke)
        57. Use newAPIHadoopRDD instead of newAPIHadoopFile (Sub-task, Closed, Mohit Sabharwal)
        58. Cleanup: Rename POConverter to RDDConverter (Sub-task, Closed, Mohit Sabharwal)
        59. Move tests under 'test-spark' target (Sub-task, Closed, Mohit Sabharwal)
        60. Fix unit test failure in TestCase (Sub-task, Closed, Xianda Ke)
        61. Enable "TestMultiQueryLocal" in spark mode (Sub-task, Closed, liyunzhang)
        62. Enable "TestMultiQuery" in spark mode (Sub-task, Closed, liyunzhang)
        63. Fix unit test failures about TestFRJoinNullValue in spark mode (Sub-task, Closed, liyunzhang)
        64. Fix unit test failures about MergeJoinConverter in spark mode (Sub-task, Closed, liyunzhang)
        65. Enable "TestNullConstant" unit test in spark mode (Sub-task, Closed, Xianda Ke)
        66. Implement Merge CoGroup for Spark engine (Sub-task, Closed, liyunzhang)
        67. Clean up: refactor the package import order in the files under pig/src/org/apache/pig/backend/hadoop/executionengine/spark according to certain rule (Sub-task, Closed, liyunzhang)
        68. fix a bug when coping Jar to SparkJob working directory (Sub-task, Closed, Xianda Ke)
        69. Enable "TestDefaultDateTimeZone" unit tests in spark mode (Sub-task, Closed, liyunzhang)
        70. Enable "TestRank1","TestRank3" unit tests in spark mode (Sub-task, Closed, Xianda Ke)
        71. Enable "TestOrcStorage" unit test in spark mode (Sub-task, Closed, liyunzhang)
        72. Fix remaining unit test failures about "TestHBaseStorage" in spark mode (Sub-task, Closed, liyunzhang)
        73. Fix unit test failures about TestAssert (Sub-task, Closed, Xianda Ke)
        74. Enable "TestLocationInPhysicalPlan" in spark mode (Sub-task, Closed, liyunzhang)
        75. Fix null keys join in SkewedJoin in spark mode (Sub-task, Closed, liyunzhang)
        76. Fix UT errors of TestPigRunner in Spark mode (Sub-task, Closed, Xianda Ke)
        77. Cleanup: change the indent size of some files of pig on spark project from 2 to 4 space (Sub-task, Closed, liyunzhang)
        78. Enable Illustrate in spark (Sub-task, In Progress, Jakov Rabinovits)
        79. Skip TestCubeOperator.testIllustrate and TestMultiQueryLocal.testMultiQueryWithIllustrate (Sub-task, Closed, liyunzhang)
        80. Update hadoop version to enable Spark output statistics (Sub-task, Closed, Xianda Ke)
        81. Fix records count issues in output statistics (Sub-task, Closed, Xianda Ke)
        82. Support hadoop-like Counter using spark accumulator (Sub-task, Closed, Xianda Ke)
        83. Support InputStats in spark mode (Sub-task, Closed, Xianda Ke)
        84. Fix unit test failures in org.apache.pig.test.TestScriptLanguageJavaScript (Sub-task, Closed, Xianda Ke)
        85. Add Spark Unit Tests for SparkPigStats (Sub-task, Open, Xianda Ke)
        86. Fix UT failures in TestPigServerLocal (Sub-task, Closed, Xianda Ke)
        87. Enable Pig on Spark to run on Yarn Client mode (Sub-task, Closed, Srikanth Sundarrajan)
        88. Operators with multiple predecessors fail under multiquery optimization (Sub-task, Closed, liyunzhang)
        89. Enable Pig on Spark to run on Yarn Cluster mode (Sub-task, Resolved, Srikanth Sundarrajan)
        90. Class conflicts: Kryo bundled in spark vs kryo bundled with pig (Sub-task, Closed, Srikanth Sundarrajan)
        91. Enable dynamic resource allocation/de-allocation on Yarn backends (Sub-task, Closed, Srikanth Sundarrajan)
        92. Support combine for spark mode (Sub-task, Closed, Pallavi Rao)
        93. Tests in TestCombiner fail due to missing leveldb dependency (Sub-task, Closed, Pallavi Rao)
        94. Spark related JARs are not included when importing project via IDE (Sub-task, Closed, Xianda Ke)
        95. the value of $SPARK_DIST_CLASSPATH in pig file is invalid (Sub-task, Resolved, liyunzhang)
        96. Ensure spark can be run as PIG action in Oozie (Sub-task, Open, Prateek Vaishnav)
        97. Fix UT failures in TestScriptLanguage (Sub-task, Closed, Xianda Ke)
        98. Ensure GroupBy is optimized for all algebraic Operations (Sub-task, Closed, Pallavi Rao)
        99. Refactor SparkLauncher for spark engine (Sub-task, Closed, liyunzhang)
        100. Enable "pig.disable.counter" for spark engine (Sub-task, Closed, liyunzhang)
        101. the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine (Sub-task, Resolved, liyunzhang)
        102. Implement to collect metric data like getSMMSpillCount() in SparkJobStats (Sub-task, Open, Unassigned)
        103. Merge trunk[3] into spark branch (Sub-task, Closed, Pallavi Rao)
        104. Collected group doesn't work in some cases (Sub-task, Closed, Xianda Ke)
        105. pig.noSplitCombination=true should always be set internally for a merge join (Sub-task, Closed, Xianda Ke)
        106. Merge trunk[4] into spark branch (Sub-task, Closed, Pallavi Rao)
        107. Last record is missing in STREAM operator (Sub-task, Closed, Xianda Ke)
        108. Need upgrade snappy-java.version to 1.1.1.3 (Sub-task, Closed, liyunzhang)
        109. OutputConsumeIterator can't handle the last buffered tuples for some Operators (Sub-task, Closed, Xianda Ke)
        110. Add PigSplit#getLocationInfo to fix the NPE found in log in spark mode (Sub-task, Resolved, liyunzhang)
        111. Implement FR join by broadcasting small rdd not making more copys of data (Sub-task, Closed, Nándor Kollár)
        112. Fix unit test failure after PIG-4771's patch was checked in (Sub-task, Closed, liyunzhang)
        113. The number of records of input file is calculated wrongly in spark mode in multiquery case (Sub-task, Closed, Ádám Szita)
        114. Fail to use Javascript UDF in spark yarn client mode (Sub-task, Closed, liyunzhang)
        115. Commit changes from last round of review on rb (Sub-task, Closed, liyunzhang)
        116. Remove schema tuple reference overhead for replicate join hashmap in POFRJoinSpark (Sub-task, Open, Unassigned)
        117. Upgrade spark to 2.0 (Sub-task, Closed, liyunzhang)
        118. Replace IndexedKey with PigNullableWritable in spark branch (Sub-task, Resolved, Unassigned)
        119. exclude jline in spark dependency (Sub-task, Closed, Ádám Szita)
        120. Duplicate record key info in GlobalRearrangeConverter#ToGroupKeyValueFunction (Sub-task, Closed, liyunzhang)
        121. Investigate why there are duplicated A[3,4] in TestLocationInPhysicalPlan#test in spark mode (Sub-task, Open, Unassigned)
        122. Fix TestPigRunner#simpleMultiQueryTest3 in spark mode for wrong inputStats (Sub-task, Open, Unassigned)
        123. Specify the hdfs path directly to spark and avoid the unnecessary download and upload in SparkLauncher.java (Sub-task, Open, Nándor Kollár)
        124. Implement auto parallelism for pig on spark (Sub-task, Open, Unassigned)

          Activity

            People

            • Assignee: praveenr019 (Praveen Rachabattuni)
            • Reporter: rohini (Rohini Palaniswamy)

            Dates

            • Created:
            • Updated:
            • Resolved:
