  1. Pig
  2. PIG-4059

Pig on Spark

Details

    • New Feature
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • spark-branch, 0.17.0
    • spark
    • Reviewed

    Description

      Setting up your development environment:
      0. Download a Spark release package (Pig on Spark currently supports only Spark 1.6).
      1. Check out the Pig Spark branch.

      2. Build Pig by running "ant jar", or "ant -Dhadoopversion=23 jar" for hadoop-2.x versions.

      3. Configure these environment variables:
      export HADOOP_USER_CLASSPATH_FIRST="true"
      The "local" and "yarn-client" modes are currently supported. Select one by exporting the SPARK_MASTER environment variable:
      export SPARK_MASTER=local
      or
      export SPARK_MASTER="yarn-client"

      4. Run a Pig script. In local mode:
      ./pig -x spark_local xxx.pig
      In yarn-client mode:
      export SPARK_HOME=xx
      export SPARK_JAR=hdfs://example.com:8020/xxxx (the HDFS location where you uploaded the spark-assembly*.jar)
      ./pig -x spark xxx.pig
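
The environment setup in steps 3 and 4 can be collected into a small shell sketch. The function names configure_pig_spark and pig_command are hypothetical helpers for illustration and are not part of Pig. The sketch only prints the launch command instead of running it, and the SPARK_HOME and SPARK_JAR values remain placeholders that you must supply yourself.

```shell
#!/bin/sh
# Sketch only: configure_pig_spark and pig_command are hypothetical helpers,
# not commands shipped with Pig or Spark.

# Export the variables described in the steps above for the chosen mode.
configure_pig_spark() {
    mode="$1"
    export HADOOP_USER_CLASSPATH_FIRST="true"
    case "$mode" in
        local)
            export SPARK_MASTER="local"
            PIG_EXECTYPE="spark_local"
            ;;
        yarn-client)
            # yarn-client mode additionally needs SPARK_HOME and SPARK_JAR.
            if [ -z "${SPARK_HOME:-}" ] || [ -z "${SPARK_JAR:-}" ]; then
                echo "SPARK_HOME and SPARK_JAR must be set for yarn-client mode" >&2
                return 1
            fi
            export SPARK_MASTER="yarn-client"
            PIG_EXECTYPE="spark"
            ;;
        *)
            echo "unsupported mode: $mode" >&2
            return 1
            ;;
    esac
}

# Print (do not run) the Pig invocation for the configured mode.
pig_command() {
    echo "./pig -x ${PIG_EXECTYPE} $1"
}
```

For example, calling `configure_pig_spark local` and then `pig_command xxx.pig` prints the local-mode invocation from step 4. Run the printed command from the directory that contains the pig launcher.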

      Attachments

        1. Pig-on-Spark-Design-Doc.pdf
          82 kB
          Praveen Rachabattuni
        2. Pig-on-Spark-Scope.pdf
          549 kB
          Mohit Sabharwal

        Issue Links

          1.
          Initial implementation of Pig on Spark Sub-task Closed Praveen Rachabattuni
          2.
          Initial implementation of unit tests for Pig on Spark Sub-task Closed liyunzhang
          3.
          Move to Spark 1.x Sub-task Closed Richard Ding
          4.
          e2e tests for Spark Sub-task Closed Praveen Rachabattuni
          5.
          Fix classpath error when using pig command with Spark Sub-task Resolved liyunzhang
          6.
          Make collected group work with Spark Sub-task Closed Praveen Rachabattuni
          7.
          Make cross join work with Spark Sub-task Resolved Mohit Sabharwal
          8.
          Implement replicated join in Spark engine Sub-task Closed Mohit Sabharwal
          9.
          Make skewed join work with Spark Sub-task Closed Praveen Rachabattuni
          10.
          Make ruby udfs work with Spark Sub-task Closed liyunzhang
          11.
          Make merge join work with Spark engine Sub-task Resolved Praveen Rachabattuni
          12.
          Make python udfs work with Spark Sub-task Closed liyunzhang
          13.
          Make merge-sparse join work with Spark Sub-task Closed Abhishek Agarwal
          14.
          Make stream work with Spark Sub-task Closed liyunzhang
          15.
          Copy spark dependencies to lib directory Sub-task Closed Praveen Rachabattuni
          16.
          Make rank work with Spark Sub-task Closed Carlos Balduz
          17.
          UDFContext is not initialized in executors when running on Spark cluster Sub-task Closed liyunzhang
          18.
          Package pig along with dependencies into a fat jar while job submission to Spark cluster Sub-task Closed Praveen Rachabattuni
          19.
          Avoid packaging spark specific jars into pig fat jar Sub-task Closed Unassigned
          20.
          Add SparkPlan in spark package Sub-task Closed liyunzhang
          21.
          Add stats and error reporting for Spark Sub-task Closed Mohit Sabharwal
          22.
          Move to Spark 1.2 Sub-task Closed Mohit Sabharwal
          23.
          Merge from trunk (1) [Spark Branch] Sub-task Closed Praveen Rachabattuni
          24.
          Merge from trunk (2) [Spark Branch] Sub-task Closed Praveen Rachabattuni
          25.
          Upgrade to Spark 1.3 Sub-task Closed Mohit Sabharwal
          26.
          change from "SparkLauncher#physicalToRDD" to "SparkLauncher#sparkPlanToRDD" after using spark plan in SparkLauncher Sub-task Closed liyunzhang
          27.
          Implement MergeJoin (as regular join) for Spark engine Sub-task Closed Mohit Sabharwal
          28.
          implement visitSkewedJoin in SparkCompiler Sub-task Closed liyunzhang
          29.
          Fix the NPE of System.getenv("SPARK_MASTER") in SparkLauncher.java Sub-task Closed liyunzhang
          30.
          remove unnecessary MR plan code generated in SparkLauncher.java Sub-task Resolved liyunzhang
          31.
          Make ship work with spark Sub-task Closed liyunzhang
          32.
          PackageConverter hanging in Spark Sub-task Patch Available Carlos Balduz
          33.
          StackOverflowError in LIMIT operation on Spark Sub-task Patch Available Carlos Balduz
          34.
          Error when there is a bag inside an RDD Sub-task Closed Carlos Balduz
          35.
          "pig.output.lazy" not works in spark mode Sub-task Closed liyunzhang
          36.
          e2e tests for Spark can not work in hadoop env Sub-task Closed liyunzhang
          37.
          SchemaTupleBackend error when working on a Spark 1.1.0 cluster Sub-task Open Unassigned
          38.
          Order By error after Group By in Spark Sub-task Closed Unassigned
          39.
          Limit after sort does not work in spark mode Sub-task Closed liyunzhang
          40.
          Sort the leaves by SparkOperator.operatorKey in SparkLauncher#sparkOperToRDD Sub-task Closed liyunzhang
          41.
          Remove redundant code, comments in SparkLauncher Sub-task Closed Praveen Rachabattuni
          42.
          Add apache license header to all spark package source files Sub-task Closed Praveen Rachabattuni
          43.
          Enable Secondary key sort feature in spark mode Sub-task Closed liyunzhang
          44.
          Remove unnecessary store and load when POSplit is encountered Sub-task Closed liyunzhang
          45.
          SparkOperator should correspond to complete Spark job Sub-task Closed Mohit Sabharwal
          46.
          Enable local mode tests for Spark engine Sub-task Closed Mohit Sabharwal
          47.
          Remove repetitive org.apache.pig.test.Util#isSparkExecType Sub-task Closed liyunzhang
          48.
          OutputConsumerIterator should flush buffered records Sub-task Resolved Mohit Sabharwal
          49.
          Set CROSS operation parallelism for Spark engine Sub-task Closed Mohit Sabharwal
          50.
          Fix POGlobalRearrangeSpark copy constructor for Spark engine Sub-task Closed Mohit Sabharwal
          51.
          Modify the test.output value from "no" to "yes" to show more error message Sub-task Closed liyunzhang
          52.
          Support custom MR partitioners for Spark engine Sub-task Closed Mohit Sabharwal
          53.
          Fix unit test failure in TestSecondarySortSpark Sub-task Closed liyunzhang
          54.
          Pass value to MR Partitioners in Spark engine Sub-task Open Mohit Sabharwal
          55.
          Use "cogroup" spark api to implement "groupby+secondarysort" case in GlobalRearrangeConverter.java Sub-task Closed liyunzhang
          56.
          Enable "TestPruneColumn" in spark mode Sub-task Closed Xianda Ke
          57.
          Use newAPIHadoopRDD instead of newAPIHadoopFile Sub-task Closed Mohit Sabharwal
          58.
          Cleanup: Rename POConverter to RDDConverter Sub-task Closed Mohit Sabharwal
          59.
          Move tests under 'test-spark' target Sub-task Closed Mohit Sabharwal
          60.
          Fix unit test failure in TestCase Sub-task Closed Xianda Ke
          61.
          Enable "TestMultiQueryLocal" in spark mode Sub-task Closed liyunzhang
          62.
          Enable "TestMultiQuery" in spark mode Sub-task Closed liyunzhang
          63.
          Fix unit test failures about TestFRJoinNullValue in spark mode Sub-task Closed liyunzhang
          64.
          Fix unit test failures about MergeJoinConverter in spark mode Sub-task Closed liyunzhang
          65.
          Enable "TestNullConstant" unit test in spark mode Sub-task Closed Xianda Ke
          66.
          Implement Merge CoGroup for Spark engine Sub-task Closed liyunzhang
          67.
          Clean up: refactor the package import order in the files under pig/src/org/apache/pig/backend/hadoop/executionengine/spark according to certain rule Sub-task Closed liyunzhang
          68.
          fix a bug when copying Jar to SparkJob working directory Sub-task Closed Xianda Ke
          69.
          Enable "TestDefaultDateTimeZone" unit tests in spark mode Sub-task Closed liyunzhang
          70.
          Enable "TestRank1","TestRank3" unit tests in spark mode Sub-task Closed Xianda Ke
          71.
          Enable "TestOrcStorage" unit test in spark mode Sub-task Closed liyunzhang
          72.
          Fix remaining unit test failures about "TestHBaseStorage" in spark mode Sub-task Closed liyunzhang
          73.
          Fix unit test failures about TestAssert Sub-task Closed Xianda Ke
          74.
          Enable "TestLocationInPhysicalPlan" in spark mode Sub-task Closed liyunzhang
          75.
          Fix null keys join in SkewedJoin in spark mode Sub-task Closed liyunzhang
          76.
          Fix UT errors of TestPigRunner in Spark mode Sub-task Closed Xianda Ke
          77.
          Cleanup: change the indent size of some files of pig on spark project from 2 to 4 space Sub-task Closed liyunzhang
          78.
          Enable Illustrate in spark Sub-task In Progress Jakov Rabinovits
          79.
          Skip TestCubeOperator.testIllustrate and TestMultiQueryLocal.testMultiQueryWithIllustrate Sub-task Closed liyunzhang
          80.
          Update hadoop version to enable Spark output statistics Sub-task Closed Xianda Ke
          81.
          Fix records count issues in output statistics Sub-task Closed Xianda Ke
          82.
          Support hadoop-like Counter using spark accumulator Sub-task Closed Xianda Ke
          83.
          Support InputStats in spark mode Sub-task Closed Xianda Ke
          84.
          Fix unit test failures in org.apache.pig.test.TestScriptLanguageJavaScript Sub-task Closed Xianda Ke
          85.
          Add Spark Unit Tests for SparkPigStats Sub-task Open Xianda Ke
          86.
          Fix UT failures in TestPigServerLocal Sub-task Closed Xianda Ke
          87.
          Enable Pig on Spark to run on Yarn Client mode Sub-task Closed Srikanth Sundarrajan
          88.
          Operators with multiple predecessors fail under multiquery optimization Sub-task Closed liyunzhang
          89.
          Enable Pig on Spark to run on Yarn Cluster mode Sub-task Resolved Srikanth Sundarrajan
          90.
          Class conflicts: Kryo bundled in spark vs kryo bundled with pig Sub-task Closed Srikanth Sundarrajan
          91.
          Enable dynamic resource allocation/de-allocation on Yarn backends Sub-task Closed Srikanth Sundarrajan
          92.
          Support combine for spark mode Sub-task Closed Pallavi Rao
          93.
          Tests in TestCombiner fail due to missing leveldb dependency Sub-task Closed Pallavi Rao
          94.
          Spark related JARs are not included when importing project via IDE Sub-task Closed Xianda Ke
          95.
          the value of $SPARK_DIST_CLASSPATH in pig file is invalid Sub-task Resolved liyunzhang
          96.
          Ensure spark can be run as PIG action in Oozie Sub-task Open Prateek Vaishnav
          97.
          Fix UT failures in TestScriptLanguage Sub-task Closed Xianda Ke
          98.
          Ensure GroupBy is optimized for all algebraic Operations Sub-task Closed Pallavi Rao
          99.
          Refactor SparkLauncher for spark engine Sub-task Closed liyunzhang
          100.
          Enable "pig.disable.counter" for spark engine Sub-task Closed liyunzhang
          101.
          the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine Sub-task Resolved liyunzhang
          102.
          Implement to collect metric data like getSMMSpillCount() in SparkJobStats Sub-task Open Unassigned
          103.
          Merge trunk[3] into spark branch Sub-task Closed Pallavi Rao
          104.
          Collected group doesn't work in some cases Sub-task Closed Xianda Ke
          105.
          pig.noSplitCombination=true should always be set internally for a merge join Sub-task Closed Xianda Ke
          106.
          Merge trunk[4] into spark branch Sub-task Closed Pallavi Rao
          107.
          Last record is missing in STREAM operator Sub-task Closed Xianda Ke
          108.
          Need upgrade snappy-java.version to 1.1.1.3 Sub-task Closed liyunzhang
          109.
          OutputConsumeIterator can't handle the last buffered tuples for some Operators Sub-task Closed Xianda Ke
          110.
          Add PigSplit#getLocationInfo to fix the NPE found in log in spark mode Sub-task Resolved liyunzhang
          111.
          Implement FR join by broadcasting small rdd not making more copies of data Sub-task Closed Nándor Kollár
          112.
          Fix unit test failure after PIG-4771's patch was checked in Sub-task Closed liyunzhang
          113.
          The number of records of input file is calculated wrongly in spark mode in multiquery case Sub-task Closed Ádám Szita
          114.
          Fail to use Javascript UDF in spark yarn client mode Sub-task Closed liyunzhang
          115.
          Commit changes from last round of review on rb Sub-task Closed liyunzhang
          116.
          Remove schema tuple reference overhead for replicate join hashmap in POFRJoinSpark Sub-task Open Unassigned
          117.
          Upgrade spark to 2.0 Sub-task Closed liyunzhang
          118.
          Replace IndexedKey with PigNullableWritable in spark branch Sub-task Resolved Unassigned
          119.
          exclude jline in spark dependency Sub-task Closed Ádám Szita
          120.
          Duplicate record key info in GlobalRearrangeConverter#ToGroupKeyValueFunction Sub-task Closed liyunzhang
          121.
          Investigate why there are duplicated A[3,4] in TestLocationInPhysicalPlan#test in spark mode Sub-task Open Unassigned
          122.
          Fix TestPigRunner#simpleMultiQueryTest3 in spark mode for wrong inputStats Sub-task Open Unassigned
          123.
          Specify the hdfs path directly to spark and avoid the unnecessary download and upload in SparkLauncher.java Sub-task Open Nándor Kollár
          124.
          Implement auto parallelism for pig on spark Sub-task Open Unassigned

          Activity

            People

              praveenr019 Praveen Rachabattuni
              rohini Rohini Palaniswamy
              Votes: 22
              Watchers: 60
