[PIG-4059] Pig on Spark - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: spark-branch, 0.17.0
Component/s: spark
Labels:
- spork

Hadoop Flags: Reviewed

Description

Setting up your development environment:
0. Download a Spark release package (Pig on Spark currently supports only Spark 1.6).
1. Check out the Pig Spark branch.

2. Build Pig by running "ant jar", or "ant -Dhadoopversion=23 jar" for hadoop-2.x versions.

3. Configure these environment variables:
export HADOOP_USER_CLASSPATH_FIRST="true"
The "local" and "yarn-client" modes are currently supported. Select one by exporting the SPARK_MASTER environment variable:
export SPARK_MASTER=local
or
export SPARK_MASTER="yarn-client"

4. Run your Pig script.
In local mode:
./pig -x spark_local xxx.pig
In yarn-client mode:
export SPARK_HOME=xx;
export SPARK_JAR=hdfs://example.com:8020/xxxx (the HDFS location to which you uploaded the spark-assembly*.jar)
./pig -x spark xxx.pig
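
The mode selection in steps 3 and 4 can be sketched as a small shell function. This is only an illustration under stated assumptions: pig_spark_cmd is a hypothetical helper name, not something shipped with Pig, and it prints the launch command instead of running it so it can be tried without a Pig or Spark installation. The check that SPARK_HOME and SPARK_JAR are set is likewise an assumption drawn from the yarn-client instructions above.

```shell
#!/bin/sh
# Sketch only: pig_spark_cmd is a hypothetical helper, not part of Pig.
# It prints (does not run) the launch command for a given SPARK_MASTER value.
pig_spark_cmd() {
    master="$1"   # "local" or "yarn-client", as in step 3
    script="$2"   # path to the Pig script, e.g. xxx.pig
    case "$master" in
        local)
            echo "./pig -x spark_local $script"
            ;;
        yarn-client)
            # Assumption: yarn-client mode needs SPARK_HOME and SPARK_JAR (step 4).
            if [ -z "${SPARK_HOME:-}" ] || [ -z "${SPARK_JAR:-}" ]; then
                echo "SPARK_HOME and SPARK_JAR must be set for yarn-client mode" >&2
                return 1
            fi
            echo "./pig -x spark $script"
            ;;
        *)
            echo "unsupported SPARK_MASTER: $master" >&2
            return 1
            ;;
    esac
}
```

After exporting SPARK_MASTER as in step 3, calling pig_spark_cmd "$SPARK_MASTER" xxx.pig prints the command that step 4 would have you type for that mode.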

Attachments

Pig-on-Spark-Design-Doc.pdf	12/Aug/14 13:19	82 kB	Praveen Rachabattuni
Pig-on-Spark-Scope.pdf	30/Jun/15 18:11	549 kB	Mohit Sabharwal

Issue Links

is related to

PIG-3446 Umbrella jira for Pig on Tez	Resolved
PIG-4266 Umbrella jira for unit tests for Spark	Closed

supersedes

PIG-1734 Pig needs a more efficient DAG execution	Resolved

Sub-Tasks

1.	Initial implementation of Pig on Spark	Closed	Praveen Rachabattuni
2.	Initial implementation of unit tests for Pig on Spark	Closed	liyunzhang
3.	Move to Spark 1.x	Closed	Richard Ding
4.	e2e tests for Spark	Closed	Praveen Rachabattuni
5.	Fix classpath error when using pig command with Spark	Resolved	liyunzhang
6.	Make collected group work with Spark	Closed	Praveen Rachabattuni
7.	Make cross join work with Spark	Resolved	Mohit Sabharwal
8.	Implement replicated join in Spark engine	Closed	Mohit Sabharwal
9.	Make skewed join work with Spark	Closed	Praveen Rachabattuni
10.	Make ruby udfs work with Spark	Closed	liyunzhang
11.	Make merge join work with Spark engine	Resolved	Praveen Rachabattuni
12.	Make python udfs work with Spark	Closed	liyunzhang
13.	Make merge-sparse join work with Spark	Closed	Abhishek Agarwal
14.	Make stream work with Spark	Closed	liyunzhang
15.	Copy spark dependencies to lib directory	Closed	Praveen Rachabattuni
16.	Make rank work with Spark	Closed	Carlos Balduz
17.	UDFContext is not initialized in executors when running on Spark cluster	Closed	liyunzhang
18.	Package pig along with dependencies into a fat jar while job submission to Spark cluster	Closed	Praveen Rachabattuni
19.	Avoid packaging spark specific jars into pig fat jar	Closed	Unassigned
20.	Add SparkPlan in spark package	Closed	liyunzhang
21.	Add stats and error reporting for Spark	Closed	Mohit Sabharwal
22.	Move to Spark 1.2	Closed	Mohit Sabharwal
23.	Merge from trunk (1) [Spark Branch]	Closed	Praveen Rachabattuni
24.	Merge from trunk (2) [Spark Branch]	Closed	Praveen Rachabattuni
25.	Upgrade to Spark 1.3	Closed	Mohit Sabharwal
26.	change from "SparkLauncher#physicalToRDD" to "SparkLauncher#sparkPlanToRDD" after using spark plan in SparkLauncher	Closed	liyunzhang
27.	Implement MergeJoin (as regular join) for Spark engine	Closed	Mohit Sabharwal
28.	implement visitSkewedJoin in SparkCompiler	Closed	liyunzhang
29.	Fix the NPE of System.getenv("SPARK_MASTER") in SparkLauncher.java	Closed	liyunzhang
30.	remove unnecessary MR plan code generated in SparkLauncher.java	Resolved	liyunzhang
31.	Make ship work with spark	Closed	liyunzhang
32.	PackageConverter hanging in Spark	Patch Available	Carlos Balduz
33.	StackOverflowError in LIMIT operation on Spark	Patch Available	Carlos Balduz
34.	Error when there is a bag inside an RDD	Closed	Carlos Balduz
35.	"pig.output.lazy" does not work in spark mode	Closed	liyunzhang
36.	e2e tests for Spark can not work in hadoop env	Closed	liyunzhang
37.	SchemaTupleBackend error when working on a Spark 1.1.0 cluster	Open	Unassigned
38.	Order By error after Group By in Spark	Closed	Unassigned
39.	Limit after sort does not work in spark mode	Closed	liyunzhang
40.	Sort the leaves by SparkOperator.operatorKey in SparkLauncher#sparkOperToRDD	Closed	liyunzhang
41.	Remove redundant code, comments in SparkLauncher	Closed	Praveen Rachabattuni
42.	Add apache license header to all spark package source files	Closed	Praveen Rachabattuni
43.	Enable Secondary key sort feature in spark mode	Closed	liyunzhang
44.	Remove unnecessary store and load when POSplit is encountered	Closed	liyunzhang
45.	SparkOperator should correspond to complete Spark job	Closed	Mohit Sabharwal
46.	Enable local mode tests for Spark engine	Closed	Mohit Sabharwal
47.	Remove repetitive org.apache.pig.test.Util#isSparkExecType	Closed	liyunzhang
48.	OutputConsumerIterator should flush buffered records	Resolved	Mohit Sabharwal
49.	Set CROSS operation parallelism for Spark engine	Closed	Mohit Sabharwal
50.	Fix POGlobalRearrangeSpark copy constructor for Spark engine	Closed	Mohit Sabharwal
51.	Modify the test.output value from "no" to "yes" to show more error message	Closed	liyunzhang
52.	Support custom MR partitioners for Spark engine	Closed	Mohit Sabharwal
53.	Fix unit test failure in TestSecondarySortSpark	Closed	liyunzhang
54.	Pass value to MR Partitioners in Spark engine	Open	Mohit Sabharwal
55.	Use "cogroup" spark api to implement "groupby+secondarysort" case in GlobalRearrangeConverter.java	Closed	liyunzhang
56.	Enable "TestPruneColumn" in spark mode	Closed	Xianda Ke
57.	Use newAPIHadoopRDD instead of newAPIHadoopFile	Closed	Mohit Sabharwal
58.	Cleanup: Rename POConverter to RDDConverter	Closed	Mohit Sabharwal
59.	Move tests under 'test-spark' target	Closed	Mohit Sabharwal
60.	Fix unit test failure in TestCase	Closed	Xianda Ke
61.	Enable "TestMultiQueryLocal" in spark mode	Closed	liyunzhang
62.	Enable "TestMultiQuery" in spark mode	Closed	liyunzhang
63.	Fix unit test failures about TestFRJoinNullValue in spark mode	Closed	liyunzhang
64.	Fix unit test failures about MergeJoinConverter in spark mode	Closed	liyunzhang
65.	Enable "TestNullConstant" unit test in spark mode	Closed	Xianda Ke
66.	Implement Merge CoGroup for Spark engine	Closed	liyunzhang
67.	Clean up: refactor the package import order in the files under pig/src/org/apache/pig/backend/hadoop/executionengine/spark according to certain rule	Closed	liyunzhang
68.	fix a bug when copying Jar to SparkJob working directory	Closed	Xianda Ke
69.	Enable "TestDefaultDateTimeZone" unit tests in spark mode	Closed	liyunzhang
70.	Enable "TestRank1","TestRank3" unit tests in spark mode	Closed	Xianda Ke
71.	Enable "TestOrcStorage" unit test in spark mode	Closed	liyunzhang
72.	Fix remaining unit test failures about "TestHBaseStorage" in spark mode	Closed	liyunzhang
73.	Fix unit test failures about TestAssert	Closed	Xianda Ke
74.	Enable "TestLocationInPhysicalPlan" in spark mode	Closed	liyunzhang
75.	Fix null keys join in SkewedJoin in spark mode	Closed	liyunzhang
76.	Fix UT errors of TestPigRunner in Spark mode	Closed	Xianda Ke
77.	Cleanup: change the indent size of some files of pig on spark project from 2 to 4 space	Closed	liyunzhang
78.	Enable Illustrate in spark	In Progress	Jakov Rabinovits
79.	Skip TestCubeOperator.testIllustrate and TestMultiQueryLocal.testMultiQueryWithIllustrate	Closed	liyunzhang
80.	Update hadoop version to enable Spark output statistics	Closed	Xianda Ke
81.	Fix records count issues in output statistics	Closed	Xianda Ke
82.	Support hadoop-like Counter using spark accumulator	Closed	Xianda Ke
83.	Support InputStats in spark mode	Closed	Xianda Ke
84.	Fix unit test failures in org.apache.pig.test.TestScriptLanguageJavaScript	Closed	Xianda Ke
85.	Add Spark Unit Tests for SparkPigStats	Open	Xianda Ke
86.	Fix UT failures in TestPigServerLocal	Closed	Xianda Ke
87.	Enable Pig on Spark to run on Yarn Client mode	Closed	Srikanth Sundarrajan
88.	Operators with multiple predecessors fail under multiquery optimization	Closed	liyunzhang
89.	Enable Pig on Spark to run on Yarn Cluster mode	Resolved	Srikanth Sundarrajan
90.	Class conflicts: Kryo bundled in spark vs kryo bundled with pig	Closed	Srikanth Sundarrajan
91.	Enable dynamic resource allocation/de-allocation on Yarn backends	Closed	Srikanth Sundarrajan
92.	Support combine for spark mode	Closed	Pallavi Rao
93.	Tests in TestCombiner fail due to missing leveldb dependency	Closed	Pallavi Rao
94.	Spark related JARs are not included when importing project via IDE	Closed	Xianda Ke
95.	the value of $SPARK_DIST_CLASSPATH in pig file is invalid	Resolved	liyunzhang
96.	Ensure spark can be run as PIG action in Oozie	Open	Prateek Vaishnav
97.	Fix UT failures in TestScriptLanguage	Closed	Xianda Ke
98.	Ensure GroupBy is optimized for all algebraic Operations	Closed	Pallavi Rao
99.	Refactor SparkLauncher for spark engine	Closed	liyunzhang
100.	Enable "pig.disable.counter" for spark engine	Closed	liyunzhang
101.	the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine	Resolved	liyunzhang
102.	Implement to collect metric data like getSMMSpillCount() in SparkJobStats	Open	Unassigned
103.	Merge trunk[3] into spark branch	Closed	Pallavi Rao
104.	Collected group doesn't work in some cases	Closed	Xianda Ke
105.	pig.noSplitCombination=true should always be set internally for a merge join	Closed	Xianda Ke
106.	Merge trunk[4] into spark branch	Closed	Pallavi Rao
107.	Last record is missing in STREAM operator	Closed	Xianda Ke
108.	Need upgrade snappy-java.version to 1.1.1.3	Closed	liyunzhang
109.	OutputConsumeIterator can't handle the last buffered tuples for some Operators	Closed	Xianda Ke
110.	Add PigSplit#getLocationInfo to fix the NPE found in log in spark mode	Resolved	liyunzhang
111.	Implement FR join by broadcasting small rdd not making more copies of data	Closed	Nándor Kollár
112.	Fix unit test failure after PIG-4771's patch was checked in	Closed	liyunzhang
113.	The number of records of input file is calculated wrongly in spark mode in multiquery case	Closed	Ádám Szita
114.	Fail to use Javascript UDF in spark yarn client mode	Closed	liyunzhang
115.	Commit changes from last round of review on rb	Closed	liyunzhang
116.	Remove schema tuple reference overhead for replicate join hashmap in POFRJoinSpark	Open	Unassigned
117.	Upgrade spark to 2.0	Closed	liyunzhang
118.	Replace IndexedKey with PigNullableWritable in spark branch	Resolved	Unassigned
119.	exclude jline in spark dependency	Closed	Ádám Szita
120.	Duplicate record key info in GlobalRearrangeConverter#ToGroupKeyValueFunction	Closed	liyunzhang
121.	Investigate why there are duplicated A[3,4] in TestLocationInPhysicalPlan#test in spark mode	Open	Unassigned
122.	Fix TestPigRunner#simpleMultiQueryTest3 in spark mode for wrong inputStats	Open	Unassigned
123.	Specify the hdfs path directly to spark and avoid the unnecessary download and upload in SparkLauncher.java	Open	Nándor Kollár
124.	Implement auto parallelism for pig on spark	Open	Unassigned

People

Assignee: Praveen Rachabattuni

Reporter: Rohini Palaniswamy

Votes: 22

Watchers: 60

Dates

Created: 17/Jul/14 14:54

Updated: 21/Jun/17 09:15

Resolved: 29/May/17 21:05