PIG-2417

Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.12.0
    • Fix Version/s: 0.12.0
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available
    • Hadoop Flags: Reviewed

      Description

      The goal of Streaming UDFs is to allow users to easily write UDFs in scripting languages with no JVM implementation or a limited JVM implementation. The initial proposal is outlined here: https://cwiki.apache.org/confluence/display/PIG/StreamingUDFs.

      In order to implement this we need new syntax to distinguish a streaming UDF from an embedded JVM UDF. I'd propose something like the following (although I'm not sure 'language' is the best term to be using):

      define my_streaming_udfs language('python') ship('my_streaming_udfs.py')

      We'll also need a language-specific controller script that gets shipped to the cluster which is responsible for reading the input stream, deserializing the input data, passing it to the user written script, serializing that script output, and writing that to the output stream.
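      For illustration, here is a minimal sketch of what such a controller loop could look like. It is an assumption-laden simplification: it pretends each input record is a single tab-delimited text line and ignores schemas, nested types, and error handling, all of which the real controller script would have to deal with.

          #!/usr/bin/env python
          # Hypothetical controller loop (a sketch, not the actual controller.py).
          import sys

          def run(user_module_name, func_name):
              # Import the user-written script and look up the UDF by name.
              udf = getattr(__import__(user_module_name), func_name)
              for line in sys.stdin:
                  # Deserialize the input record (assumed tab-delimited here).
                  fields = line.rstrip('\n').split('\t')
                  # Pass the fields to the user function, then serialize its
                  # output and write it to the output stream.
                  sys.stdout.write(str(udf(*fields)) + '\n')
                  sys.stdout.flush()

          if __name__ == '__main__':
              run(sys.argv[1], sys.argv[2])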

      Finally, we'll need to add a StreamingUDF class that extends EvalFunc. This class will likely share some of the existing code in POStream and ExecutableManager (where it makes sense to pull out shared code) to stream data to/from the controller script.

      One alternative approach to creating the StreamingUDF EvalFunc is to use the POStream operator directly. This would involve inserting the POStream operator instead of the POUserFunc operator whenever we encountered a streaming UDF while building the physical plan. This approach seemed problematic because it would require a lot of changes in order to support POStream in all of the places we want to be able to use UDFs (for example, to operate on a single field inside a FOREACH statement).

      1. streaming3.patch
        99 kB
        Jeremy Karn
      2. streaming2.patch
        115 kB
        Jeremy Karn
      3. streaming.patch
        44 kB
        Jeremy Karn
      4. PIG-2417-unicode.patch
        1 kB
        Jeremy Karn
      5. PIG-2417-e2e.patch
        15 kB
        Jeremy Karn
      6. PIG-2417-9-2.patch
        32 kB
        Jeremy Karn
      7. PIG-2417-9-1.patch
        2 kB
        Daniel Dai
      8. PIG-2417-9.patch
        169 kB
        Jeremy Karn
      9. PIG-2417-8.patch
        153 kB
        Jeremy Karn
      10. PIG-2417-7.patch
        172 kB
        Jeremy Karn
      11. PIG-2417-6.patch
        152 kB
        Jeremy Karn
      12. PIG-2417-5.patch
        154 kB
        Jeremy Karn
      13. PIG-2417-4.patch
        109 kB
        Jeremy Karn

        Issue Links

          • This issue requires PIG-3255

          Activity

          Jeremy Karn created issue -
          Alan Gates added a comment -

          This looks interesting. Some thoughts regarding the open questions on your wiki.

          We'd want to either update DEFINE or add a new command to support streaming UDFs.

          We should definitely use DEFINE for this as well. It seems it should stay as close to the JVM based UDF defines as possible. I'm wondering if it could follow the same:

          DEFINE 'filename' using <language> as <namespace>

          that the JVM based UDFs use, and just add new language tokens that indicate the streaming nature, such as 'streaming_python', 'streaming_perl', etc.

          How can we return the output type information back to pig? Perhaps we could support something like the @outputSchema decorator in python at least, and have the controller script gather that information and provide it back to pig in a separate file?

           There are two sides to this: one, how you communicate the information through the channel you're creating, and two, how the UDF writer communicates it in their UDF. The design will need to propose a way for implementations of streaming UDFs for various languages to communicate schema information back to Pig. But how the UDF writer communicates it should be language specific. Wherever possible it should mimic the choices made in the JVM-based implementations. So a streaming Python implementation should use the same @outputSchema as Jython does.
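           (For illustration, a sketch of how a UDF writer would declare the output schema with such a decorator, mirroring the Jython convention. The pig_util import reflects the helper module that eventually ships alongside the controller script; the schema string is an example:)

               from pig_util import outputSchema

               @outputSchema('shortened:chararray')
               def shorten(word):
                   # Declared to return a chararray named 'shortened'.
                   return word[:6]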

          How can we return the output type information back to pig? Perhaps we could support something like the @outputSchema decorator in python at least, and have the controller script gather that information and provide it back to pig in a separate file?

          This should be a Java property for each language implementation. Something like pig.streaming_udf.executable.python.

          Have you done any prototyping? I'm curious how the performance of this will compare against the JVM based implementations. I realize you are doing this to extend functionality, not get performance.

          Jeremy Karn added a comment -

          Here's a rough first patch that only works with streaming python UDFs that return a String. So it doesn't handle communicating the type information yet and you have to manually copy the python controller script and user scripts into the pig path.

          I went with Alan's suggestion of modifying the REGISTER command (I think he meant register instead of define) to support streaming_python as a language.

          I haven't had a chance to do any meaningful performance tests - but as a very rough guide the following simple example running locally took 130 seconds when I used streaming_python and 44 seconds when I used jython.

          Pig:

          REGISTER 'test_streaming_udfs.py' using streaming_python as test;

          A = LOAD 'excite.log.bz2' AS (user_id:chararray, timestamp:chararray, query:chararray);
          
          B = FOREACH A GENERATE test.shorten(user_id) as new_user_id:chararray, timestamp, query;
          
          STORE B INTO 'short_ids';
          

          Python:

          def shorten(word):
              return word[:6]
          
          Jeremy Karn made changes -
          Status: Open → Patch Available
          Jeremy Karn made changes -
          Attachment python_streaming_string.patch [ 12507125 ]
          Jeremy Karn made changes -
          Attachment python_streaming_string.patch [ 12507125 ]
          Jeremy Karn made changes -
          Attachment python_streaming_string.patch [ 12507126 ]
          Alan Gates added a comment -

          Jeremy,

          For the patch to apply in an SVN environment, you need to generate it with 'git diff --no-prefix'.

          Alan Gates made changes -
          Status: Patch Available → Open
          Jeremy Karn made changes -
          Attachment python_streaming_string.patch [ 12507126 ]
          Jeremy Karn made changes -
          Attachment streaming.patch [ 12507623 ]
          Jeremy Karn added a comment -

          Ah, ok. I've generated this last patch with --no-prefix.

          It includes the previous code changes as well as:

          • code to pass the input schema from Pig to Python
          • Python code to deserialize the input from Standard Input into Python types
          • Python unit tests for the Python controller script.

           I'm not really sure of the best way to include the Python unit tests. Right now I have them in test/python/streaming/test_controller.py, but in order to run them you need to copy that file into the same directory as the controller file.

          Ashutosh Chauhan made changes -
          Assignee Jeremy Karn [ jeremykarn ]
          Jeremy Karn made changes -
          Attachment streaming2.patch [ 12508535 ]
          Jeremy Karn added a comment -

          This latest patch (streaming2.patch) contains all of the functionality necessary for writing streaming UDFs.

          Registering python files still works as outlined above.

           Declaring the output schema of your Python UDF uses an outputSchema decorator (the same syntax used for Jython UDFs). When the user registers the file and Pig scans for functions, it also looks for the outputSchema decorator and only registers functions that have it. The schema string from the decorator is passed to the StreamingUDF instance(s) so that it knows what output schema to expect from the streaming process.
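           (As an illustration of the mechanism described above, the decorator can simply attach the schema string to the function object so the registration-time scan can find it. A rough, hypothetical sketch:)

               import inspect

               def outputSchema(schema_str):
                   # Record the declared Pig schema on the function object.
                   def wrap(func):
                       func.output_schema = schema_str
                       return func
                   return wrap

               def find_udfs(module):
                   # Register only the functions that carry a schema declaration.
                   return {name: func.output_schema
                           for name, func in inspect.getmembers(module, inspect.isfunction)
                           if hasattr(func, 'output_schema')}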

          Performance:

          I haven't done exhaustive testing, profiling, or tuning but right now it looks like small data sets using standalone hadoop are about 2-3 times slower using python streaming udfs instead of jython udfs.

          Running similar scripts on a small data set but on a cluster improves a bit and the python streaming udfs are twice as slow.

          When you move up to much larger data sets and run on the cluster I'm seeing python streaming udfs being around 50% slower than equivalent jython udfs.

          The code still has a few bugs and I need to add unit tests for the Pig changes I've made but I'd definitely appreciate any feedback on what's already done.

          Jeremy Karn made changes -
          Attachment streaming3.patch [ 12509992 ]
          Jeremy Karn added a comment -

          This patch includes:

           • Better passing of null values to/from the UDF
           • 3-character delimiters when serializing input/output to/from the UDF, to avoid conflicts when the delimiters appear in the data (see the sketch after this list)
           • Improved performance
           • A fix to illustrate so that it works with streaming UDFs.
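           (A hypothetical sketch of the multi-character delimiter idea from the second bullet; the actual delimiter byte sequences in the patch may differ:)

               # Three-character delimiters are far less likely to collide with
               # real data than a single tab or comma would be.
               FIELD_DELIM = '\x1f_\x1f'   # separates fields within a tuple
               NULL_MARKER = '\x01-\x01'   # distinguishes null from empty string

               def serialize_tuple(fields):
                   return FIELD_DELIM.join(
                       NULL_MARKER if f is None else str(f) for f in fields)

               def deserialize_tuple(line):
                   return [None if f == NULL_MARKER else f
                           for f in line.split(FIELD_DELIM)]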

          In terms of performance I ran tests with 4 scripts using both streaming python and jython udfs. With 3 of the scripts the streaming python was faster (5%, 7%, and 33% faster) and in 1 case the streaming python was 80% slower.

          The script with the poor performance involved passing really big tuples to the udf (one tuple serialized to a 30 meg string).

          Jeremy Karn made changes -
          Status: Open → Patch Available
          Russell Jurney added a comment -

          I'm going to try this out.

          Alan Gates added a comment -

          Patch no longer applies cleanly to trunk.

          Alan Gates made changes -
          Status: Patch Available → Open
          Jeremy Karn made changes -
          Attachment PIG-2417-4.patch [ 12571279 ]
          Jeremy Karn added a comment -

           I've attached a new patch that applies onto trunk. It needs updating in a number of places before it's ready to be merged, but it looks like I'm not going to have a chance to do that in the near future. If someone else wants to take a stab at cleaning it up and getting it ready for trunk, that would be great; feel free to email me with any questions.

          Russell Jurney added a comment -

          I'll do clean-up, plz advise.

          Jeremy Karn added a comment -

           Jonathan Coveney had expressed interest in looking at the streaming Python work we've done, so this patch is just to get our code available for people to look at. I updated the code to work on trunk, but I had to do it in a quick and dirty way. Here's a list of some specific things I know need work (and there are probably a few more I haven't thought of that would come out in review):

          1. In the serialization/deserialization code I don't support any of the new data types added to pig since 0.9.

          2. When I first wrote this code I pulled some common logic out of ExecutableManager into a class called StreamingUtil. ExecutableManager has changed enough since 0.9 that it wasn't straightforward to figure out how it should work now so there's some duplicated logic in StreamingUtil and ExecutableManager.

          3. There's some Mortar specific wording in a couple of places and a couple of places in StreamingUDF where I'm handling the cases that come up with how we run Pig/Hadoop but that might need to be more generic/robust to work for everyone out of the box.

          4. There's some exception handling decisions and some code for capturing standard output from the UDF for illustrate that might not make much sense without the rest of our illustrate changes.

           5. It might make sense to use a more efficient serialization/deserialization method. I tried to use the existing code (just adding code to handle cases that didn't work before), but it's probably not the most efficient approach. I'm not sure if this is something that would need to be tackled now or if it could be a future enhancement.

          Jonathan Coveney added a comment -

          Russell,

          If you want to take first stab at cleanup that would be excellent (especially adding tests and some documentation so people know how to use it). You can take note of any deeper technical Pig stuff that needs to be worked on, and then I can take over that portion.

          Thoughts?

          Jeremy Karn added a comment -

          Here's an updated patch that I think should be ready for review (review board coming soon).

           Aside from the streaming Python UDFs, this patch also contains some logic for capturing output from the Python process that doesn't do much yet. However, I'm hoping to get a patch up soon with Mortar's illustrate changes, and that will take advantage of the captured output.

           One thing that's still outstanding is documentation changes. Should I just add a section similar to http://pig.apache.org/docs/r0.11.1/udf.html#python-udfs for streaming python?

          Jeremy Karn made changes -
          Attachment PIG-2417-5.patch [ 12599706 ]
          Jeremy Karn added a comment -

          Here's the review board: https://reviews.apache.org/r/13781/

          Jeremy Karn made changes -
          Status: Open → In Progress
          Jeremy Karn made changes -
          Status: In Progress → Open
          Jeremy Karn made changes -
          Patch Info: Patch Available
          Jeremy Karn made changes -
          Status: Open → Patch Available
          Affects Version/s: 0.12
          Affects Version/s: 0.11
          Jeremy Karn made changes -
          Assignee Jeremy Karn [ jeremykarn ]
          Jeremy Karn made changes -
          Fix Version/s 0.12 [ 12323380 ]
          Jeremy Karn made changes -
          Attachment PIG-2417-6.patch [ 12601437 ]
          Jeremy Karn added a comment -

          Latest patch just has a couple small changes so that it applies cleanly after PIG-3419.

          Jeremy Karn made changes -
          Attachment PIG-2417-7.patch [ 12601668 ]
          Jeremy Karn made changes -
          Attachment PIG-2417-8.patch [ 12601981 ]
          Jeremy Karn added a comment -

          Latest patch has changes from code review. I still need to add e2e tests.

          Jeremy Karn made changes -
          Attachment PIG-2417-e2e.patch [ 12602202 ]
          Jeremy Karn added a comment -

          PIG-2417-e2e.patch contains e2e tests for streaming python (mostly just copied from the jython tests).

          Rohini Palaniswamy made changes -
          Link This issue requires PIG-3255 [ PIG-3255 ]
          Daniel Dai added a comment -

          Jeremy Karn
           With PIG-3255 checked in, do you want to add this optimization? Also, can you respond to my comments on the review board?

          Jeremy Karn added a comment -

          I'll update the patch (probably tomorrow) to take advantage of PIG-3255.

          I think the only outstanding comment in the review board is how the logging works with Hadoop2. I'm hoping to get a chance to test that in the next couple of days.

          Daniel Dai added a comment -

           Thanks. I'd like to check this in before we branch.

          Jeremy Karn made changes -
          Attachment PIG-2417-9.patch [ 12603921 ]
          Jeremy Karn added a comment -

          PIG-2417-9.patch contains everything including the e2e tests. I've incorporated the changes from PIG-3255. I wasn't able to test against a Hadoop2 cluster but it looks like hadoop.log.dir is still a valid property there and I've made the logic around determining the log directory a little more robust.

          Daniel Dai added a comment -

           Running e2e tests on Hadoop 2, I hit the following stack trace on the map side:

          	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:338)
          	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
          	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298)
          	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:282)
          	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:277)
          	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
          	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
          	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
          	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
          	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at javax.security.auth.Subject.doAs(Subject.java:396)
          	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1477)
          	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
          Caused by: java.lang.NullPointerException
          	at org.apache.pig.impl.builtin.StreamingUDF.getControllerPath(StreamingUDF.java:252)
          	at org.apache.pig.impl.builtin.StreamingUDF.constructCommand(StreamingUDF.java:183)
          	at org.apache.pig.impl.builtin.StreamingUDF.startUdfController(StreamingUDF.java:147)
          	at org.apache.pig.impl.builtin.StreamingUDF.initialize(StreamingUDF.java:140)
          	at org.apache.pig.impl.builtin.StreamingUDF.exec(StreamingUDF.java:130)
          	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)
          	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextInteger(POUserFunc.java:379)
          	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:321)
          

          I will try to take a look tonight.

          Jeremy Karn added a comment -

           My guess is that it's caused by the code in StreamingUDF at line 157:

                  Configuration conf = UDFContext.getUDFContext().getJobConf();
                  String jarPath = conf.get("mapred.jar");
          

          I'm not sure how that should work with Hadoop 2.

          Daniel Dai added a comment -

          Yes, "mapred.jar" does not physically exist in Hadoop 2:
          jarPath: /tmp/hadoop-yarn/staging/daijy/.staging/job_1379571725164_0007/job.jar
          '/tmp/hadoop-yarn/staging/daijy/.staging/job_1379571725164_0007/job.jar' does not exist

           But I think job.jar should already be in the classpath. I will try it tomorrow.

          Jeremy Karn added a comment -

          The jar path is being used to figure out where the appropriate python files are on the cluster nodes. We need to know where controller.py is to start the python controller process and then we need to pass in the directory of controller.py and pig_util.py (the PATH_TO_FILE_CACHE option) so that the controller can add that directory to the python path.
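           (For illustration, assuming PATH_TO_FILE_CACHE arrives as a command-line argument, the controller-side handling could be as small as this hypothetical sketch:)

               import sys

               # Make the shipped user scripts and pig_util.py importable
               # by the controller process.
               file_cache_dir = sys.argv[1]  # PATH_TO_FILE_CACHE
               if file_cache_dir not in sys.path:
                   sys.path.insert(0, file_cache_dir)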

          Daniel Dai added a comment -

           PIG-2417-9-1.patch includes an ugly fix to make Hadoop 2 work. Still needs some refinement.

          Daniel Dai made changes -
          Attachment PIG-2417-9-1.patch [ 12604153 ]
          Daniel Dai made changes -
          Attachment PIG-2417-9-1.patch [ 12604153 ]
          Daniel Dai made changes -
          Attachment PIG-2417-9-1.patch [ 12604156 ]
          Jeremy Karn made changes -
          Attachment PIG-2417-9-2.patch [ 12604176 ]
          Jeremy Karn added a comment -

          Thanks Daniel.

          I also just attached PIG-2417-9-2.patch which updates the tests to use the serialize/deserialize methods created in PIG-3255 instead of the deprecated versions.

          Daniel Dai added a comment -

           There is one more issue on Hadoop 2: job.jar does not get unjarred before launching map/reduce, so controller.py cannot find the UDF script. It seems we need one more step to unjar the script files before invoking controller.py.
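           (A sketch of the extra unjar step being described, assuming only the shipped .py files are needed and relying on the fact that a jar is a zip archive; the names here are hypothetical:)

               import zipfile

               def extract_scripts(job_jar_path, dest_dir):
                   # Unpack just the Python files from job.jar so that
                   # controller.py can find the user's UDF script.
                   with zipfile.ZipFile(job_jar_path) as jar:
                       for name in jar.namelist():
                           if name.endswith('.py'):
                               jar.extract(name, dest_dir)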

           I'd like to commit this patch before we branch 0.12. There are still several holes in getting streaming UDFs to work under Hadoop 2; I would suggest committing the patch first, marking the e2e tests as not valid on Hadoop 2, then fixing them after the branch. Thoughts?

          Jeremy Karn added a comment -

          I agree.

          If we get this committed and open a new jira for the hadoop2 problems, that'll give me a bit of time to set up a hadoop2 cluster and work out any kinks.

          Daniel Dai added a comment -

           Committed PIG-2417-9.patch + PIG-2417-9-2.patch. Also disabled StreamingUDF e2e tests and unit tests for Hadoop 2. Opened PIG-3478 to fix Streaming UDF for Hadoop 2.

          Daniel Dai made changes -
          Status: Patch Available → Resolved
          Hadoop Flags: Reviewed
          Assignee: Jeremy Karn
          Resolution: Fixed
          Jeremy Karn added a comment -

          Thanks for all of your help Daniel!

          Rohini Palaniswamy added a comment -

          I see compilation fails with 4 similar errors in TestStreamingUDF.java when running it on Linux. Works fine on Mac.

          test/org/apache/pig/impl/builtin/TestStreamingUDF.java:287: error: unmappable character for encoding UTF8

          From javac documentation:
          -encoding encoding
          Set the source file encoding name, such as EUC-JP and UTF-8. If -encoding is not specified, the platform default converter is used.

           Not sure what the platform defaults exactly are on Mac, since inside a Java program file.encoding and Charset.defaultCharset() are UTF-8. Either we should specify -encoding in the ant javac invocation or fix the test to use \uXXXX escapes.

          Daniel Dai,
          Does it compile fine on Windows?

          Daniel Dai added a comment -

           I compiled on both my RHEL6 box and Windows; it seems fine for me. We can change the javadoc anyway to fix the issue.

          Jeremy Karn made changes -
          Attachment PIG-2417-unicode.patch [ 12605089 ]
          Jeremy Karn added a comment -

          I can't reproduce the problem but I think PIG-2417-unicode.patch should fix the encoding issue.

          Rohini Palaniswamy added a comment -

          +1 for PIG-2417-unicode.patch. That worked. Committed to 0.12 and trunk. Thanks Jeremy.

          Jeremy Karn made changes -
          Attachment PIG-3478.patch [ 12607596 ]
          Jeremy Karn made changes -
          Attachment PIG-3478.patch [ 12607596 ]
          Daniel Dai made changes -
          Status: Resolved → Closed
          Russell Jurney added a comment -

           What work remains to get this working on YARN? I need it to work.


             People

             • Assignee: Jeremy Karn
             • Reporter: Jeremy Karn
             • Votes: 5
             • Watchers: 9
