I don't think there is any measurable overhead to the reflection mechanism in the example I provided. The objects are allocated "a few" times due to Pig's schema interrogation logic (something that might deserve an entirely separate bug thread of discussion, as I have no idea why X copies of a UDF have to be allocated for this).
When it comes time to run (i.e. where it really counts), there is a single invocation of the factory pattern followed by a "huge" (data-set-derived) number of calls to that function. The UDF that is called is fully built and fully initialized with final variables etc., facilitating maximally streamlined execution.
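The shape of that pattern can be sketched as follows. This is a minimal illustration with hypothetical names (`ScriptFuncFactory`, the captured `delim`), not Pig's actual classes: the factory runs once, and the object it returns is fully initialized with final state before the per-record call stream begins.

```java
import java.util.function.Function;

public class ScriptFuncFactory {
    // Language selection happens here, once; it is never consulted again
    // on the per-record hot path.
    static Function<String, String[]> create(String language) {
        switch (language) {
            case "jython":
            case "javascript": {
                // A real implementation would bind the script engine here;
                // this stand-in just captures a final value up front.
                final String delim = " ";
                return s -> s.split(delim);
            }
            default:
                throw new IllegalArgumentException("unsupported: " + language);
        }
    }

    public static void main(String[] args) {
        // One factory invocation...
        Function<String, String[]> f = create("jython");
        // ...followed by a large number of direct calls.
        String[] parts = f.apply("a b c");
        System.out.println(parts.length); // 3
    }
}
```

The point is that any reflection or selection cost is paid once at construction, not per record.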
There are certainly things that could be criticized about the approach I took, but language selection overhead is not one of them. If you have profiling numbers that suggest otherwise, I'd be suitably surprised.
A secondary point to the whole idea of needing some script language support beyond, say, BSF or javax.script is type coercion. BSF/javax.script is not usable in a drop-in manner. Each engine unfortunately consumes and produces objects in its own object model. If either of these frameworks had bothered to mandate converting input/output to java.util types, things would at least be easier, because we could convert from those to DataBag/Tuple in a unified manner, but this isn't the case. Thus conversion must be implemented per engine, at which point a direct PyArray -> Tuple conversion is more appropriate than PyArray -> List -> Tuple for performance reasons.
But even for rudimentary correctness, type conversion must be implemented for each engine, at which point a wrapping pattern that selects an appropriate function factory is a necessary pattern anyway.
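A rough sketch of that per-engine conversion point, with hypothetical names (`EngineConverter`, `JythonConverter` — not Pig APIs), assuming the engine's native array contents have already been unpacked into `Object[]`: each engine gets its own converter rather than funneling everything through an intermediate java.util copy.

```java
import java.util.Arrays;
import java.util.List;

// One converter per engine, since each engine has its own object model.
interface EngineConverter<T> {
    List<Object> toTupleFields(T engineValue); // stand-in for Tuple fields
}

// Direct PyArray -> Tuple analogue: one pass, no intermediate
// PyArray -> List -> Tuple copy.
class JythonConverter implements EngineConverter<Object[]> {
    public List<Object> toTupleFields(Object[] pyArrayContents) {
        return Arrays.asList(pyArrayContents);
    }
}

public class ConverterDemo {
    public static void main(String[] args) {
        EngineConverter<Object[]> c = new JythonConverter();
        List<Object> fields = c.toTupleFields(new Object[] {"a", "b"});
        System.out.println(fields.size()); // 2
    }
}
```

Selecting the right converter is exactly the kind of decision the wrapping factory pattern above is already making, which is why the per-engine cost folds into it naturally.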
Orthogonal to the above point is the idea of trying to support multiple script languages vs. a few. I am personally not of the same mind as you guys, I guess.
I would make the sacrifice if supporting multiple languages were actually that hard, or carried a serious performance cost.
I just don't think those two issues are real.
The performance costs come from the individual scripting engines' features with respect to bytecode compilation, function referencing, string manipulation, execution caching, etc., and their type coercion complexities.
That is completely different than the cost of PIG supporting multiple languages.
Supporting multiple languages is also not that hard. Arnab has thought about this, as have I. I think his ideas, while not perfect, offer a good avenue of exploration and a way forward that integrates PIG with any script language. Importantly, it puts those languages inside PIG instead of the other way around, and it allows for multiple interpreter contexts and even multiple languages.
I'll quote Arnab's quick description here:
This is identical to the command-line streaming syntax, and follows gracefully in the style of the "ship" and "cache" keywords.
DEFINE JSBlock `
function split(a) {
    return a.split(" ");
}
`
Note that the use of backticks is consistent with the current syntax and is unlikely to occur in common scripts, so it saves us the escaping. It also allows newlines in the code.
The goal is to create namespaces – you can now call your function as "JSBlock.split(a)". This allows us to have multiple functions in one block.
This idea, coupled with the ability to register files and directories directly (e.g. register foo.py), provides the ability to load code into an arbitrary namespace/interpreter scope, load it for an arbitrary language, etc.
And the invocation syntax is nice and clean: Block.foo() calls a method named foo in the interpreter.
For the easy invocation syntax to perform well, we would need it to execute the same way as:
define spig_split org.apache.pig.scripting.Eval('jython','split','b:
I don't see that as a particularly difficult modification of Pig's function rationalization logic. Actually, I think it's a general improvement, as it cuts down on object allocations.
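To make that concrete, here is a minimal sketch (hypothetical names — `NamespaceResolver` is not a Pig class) of resolving a "Block.foo" name to a pre-bound function handle once at plan-construction time, as the `define` form above would, so the per-record path is a direct call rather than a per-call lookup:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class NamespaceResolver {
    // block name -> (function name -> implementation)
    private final Map<String, Map<String, Function<String, Object>>> blocks =
        new HashMap<>();

    void register(String block, String func, Function<String, Object> impl) {
        blocks.computeIfAbsent(block, k -> new HashMap<>()).put(func, impl);
    }

    // Resolved once during plan construction; the returned handle is then
    // invoked per record with no further lookup.
    Function<String, Object> resolve(String qualifiedName) {
        String[] parts = qualifiedName.split("\\.", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected Block.func: " + qualifiedName);
        }
        Function<String, Object> f =
            blocks.getOrDefault(parts[0], Map.of()).get(parts[1]);
        if (f == null) {
            throw new IllegalArgumentException("unknown: " + qualifiedName);
        }
        return f;
    }

    public static void main(String[] args) {
        NamespaceResolver r = new NamespaceResolver();
        // Stand-in for a function defined in a JSBlock-style namespace.
        r.register("JSBlock", "split", s -> s.split(" "));
        Object out = r.resolve("JSBlock.split").apply("a b");
        System.out.println(((String[]) out).length); // 2
    }
}
```

The one-time resolution is what lets the Block.foo() sugar cost nothing extra at execution time.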
In the event that this methodology is adopted, you are still free to write projects that stuff PIG inside Python or Ruby, etc. But PIG itself remains an environment that plays well with multiple script engines.
I see it as quite achievable to support any given language with near-zero overhead above the language's script engine.
I think it's quite doable to do this in a flexible model that allows languages to be mixed together, even within the same script.
I think that, overall, this is highly preferable to a single-language or otherwise finite-language situation (though I advocate possibly auto-supporting jython/jruby).