Ok, some fresh thoughts rolling in after sleeping on this.
Why do we have this foreach in the first place? It's inserted to achieve the following goals:
- pad nulls (in PIG-2824, Jie saw perf problems from that, and I suggested we get rid of the foreach altogether, getting POLoad to do the null padding instead).
- coerce tuples generated by the loader into schemas specified in the "load as.." statement
- drop unneeded columns
(please let me know if this list is incomplete)
For padding nulls, I believe we can achieve the same effect much more cheaply, and without the side effect that's biting us here, by making basic modifications to POLoad.
For coercing into schemas, we can do the same thing – copy all the fields from the incoming tuple (including excess ones), and only convert the ones we know something about. This can also be done directly in POLoad, and only be triggered if the loader doesn't already tell us what the schema is it's returning, or the schemas don't match type-wise.
This leaves dropping columns. Since in that case the whole point is to not carry along unwanted columns, this use case is clearly in conflict with the way the PoissonSampleLoader wants to work, by inserting extra columns and sneaking them through to the UDF linked to it. Moreover, if we go the route of putting the plan between load and skewed join between the sample loader and the GetMemNumRows UDF, other things may also break the sampling – for example, filters that happen to filter out the specially marked tuples, by accident. This is telling us that messing with the tuples PSL returns is problematic. What if instead we created a UDF that was fed all the tuples from a regular loader, with the rest of the pipeline that gets inserted, but was able to signal to its consumers when it's done – thus effectively recreating PoissonSampleLoader's functionality in addition to GetMemNumRows ? It would output sample tuples or nulls, and we can add a null filter right above it. I believe that gives us everything we are looking for and simplifies the pipeline a fair bit. We'd have to add capability for UDFs to early-terminate, of course. That's already been done for Accumulative UDFs in
PIG-2066 and I think should be straightforward to do for regular UDFs.