I think your feedback raises some fair questions, but I have some reasons for disagreeing:
wrt schema has fewer fields than actual:
this is a common case for me b/c pig doesn't allow specification of a *-tuple, i.e. all rows of data will have the same number of (int) elements, but it's not known how many. This is a weak area of pig in general imho. If there is only 1 element in the tuple schema, it can be taken as type information for the rest of the data. (at least i think this is how 'tuple(int)' shows up). When there is more than 1 column in a tuple schema, it's not a generic tuple, and i can see an error being appropriate. but for 1-tuples i appreciate the flexibility of using it for the type and writing udfs that accept tuples of arbitrary dimension. even if the args-to-function stuff is too simplistic to apply in this scenario, it's easy enough to write useful udfs that utilize tuple dimension flexibility.
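To make the "tuples of arbitrary dimension" point concrete, here is a minimal sketch in plain Python. The name `tuple_sum` is made up for illustration, and the schema decorator Pig supplies to jython udfs is left out so the snippet runs on its own; it only assumes the udf receives the whole tuple as one argument.

```python
def tuple_sum(values):
    """Hypothetical udf body: sum an all-int tuple of any length.

    The schema would only say 'tuple(int)'; the actual number of
    fields is discovered at run time from the data itself.
    """
    if values is None:
        return None
    total = 0
    for v in values:
        total += v
    return total
```

The same function handles a 1-tuple and a 5-tuple, which is the flexibility a 1-field schema allows if it is read as "a tuple of ints".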
I am against the idea of returning null and WARN (nearly as a rule). I think a reasonable interpretation is always better than NULL (with WARN). The only alternative i would advocate for is an actual error that forces a user to rectify their code. This may be where reasonable people disagree, but i think getting null, rather than a tuple reflecting the returned data, is the less expected outcome.
The whole reporting of 'schema != data' could be improved tho. I am not sure of the best way to reflect that anything "grey/WARN" is happening. Logging 1 line per encountered edge case seems like major overkill, and prone to generating huge log output. We could count each WARN scenario and log that information (or expose it as counters) to give a more succinct description of execution behavior that a simple user can fix, and an advanced user can judiciously ignore. Possibly more specific counters and only 1 warn per type per execution.
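A rough sketch of the "count, and warn once per type" idea, in plain Python with the standard `logging` module. This is not from the patch: `WarnTally`, the logger name, and the warning kinds are all hypothetical, and a real version would presumably feed whatever counter mechanism the execution engine provides.

```python
import logging
from collections import Counter


class WarnTally(object):
    """Hypothetical helper: count every edge case, log only the first of each kind."""

    def __init__(self, logger=None):
        self.counts = Counter()
        # Logger name is an assumption chosen for this example.
        self.logger = logger or logging.getLogger("udf.coercion")

    def warn(self, kind, message):
        self.counts[kind] += 1
        if self.counts[kind] == 1:
            # One log line per kind per execution; later hits are only counted.
            self.logger.warning("%s: %s (further occurrences only counted)", kind, message)

    def summary(self):
        """Return {kind: occurrences} for an end-of-run report."""
        return dict(self.counts)
```

At the end of a run, `summary()` gives one number per kind of mismatch, which is the succinct description a user could act on or ignore.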
Pig schemas are often so.. imprecise, that i think best effort coercion is useful. A fine compromise would be to officially support a specific set of conversions (a subset of this patch) and still perform the others b/c they are mostly intuitive and useful, generating a WARN at execution time for any we think are too esoteric. We may draw lines in slightly different places, but i tried to cover a fair number of cases in the test code, which i think is a fair survey of expected coercions.
wrt JU.asBag, i think auto-tupling is a must. This is one of the most common mistakes for jython udf devs. "why must i wrap tokens inside of tuples" is a very common refrain, and just silly 99.9% of the time. Plus it's a bunch of extra unnecessary objects that one must create, and causes a bit slower execution for simple udfs.
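To illustrate what i mean by auto-tupling, a minimal sketch in plain Python. It is not the JU.asBag code from the patch: `as_bag` is a hypothetical name, a list stands in for a pig bag, and the embedded-bag disambiguation is not attempted here.

```python
def as_bag(items):
    """Hypothetical sketch: wrap each bare element in a 1-tuple.

    Elements that are already tuples are passed through unchanged,
    so udf authors can return plain tokens without wrapping them.
    """
    return [item if isinstance(item, tuple) else (item,) for item in items]
```

With this, a udf can return `["a", "b"]` and get the same bag as one returning `[("a",), ("b",)]`, without building the wrapper tuples by hand.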
I'd have to re-read the code again to examine the edge cases. I do recall the disambiguation for embedded bags being a pain to write and describe. Documentation being the remaining concern. That said, i think it does something reasonable and still executes faster than existing rigid code. Also in the code is a decent synopsis of the disambiguations that are intended.
wrt skipping nulls: can you cite the line number? do you mean skipping null bags? or null element/tuples when creating a bag? This might just be me not understanding something properly. I thought bags didn't have null tuples, just tuples with null elements?
wrt various types:
jython is fully capable of returning any jvm type. so that means anything really.
I decided to cover the collections classes, lang classes, base types, and PY* classes.
Jython is nice in that many classes implement the collections ifaces, but not always as efficiently as using the python classes directly.
this is common in python/jython of course. not in udfs as of yet... b/c it wasn't allowed. But i began doing it pretty quickly once it was possible.