Thanks Robert. Are we going to do this after
I know this isn't part of this issue, but the whitelist and blacklist as constants seem a little problematic. From a deployment and maintenance perspective, letting operators configure them (mechanism, not policy), and warning on some calls rather than outright blocking them, seems appropriate. If one thing we want to let people do is leverage existing code inside UDFs, then we don't want to be too inflexible. Definitely not something to do as part of this issue, but I am broaching the subject.
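To make the mechanism-not-policy idea concrete, here is a rough sketch of loading the blocked list from a system property with built-in defaults instead of a hardcoded constant. The property name and the default entries are hypothetical, not the actual project lists:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: the blocked-package list comes from a system property,
// falling back to built-in defaults, instead of a compile-time constant.
public final class UDFFilterConfig {
    // Illustrative defaults only, not the real whitelist/blacklist contents.
    private static final String DEFAULTS = "java.io,java.net,java.nio.file";

    static Set<String> blockedPackages() {
        String prop = System.getProperty("udf.blocked.packages", DEFAULTS);
        return new HashSet<>(Arrays.asList(prop.split(",")));
    }

    public static void main(String[] args) {
        // With no property set, the defaults apply.
        System.out.println(blockedPackages().contains("java.net"));
    }
}
```

The same shape works for a warn-only list alongside the hard-block list, so operators can relax or tighten the policy per deployment without a rebuild.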
Do we allow UDFs in writes? I read the blog post, and it seems you can mark UDFs as deterministic or non-deterministic. Part of paving the path for determinism is disallowing currentTimeMillis() and nanoTime(); if a UDF needs the time, it should be passed in as a parameter when they invoke the query. The same could be said for random number generation. For deterministic UDFs you might be much stricter, or have different warning/error policies for calling different functions. Doing DNS resolution from a UDF isn't technically wrong if they have good caching and timeouts in place (or we provide that for them).
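To illustrate the determinism point with a plain-Java sketch (not the actual UDF API, and the function names are made up): instead of the UDF body reading the clock, the caller supplies the timestamp, so every evaluation of the function with the same arguments gives the same answer.

```java
public final class TtlExample {
    // Non-deterministic: calling this twice can give different answers,
    // which breaks repeatable reads and agreement between replicas.
    static boolean expiredBad(long createdAtMillis, long ttlMillis) {
        return System.currentTimeMillis() - createdAtMillis > ttlMillis;
    }

    // Deterministic: "now" is passed in as a query parameter, so the
    // function is pure and every node evaluating it gets the same result.
    static boolean expired(long nowMillis, long createdAtMillis, long ttlMillis) {
        return nowMillis - createdAtMillis > ttlMillis;
    }

    public static void main(String[] args) {
        // 10000 - 1000 = 9000 elapsed, TTL 5000 -> expired
        System.out.println(expired(10_000L, 1_000L, 5_000L));
    }
}
```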
For reads, do UDFs run only at the coordinator, or remotely at the replicas before results are returned? I suppose it doesn't really matter, since the pain is the same whenever nodes end up with different versions or whitelist/blacklist configurations.
Checking metrics every 16 iterations is a little too often for most loops. Maybe make that a property? The check is not cheap: it represents at least a hundred nanoseconds of work, possibly more. How often will people actually have loops to iterate through in UDFs? I imagine if they tear apart a collection or a JSON doc it will be pretty heavyweight stuff.
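What a configurable interval could look like: read it from a property once, require a power of two, and use a mask so the common path is one increment and one branch. The property name and default are made up for the sketch:

```java
public final class LoopGuard {
    // Hypothetical property; the interval must be a power of two
    // for the mask trick below to work.
    private static final int INTERVAL =
            Integer.getInteger("udf.metrics.check.interval", 1024);
    private static final int MASK = INTERVAL - 1;

    private int counter;
    int checksPerformed; // exposed so the demo below can observe it

    // Called from instrumented loop back-edges; the expensive metrics
    // check only fires every INTERVAL iterations.
    void onLoopIteration() {
        if ((++counter & MASK) == 0) {
            checksPerformed++; // stand-in for the real metrics/quota check
        }
    }

    public static void main(String[] args) {
        LoopGuard g = new LoopGuard();
        for (int i = 0; i < 4096; i++) {
            g.onLoopIteration();
        }
        // With the default interval of 1024: 4096 / 1024 = 4 checks
        System.out.println(g.checksPerformed);
    }
}
```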
This isn't just verifying anymore, it's verifyAndInstrument.
I am not completely familiar with what the compiler does when emitting labels for bytecode. Does it have a convention of inserting them in a bunch of places? Inserting a check at every label seems a bit excessive, but it's just performance, so rather than guess at how it works let's measure it in a meaningful way. Do we have a benchmark workload we could run in cstar that would test UDF performance? Maybe one for a lightweight UDF and another for the heaviest-weight UDF we think we will come across? For the lightweight UDF we may want to test an expression that invokes several UDFs per query, so that it magnifies the fixed cost of starting a UDF.
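Short of a proper cstar run, a quick way to eyeball the per-iteration cost is to time the same loop with and without an every-16-iterations check. This is only a rough sketch; a real comparison should use a harness like JMH to avoid JIT pitfalls, and nothing here asserts specific timings:

```java
public final class CheckOverhead {
    static volatile int sink; // keeps the JIT from eliminating the loop body

    static long timeLoop(int iterations, boolean withCheck) {
        int counter = 0;
        int checks = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink = i; // lightweight stand-in for "UDF work"
            if (withCheck && (++counter & 15) == 0) {
                checks++; // stand-in for the instrumented metrics check
            }
        }
        sink = checks;
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 10_000_000;
        // Warm up both paths before timing.
        timeLoop(n, false);
        timeLoop(n, true);
        System.out.printf("plain: %d ns, checked: %d ns%n",
                timeLoop(n, false), timeLoop(n, true));
    }
}
```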
This is just my first-pass reaction. I need to read up on the libraries you are using for bytecode manipulation and on how labels work.