Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: 3.11.x
    • Component/s: None
    • Labels:
      None

      Description

      CASSANDRA-9402 introduced a sandbox using a thread-pool to enforce security constraints and to detect "amok UDFs" - i.e. UDFs that essentially never return (e.g. while (true).

      Currently the safest way to react on such an "amok UDF" is to fail-fast - to stop the C* daemon since stopping a thread (in Java) is just no solution.

      CASSANDRA-9890 introduced further protection by inspecting the byte-code. The same mechanism can also be used to manipulate the Java-UDF byte-code.

      By manipulating the byte-code I mean to add regular "is-amok-UDF" checks in the compiled code.

      EDIT: These "is-amok-UDF" checks would also work for UNFENCED Java-UDFs.

        Issue Links

          Activity

          Hide
          snazy Robert Stupp added a comment -

          Rebased the patch and triggered CI:

          trunk branch testall dtest
          Show
          snazy Robert Stupp added a comment - Rebased the patch and triggered CI: trunk branch testall dtest
          Hide
          snazy Robert Stupp added a comment -

          Cancelling patch for the moment - needs a thorough rebase.

          Show
          snazy Robert Stupp added a comment - Cancelling patch for the moment - needs a thorough rebase.
          Hide
          snazy Robert Stupp added a comment -

          Rebased the branch and triggered CI (links in the first comment)

          Show
          snazy Robert Stupp added a comment - Rebased the branch and triggered CI (links in the first comment)
          Hide
          snazy Robert Stupp added a comment -

          Updated the branch with the following changes:

          • rename "timeout" to "cpu time" to make clear that the values relate to CPU time and not wall clock
          • let Java UDFs run in the calling thread and updated security-manager accordingly
          • byte code inspection now checks class names against general UDF black/white lists
          • refactorings to (hopefully) simplify code

          I think the changes to Java UDFs are fine WRT class loading and security manager.

          Still unsure about timeouts related wall-clock and (remaining) request-time. In order to not issue false-positive warnings and errors caused by "badly timed GCs" or an overloaded system, the timeouts need to be quite high.

          Triggered cassci runs, too.

          PS: while fixing the branches for another ticket, I ... up the branch for this ticket (both local and github) and the only backup was in IntelliJ's history. Sorry, that the branch now appears squashed.

          Show
          snazy Robert Stupp added a comment - Updated the branch with the following changes: rename "timeout" to "cpu time" to make clear that the values relate to CPU time and not wall clock let Java UDFs run in the calling thread and updated security-manager accordingly byte code inspection now checks class names against general UDF black/white lists refactorings to (hopefully) simplify code I think the changes to Java UDFs are fine WRT class loading and security manager. Still unsure about timeouts related wall-clock and (remaining) request-time. In order to not issue false-positive warnings and errors caused by "badly timed GCs" or an overloaded system, the timeouts need to be quite high. Triggered cassci runs, too. PS: while fixing the branches for another ticket, I ... up the branch for this ticket (both local and github) and the only backup was in IntelliJ's history. Sorry, that the branch now appears squashed.
          Hide
          snazy Robert Stupp added a comment -

          Will also think about your suggestion to also use the wall clock for timeout handling and maybe also consider the (remaining) request-timeout, too.

          Show
          snazy Robert Stupp added a comment - Will also think about your suggestion to also use the wall clock for timeout handling and maybe also consider the (remaining) request-timeout, too.
          Hide
          snazy Robert Stupp added a comment -

          Are we going to do this after CASSANDRA-10395?

          Yes, that would be the order. But I’d like to prepare both together since 10395 removes “client timeout warning” functionality that this ticket can bring back. In other words: make 10395 RTC and base this on 10395. WDYT?

          Do we allow UDFs in writes?

          Yes.

          can mark the UDFs as deterministic/non-deterministic

          No - that’s been removed after that blog post. We might need to bring that back with functional indexes (CASSANDRA-7458) as non-deterministic functions would be bad for that.

          You’re probably right

          reads do UDFs only run at the coordinator?

          Yes.

          Checking metrics every 16 times is a little bit too often for most loop iterations. Maybe make that a property?

          Good point. Added cassandra.java_udf_check_interval for that - now defaults to 1000 instead of 16.

          Renamed JavaUDFByteCodeVerifier.verify to verifyAndInstrument.

          Labels are inserted nearly everywhere and are required for loops, conditional branches and all that stuff.
          You’re right - we should measure the performance impact at least in a micro benchmark. I think all we need to know is the performance impact for the UDF itself as that’s basically just added. WDYT?

          Rebased the branch and pushed it.

          Show
          snazy Robert Stupp added a comment - Are we going to do this after CASSANDRA-10395 ? Yes, that would be the order. But I’d like to prepare both together since 10395 removes “client timeout warning” functionality that this ticket can bring back. In other words: make 10395 RTC and base this on 10395. WDYT? Do we allow UDFs in writes? Yes. can mark the UDFs as deterministic/non-deterministic No - that’s been removed after that blog post. We might need to bring that back with functional indexes ( CASSANDRA-7458 ) as non-deterministic functions would be bad for that. You’re probably right reads do UDFs only run at the coordinator? Yes. Checking metrics every 16 times is a little bit too often for most loop iterations. Maybe make that a property? Good point. Added cassandra.java_udf_check_interval for that - now defaults to 1000 instead of 16. Renamed JavaUDFByteCodeVerifier.verify to verifyAndInstrument . Labels are inserted nearly everywhere and are required for loops, conditional branches and all that stuff. You’re right - we should measure the performance impact at least in a micro benchmark. I think all we need to know is the performance impact for the UDF itself as that’s basically just added. WDYT? Rebased the branch and pushed it.
          Hide
          aweisberg Ariel Weisberg added a comment -

          Thanks Robert. Are we going to do this after CASSANDRA-10395?

          I know this isn't part of this issue, but the whitelist and blacklist as constants seem a little problematic. Just from a deployment and maintenance perspective allowing people to manipulate them (mechanism not policy) as well as warning for some things rather then straight up blocking them seems appropriate. If one thing we want to let people do is leverage existing code inside UDFs then we don't want to be too inflexible. Definitely not something to do as part of this, but I am broaching the subject.

          Do we allow UDFs in writes? I read the blog post and it seems like you can mark the UDFs as deterministic/non-deterministic. Part of paving the path for determinism is disallowing currentTimeMillis() and nanoTime(). If they want time they should pass them to the UDF as a parameter when the invoke the query. The same could be said for random number generation. For deterministic UDFs you might be much more strict or have different warning/error policies for calling different functions. Doing DNS resolution from a UDF isn't technically wrong if they have good caching and timeouts in place (or we provide that for them).

          For reads do UDFs only run at the coordinator or remotely at replicas before results are returned? I suppose it doesn't really matter since the pain when versions or configurations have different whitelist/blacklist settings is the same.

          Checking metrics every 16 times is a little bit too often for most loop iterations. Maybe make that a property? The check is not cheap and represents at least a hundred nanoseconds of work possibly more. How often will people actually have loops to iterate through in UDFs? I imagine if they tear apart a collection or a JSON doc it will be pretty heavyweight stuff.

          This isn't just verifying anymore, it's verifyAndInstrument.

          I am not completely familiar with what the compiler does when emitting the labels for bytecode. Does it have a convention to insert in a bunch of places? Inserting a check at all the labels seems a bit excessive, but it's just performance so rather then guess as to how it works let's just measure the performance in a meaningful way. Do we have a benchmark workload we could run in cstar that would test UDF performance? Maybe one for a lightweight UDF and another for the heaviest weight UDF we think we will come across? For the lightweight UDF we may want to test an expression that invokes several UDFs per query so that it magnifies the transaction cost of starting a UDF.

          This is just my first pass reaction. I need to read up on the libraries you are using to do byte code manipulation and how labels work.

          Show
          aweisberg Ariel Weisberg added a comment - Thanks Robert. Are we going to do this after CASSANDRA-10395 ? I know this isn't part of this issue, but the whitelist and blacklist as constants seem a little problematic. Just from a deployment and maintenance perspective allowing people to manipulate them (mechanism not policy) as well as warning for some things rather then straight up blocking them seems appropriate. If one thing we want to let people do is leverage existing code inside UDFs then we don't want to be too inflexible. Definitely not something to do as part of this, but I am broaching the subject. Do we allow UDFs in writes? I read the blog post and it seems like you can mark the UDFs as deterministic/non-deterministic. Part of paving the path for determinism is disallowing currentTimeMillis() and nanoTime(). If they want time they should pass them to the UDF as a parameter when the invoke the query. The same could be said for random number generation. For deterministic UDFs you might be much more strict or have different warning/error policies for calling different functions. Doing DNS resolution from a UDF isn't technically wrong if they have good caching and timeouts in place (or we provide that for them). For reads do UDFs only run at the coordinator or remotely at replicas before results are returned? I suppose it doesn't really matter since the pain when versions or configurations have different whitelist/blacklist settings is the same. Checking metrics every 16 times is a little bit too often for most loop iterations. Maybe make that a property? The check is not cheap and represents at least a hundred nanoseconds of work possibly more. How often will people actually have loops to iterate through in UDFs? I imagine if they tear apart a collection or a JSON doc it will be pretty heavyweight stuff. This isn't just verifying anymore, it's verifyAndInstrument. I am not completely familiar with what the compiler does when emitting the labels for bytecode. Does it have a convention to insert in a bunch of places? Inserting a check at all the labels seems a bit excessive, but it's just performance so rather then guess as to how it works let's just measure the performance in a meaningful way. Do we have a benchmark workload we could run in cstar that would test UDF performance? Maybe one for a lightweight UDF and another for the heaviest weight UDF we think we will come across? For the lightweight UDF we may want to test an expression that invokes several UDFs per query so that it magnifies the transaction cost of starting a UDF. This is just my first pass reaction. I need to read up on the libraries you are using to do byte code manipulation and how labels work.
          Hide
          snazy Robert Stupp added a comment -

          Rebased the branch and asked cassci for CI results.

          Show
          snazy Robert Stupp added a comment - Rebased the branch and asked cassci for CI results .
          Hide
          aweisberg Ariel Weisberg added a comment -

          The links don't work and I don't see a branch with the ticket name in your github fork?

          Show
          aweisberg Ariel Weisberg added a comment - The links don't work and I don't see a branch with the ticket name in your github fork?
          Hide
          snazy Robert Stupp added a comment -

          Background information (byte-code level).
          All loops, branches, cases, jumps, gotos etc use labels in the byte code. The trick that this patch uses is to inject a check-call after each label.
          The check itself is quite cheap.

          Example "amok UDF code":

                  while (true)
                  {
                      if (System.currentTimeMillis() % 12345 == 99999)
                          break;
                  }
                  return null;
          

          This would compile to this byte-code (FRAME and LINENUMBER byte-code omitted for readability):

            L0
              INVOKESTATIC java/lang/System.currentTimeMillis ()J
              LDC 12345
              LREM
              LDC 99999
              LCMP
              IFNE L0
             L1
              GOTO L2
             L2
              ACONST_NULL
              ARETURN
             L3
          ...
          

          The patch injects the following byte-code after each label:

              INVOKESTATIC org/apache/cassandra/cql3/functions/JavaUDF.checkTimeout ()Z
              IFEQ continueLabel
              ACONST_NULL
              ARETURN
          

          resulting in:

            L0
              INVOKESTATIC org/apache/cassandra/cql3/functions/JavaUDF.checkTimeout ()Z
              IFEQ LCONTINUE1
              ACONST_NULL
              ARETURN
            LCONTINUE1
              INVOKESTATIC java/lang/System.currentTimeMillis ()J
              LDC 12345
              LREM
              LDC 99999
              LCMP
              IFNE L0
             L1
              INVOKESTATIC org/apache/cassandra/cql3/functions/JavaUDF.checkTimeout ()Z
              IFEQ LCONTINUE2
              ACONST_NULL
              ARETURN
            LCONTINUE2
              GOTO L2
             L2
              INVOKESTATIC org/apache/cassandra/cql3/functions/JavaUDF.checkTimeout ()Z
              IFEQ LCONTINUE2
              ACONST_NULL
              ARETURN
            LCONTINUE2
              ACONST_NULL
              ARETURN
             L3
          ...
          
          Show
          snazy Robert Stupp added a comment - Background information (byte-code level). All loops, branches, cases, jumps, gotos etc use labels in the byte code. The trick that this patch uses is to inject a check-call after each label. The check itself is quite cheap. Example "amok UDF code": while ( true ) { if ( System .currentTimeMillis() % 12345 == 99999) break ; } return null ; This would compile to this byte-code ( FRAME and LINENUMBER byte-code omitted for readability): L0 INVOKESTATIC java/lang/System.currentTimeMillis ()J LDC 12345 LREM LDC 99999 LCMP IFNE L0 L1 GOTO L2 L2 ACONST_NULL ARETURN L3 ... The patch injects the following byte-code after each label: INVOKESTATIC org/apache/cassandra/cql3/functions/JavaUDF.checkTimeout ()Z IFEQ continueLabel ACONST_NULL ARETURN resulting in: L0 INVOKESTATIC org/apache/cassandra/cql3/functions/JavaUDF.checkTimeout ()Z IFEQ LCONTINUE1 ACONST_NULL ARETURN LCONTINUE1 INVOKESTATIC java/lang/System.currentTimeMillis ()J LDC 12345 LREM LDC 99999 LCMP IFNE L0 L1 INVOKESTATIC org/apache/cassandra/cql3/functions/JavaUDF.checkTimeout ()Z IFEQ LCONTINUE2 ACONST_NULL ARETURN LCONTINUE2 GOTO L2 L2 INVOKESTATIC org/apache/cassandra/cql3/functions/JavaUDF.checkTimeout ()Z IFEQ LCONTINUE2 ACONST_NULL ARETURN LCONTINUE2 ACONST_NULL ARETURN L3 ...
          Hide
          snazy Robert Stupp added a comment -

          Cassci links (should appear soon):
          testall
          dtest

          Show
          snazy Robert Stupp added a comment - Cassci links (should appear soon): testall dtest

            People

            • Assignee:
              snazy Robert Stupp
              Reporter:
              snazy Robert Stupp
              Reviewer:
              Tyler Hobbs
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:

                Development