Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-5541

Enforce sane maximum for BATCH_SIZE

Details

    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • Impala 2.10.0
    • Impala 2.11.0
    • Backend
    • ghx-label-8

    Description

      We don't enforce maximums on BATCH_SIZE. Setting a very large value could cause strange behaviour like crashing, hanging, excessive memory usage, spinning etc.

      I think a limit like 65536 (64x the default) would be reasonable but it's worth discussing further.

      Attachments

        Activity

          tianyiwang Tianyi Wang added a comment - - edited

          mmokhtar Your work in IMPALA-2712 looks relevant. Do you have a reasonable limit in mind?

          tianyiwang Tianyi Wang added a comment - - edited mmokhtar Your work in IMPALA-2712 looks relevant. Do you have a reasonable limit in mind?
          tianyiwang Tianyi Wang added a comment - - edited

          Another option other than hardcoding the limit is that allowing setting the limit in impalad flags (e.g. --allowed_query_option_values=batch_size:0..1024). In this way we provide administrators with a method preventing users from crashing the Impala cluster with a bad query option. This kind of limit could be applied on a subset of integer query options.
          Currently each query option implements its own value checking logic. Cleaning up the SetQueryOption() function will make this easier to implement. There are 5 kinds of query options:
          1. Memory size, with or without an upper bound
          2. Integer, with or without a valid range
          3. Enum, may or may not be set with a integer
          4. Boolean.
          4. special cases.
          After cleaning this up, when an option in category 2 is set, we can check the limit specified by command line flags and reject the query if necessary.

          tianyiwang Tianyi Wang added a comment - - edited Another option other than hardcoding the limit is that allowing setting the limit in impalad flags (e.g. --allowed_query_option_values=batch_size:0..1024). In this way we provide administrators with a method preventing users from crashing the Impala cluster with a bad query option. This kind of limit could be applied on a subset of integer query options. Currently each query option implements its own value checking logic. Cleaning up the SetQueryOption() function will make this easier to implement. There are 5 kinds of query options: 1. Memory size, with or without an upper bound 2. Integer, with or without a valid range 3. Enum, may or may not be set with a integer 4. Boolean. 4. special cases. After cleaning this up, when an option in category 2 is set, we can check the limit specified by command line flags and reject the query if necessary.
          tarmstrong Tim Armstrong added a comment -

          tianyiwang A couple of clarifying questions.

          1. Why wouldn't we support limits on memory size similar to numeric values?
          2. Could we extend this to support the enum values and special cases? E.g. by allowing a list of accepted values? Maybe this would be a follow-on extension.
          tarmstrong Tim Armstrong added a comment - tianyiwang A couple of clarifying questions. Why wouldn't we support limits on memory size similar to numeric values? Could we extend this to support the enum values and special cases? E.g. by allowing a list of accepted values? Maybe this would be a follow-on extension.
          tianyiwang Tianyi Wang added a comment -

          tarmstrong
          These all make sense, but there should be some more semantics choices made. And it might break things if users rely on current behaviour:
          1. Currently a memory value -1 or 0 means infinite memory.
          1) For the purpose of cleaning up, do we really need two special values?
          2) For range limit specification, using 0 as infinite does not make sense. If we use some special token, e.g. "INF", and treat 0 as 0 literally, there would be inconsistency.
          2. For enums, there are problems in both cleanup and adding limits. Some enums accept integer input as well, while others don't. We need to discuss whether to unify them and how this work with range and list restrictions.
          3. I forgot boolean options on the list above. I've added it to the list.

          tianyiwang Tianyi Wang added a comment - tarmstrong These all make sense, but there should be some more semantics choices made. And it might break things if users rely on current behaviour: 1. Currently a memory value -1 or 0 means infinite memory. 1) For the purpose of cleaning up, do we really need two special values? 2) For range limit specification, using 0 as infinite does not make sense. If we use some special token, e.g. "INF", and treat 0 as 0 literally, there would be inconsistency. 2. For enums, there are problems in both cleanup and adding limits. Some enums accept integer input as well, while others don't. We need to discuss whether to unify them and how this work with range and list restrictions. 3. I forgot boolean options on the list above. I've added it to the list.
          tarmstrong Tim Armstrong added a comment -

          For the memory case allowing a range with no special handling of 0/-1 is sufficient to express most policies that people would want to use. I think the two common use case x..y where x > 0 to specify a range bounded on both ends and 1..y to express an upper bound. You'd need special handling to express x..INF and it make the behaviour of 0..x a bit weird, but the behaviour is simple at least.

          I don't think we need two special values to express "infinite memory' but I think cleaning it up would be a breaking change. IMO -1 should be the special value because 0 is sometimes a reasonable value, e.g. for scratch_limit. However a lot of older query options limit mem_limit used "0" and have that documented as the way to specify "unlimited".

          I think that's a good point about enums since there's multiple ways to specify the same value. I don't see an immediate need for that so maybe it's ok to ignore that case for now.

          tarmstrong Tim Armstrong added a comment - For the memory case allowing a range with no special handling of 0/-1 is sufficient to express most policies that people would want to use. I think the two common use case x..y where x > 0 to specify a range bounded on both ends and 1..y to express an upper bound. You'd need special handling to express x..INF and it make the behaviour of 0..x a bit weird, but the behaviour is simple at least. I don't think we need two special values to express "infinite memory' but I think cleaning it up would be a breaking change. IMO -1 should be the special value because 0 is sometimes a reasonable value, e.g. for scratch_limit. However a lot of older query options limit mem_limit used "0" and have that documented as the way to specify "unlimited". I think that's a good point about enums since there's multiple ways to specify the same value. I don't see an immediate need for that so maybe it's ok to ignore that case for now.
          tianyiwang Tianyi Wang added a comment -

          IMPALA-5541: Reject BATCH_SIZE greater than 65536

          Setting a very large value could cause strange behaviour like crashing,
          hanging, excessive memory usage, spinning etc. This patch rejects values
          out of the range [0,65536].

          Change-Id: Idd5a2490a73b6915224160d7604b4badc72c1d97
          Reviewed-on: http://gerrit.cloudera.org:8080/8419
          Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
          Tested-by: Impala Public Jenkins

          tianyiwang Tianyi Wang added a comment - IMPALA-5541 : Reject BATCH_SIZE greater than 65536 Setting a very large value could cause strange behaviour like crashing, hanging, excessive memory usage, spinning etc. This patch rejects values out of the range [0,65536] . Change-Id: Idd5a2490a73b6915224160d7604b4badc72c1d97 Reviewed-on: http://gerrit.cloudera.org:8080/8419 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins
          jrussell John Russell added a comment -

          Is it possible to actually set a limit of 0? We document that 0 represents the default value of 1024:

          https://impala.incubator.apache.org/docs/build/html/topics/impala_batch_size.html

          jrussell John Russell added a comment - Is it possible to actually set a limit of 0? We document that 0 represents the default value of 1024: https://impala.incubator.apache.org/docs/build/html/topics/impala_batch_size.html
          tarmstrong Tim Armstrong added a comment -

          jrussell yes that's correct, batch_size of 0 is an accepted value but it has a special meaning.

          tarmstrong Tim Armstrong added a comment - jrussell yes that's correct, batch_size of 0 is an accepted value but it has a special meaning.
          jrussell John Russell added a comment -

          I see in the code review, the number 65536 is there. Just checking that the very highest value does not cause some strangeness due to exceeding by 1 the max for a 16-bit unsigned int, i.e. it's 0xffff + 1. (I would have guessed the maximum of a "64K" range would effectively be 65535.)

          jrussell John Russell added a comment - I see in the code review, the number 65536 is there. Just checking that the very highest value does not cause some strangeness due to exceeding by 1 the max for a 16-bit unsigned int, i.e. it's 0xffff + 1. (I would have guessed the maximum of a "64K" range would effectively be 65535.)
          tianyiwang Tianyi Wang added a comment -

          johnruss Setting it to 65536 should work without problem.

          tianyiwang Tianyi Wang added a comment - johnruss Setting it to 65536 should work without problem.

          People

            tianyiwang Tianyi Wang
            tarmstrong Tim Armstrong
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: