Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: Impala 2.10.0
    • Fix Version/s: None
    • Component/s: Backend
    • Labels:
    • Epic Color:
      ghx-label-8

      Description

      We don't enforce maximums on BATCH_SIZE. Setting a very large value could cause strange behaviour like crashing, hanging, excessive memory usage, spinning etc.

      I think a limit like 65536 (64x the default) would be reasonable but it's worth discussing further.

        Activity

        Hide
        tianyiwang Tianyi Wang added a comment - - edited

        Mostafa Mokhtar Your work in IMPALA-2712 looks relevant. Do you have a reasonable limit in mind?

        Show
        tianyiwang Tianyi Wang added a comment - - edited Mostafa Mokhtar Your work in IMPALA-2712 looks relevant. Do you have a reasonable limit in mind?
        Hide
        tianyiwang Tianyi Wang added a comment - - edited

        Another option other than hardcoding the limit is that allowing setting the limit in impalad flags (e.g. --allowed_query_option_values=batch_size:0..1024). In this way we provide administrators with a method preventing users from crashing the Impala cluster with a bad query option. This kind of limit could be applied on a subset of integer query options.
        Currently each query option implements its own value checking logic. Cleaning up the SetQueryOption() function will make this easier to implement. There are 5 kinds of query options:
        1. Memory size, with or without an upper bound
        2. Integer, with or without a valid range
        3. Enum, may or may not be set with a integer
        4. Boolean.
        4. special cases.
        After cleaning this up, when an option in category 2 is set, we can check the limit specified by command line flags and reject the query if necessary.

        Show
        tianyiwang Tianyi Wang added a comment - - edited Another option other than hardcoding the limit is that allowing setting the limit in impalad flags (e.g. --allowed_query_option_values=batch_size:0..1024). In this way we provide administrators with a method preventing users from crashing the Impala cluster with a bad query option. This kind of limit could be applied on a subset of integer query options. Currently each query option implements its own value checking logic. Cleaning up the SetQueryOption() function will make this easier to implement. There are 5 kinds of query options: 1. Memory size, with or without an upper bound 2. Integer, with or without a valid range 3. Enum, may or may not be set with a integer 4. Boolean. 4. special cases. After cleaning this up, when an option in category 2 is set, we can check the limit specified by command line flags and reject the query if necessary.
        Hide
        tarmstrong Tim Armstrong added a comment -

        Tianyi Wang A couple of clarifying questions.

        1. Why wouldn't we support limits on memory size similar to numeric values?
        2. Could we extend this to support the enum values and special cases? E.g. by allowing a list of accepted values? Maybe this would be a follow-on extension.
        Show
        tarmstrong Tim Armstrong added a comment - Tianyi Wang A couple of clarifying questions. Why wouldn't we support limits on memory size similar to numeric values? Could we extend this to support the enum values and special cases? E.g. by allowing a list of accepted values? Maybe this would be a follow-on extension.
        Hide
        tianyiwang Tianyi Wang added a comment -

        Tim Armstrong
        These all make sense, but there should be some more semantics choices made. And it might break things if users rely on current behaviour:
        1. Currently a memory value -1 or 0 means infinite memory.
        1) For the purpose of cleaning up, do we really need two special values?
        2) For range limit specification, using 0 as infinite does not make sense. If we use some special token, e.g. "INF", and treat 0 as 0 literally, there would be inconsistency.
        2. For enums, there are problems in both cleanup and adding limits. Some enums accept integer input as well, while others don't. We need to discuss whether to unify them and how this work with range and list restrictions.
        3. I forgot boolean options on the list above. I've added it to the list.

        Show
        tianyiwang Tianyi Wang added a comment - Tim Armstrong These all make sense, but there should be some more semantics choices made. And it might break things if users rely on current behaviour: 1. Currently a memory value -1 or 0 means infinite memory. 1) For the purpose of cleaning up, do we really need two special values? 2) For range limit specification, using 0 as infinite does not make sense. If we use some special token, e.g. "INF", and treat 0 as 0 literally, there would be inconsistency. 2. For enums, there are problems in both cleanup and adding limits. Some enums accept integer input as well, while others don't. We need to discuss whether to unify them and how this work with range and list restrictions. 3. I forgot boolean options on the list above. I've added it to the list.
        Hide
        tarmstrong Tim Armstrong added a comment -

        For the memory case allowing a range with no special handling of 0/-1 is sufficient to express most policies that people would want to use. I think the two common use case x..y where x > 0 to specify a range bounded on both ends and 1..y to express an upper bound. You'd need special handling to express x..INF and it make the behaviour of 0..x a bit weird, but the behaviour is simple at least.

        I don't think we need two special values to express "infinite memory' but I think cleaning it up would be a breaking change. IMO -1 should be the special value because 0 is sometimes a reasonable value, e.g. for scratch_limit. However a lot of older query options limit mem_limit used "0" and have that documented as the way to specify "unlimited".

        I think that's a good point about enums since there's multiple ways to specify the same value. I don't see an immediate need for that so maybe it's ok to ignore that case for now.

        Show
        tarmstrong Tim Armstrong added a comment - For the memory case allowing a range with no special handling of 0/-1 is sufficient to express most policies that people would want to use. I think the two common use case x..y where x > 0 to specify a range bounded on both ends and 1..y to express an upper bound. You'd need special handling to express x..INF and it make the behaviour of 0..x a bit weird, but the behaviour is simple at least. I don't think we need two special values to express "infinite memory' but I think cleaning it up would be a breaking change. IMO -1 should be the special value because 0 is sometimes a reasonable value, e.g. for scratch_limit. However a lot of older query options limit mem_limit used "0" and have that documented as the way to specify "unlimited". I think that's a good point about enums since there's multiple ways to specify the same value. I don't see an immediate need for that so maybe it's ok to ignore that case for now.

          People

          • Assignee:
            tianyiwang Tianyi Wang
            Reporter:
            tarmstrong Tim Armstrong
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:

              Development