Details
- Bug
- Status: Closed
- Critical
- Resolution: Fixed
- Impala 2.10.0
Description
The Impala docs currently state that strings have a maximum size of 32KB: http://impala.apache.org/docs/build/html/topics/impala_string.html
This is misleading and causes confusion. We should make it clear that 32KB is not the actual limit, and also that performance degrades and memory consumption grows with larger strings.
Here's the actual behaviour. I'm unsure of the best way to characterise it succinctly.
- We expect that queries operating on 32KB strings will work reliably and not hit significant performance or memory problems (unless you have very complex queries, very many columns, etc).
- There is an absolute hard limit of ~2GB on strings.
- Memory consumption of queries will grow as string sizes increase and the probability of hitting "memory limit exceeded" will increase.
- Performance of queries will decrease as strings get larger.
- The row size (i.e. total size of all string and other columns) is limited in various places by the implementation of various operators:
  - Rows coming from the right side of any hash join
  - Rows coming from either side of a spilling hash join
  - Rows being sorted
  - Rows in a grouping aggregation
In Impala <= 2.9 the default limit in those places is 8MB.
With the IMPALA-3200 changes that default number may decrease significantly, to 2MB or less, but there will be a new query option, something like MAX_ROW_SIZE, that will make this row size limit configurable on a per-query basis.
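To make the thresholds above concrete, here is a minimal Python sketch that checks one row against them. It is not Impala code: the function name `check_row`, its parameters, and the constants are hypothetical, the ~2GB figure is treated as exactly 2 GiB for simplicity, and the row size is approximated as the sum of column sizes with no per-row overhead.

```python
# Illustrative only: the constants mirror the numbers quoted in this issue.
# None of these names are Impala APIs.
HARD_STRING_LIMIT = 2 * 1024**3    # "~2GB" hard limit, assumed to be exactly 2 GiB here
RELIABLE_STRING_SIZE = 32 * 1024   # 32KB: expected to work reliably
DEFAULT_ROW_LIMIT = 8 * 1024**2    # 8MB default row size limit in Impala <= 2.9


def check_row(string_sizes, fixed_bytes=0, row_limit=DEFAULT_ROW_LIMIT):
    """Return warnings for a row given its string sizes in bytes.

    fixed_bytes is the assumed total size of the non-string columns.
    row_limit stands in for the per-query row size limit.
    """
    warnings = []
    for i, size in enumerate(string_sizes):
        if size > HARD_STRING_LIMIT:
            warnings.append(f"string {i}: exceeds the hard ~2GB limit")
        elif size > RELIABLE_STRING_SIZE:
            warnings.append(f"string {i}: above 32KB, expect more memory use and slower queries")
    # Approximate row size: all string columns plus all other columns.
    row_size = fixed_bytes + sum(string_sizes)
    if row_size > row_limit:
        warnings.append("row: too large for hash joins, sorts and grouping aggregations")
    return warnings
```

For example, `check_row([5 * 1024**2, 4 * 1024**2])` reports both strings as above 32KB and the 9MB row as over the 8MB default, while passing `row_limit=16 * 1024**2` leaves only the two string warnings.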