Details
-
Sub-task
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
-
None
-
ghx-label-7
Description
DATE values describe a particular year/month/day, in the form YYYY-MM-DD. For example, DATE '2013-01-01'. Date types do not have a time of day component. The range of values supported for the Date type is 0000-01-01 to 9999-12-31.
The initial DATE type support should incluide the following changes:
- new internal type
- casting between DATE and other types
- codegen infrastructure for expression evaluation
- "IS [NOT] NULL" and "[NOT] IN" predicates
- common comparison operators
- BETWEEN operator
- conditional functions
- infrastructure changes for builtin scalar functions.
- some built-in functions: aggregate functions, analytical functions, math functions.
- support partitioning.
- text support only.
These items are tightly coupled and it makes sense to implement them in one change-set.
Attachments
Issue Links
- causes
-
IMPALA-10086 SqlCastException when comparing char with varchar
- Resolved
- is related to
-
IMPALA-7374 Impala Doc: Doc DATE type
- Closed
Activity
Commit 9f05cf79fa385ee6a5245ec7cb9ec1b9302c543d in impala's branch refs/heads/master from Michael Smith
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=9f05cf79f ]
IMPALA-10086: Implicit cast comparing char and varchar
Until IMPALA-7368, Impala allowed comparing char and varchar slots as in
select * from functional.chars_tiny where cs = vc;
IMPALA-7368 added DATE type support, and as part of that changed
function call resolution to use a fit function based on the number of
arguments that match the call types. Previously the comparison above
would take the first matching function, which happened to be equality
between STRING and STRING; CHAR and VARCHAR can both be implicitly cast
to STRING, so this function worked. With the new function resolution,
the best fit is equality between VARCHAR and VARCHAR, however implicit
casting to VARCHAR from CHAR wasn't allowed.
The behavior before IMPALA-7368 was somewhat accidental; it depended on
the order that builtin EQ functions are added via
BinaryPredicate.initBuiltins -> Type.getSupportedTypes. Supported types
happened to be ordered with STRING preceding VARCHAR and CHAR. The fit
function makes sense and changing its behavior may have other
consequences; it also makes sense that CHAR should be castable to
VARCHAR.
This change allows implicit cast between matching types. Functionally it
only changes how we handle char/varchar comparison with wildcard
char/varchar, because decimals are handled before checking for matching
types and other type matching is the same as equals. It now allows
casting to a compatible type when it is a char or varchar and the target
type is a wildcard version of the same.
Does not attempt to address differences from CHAR padding (IMPALA-1652).
Testing:
- Adds tests covering cast comparison and other implicit conversions.
- Passed exhaustive test run.
Change-Id: Ib89d0a391bc8f2152ecd9151c8872a01ba19c436
Reviewed-on: http://gerrit.cloudera.org:8080/20425
Reviewed-by: Peter Rozsa <prozsa@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Commit 684a54a89e1752534dc713b05185bae1a6b69ba8 in impala's branch refs/heads/master from Attila Jeges
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=684a54a ]
IMPALA-7368: Change supported year range for DATE values to 1..9999
Before this patch the supported year range for DATE type started with
year 0. This contradicts the ANSI SQL standard that defines the valid
DATE value range to be 0001-01-01 to 9999-12-31.
Change-Id: Iefdf1c036834763f52d44d0c39a25a1f04e41e07
Reviewed-on: http://gerrit.cloudera.org:8080/14349
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Commit 27fa27e808d74db3e07be419dd0657c4448534e5 in impala's branch refs/heads/master from Attila Jeges
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=27fa27e ]
IMPALA-8198: DATE: Read from avro.
This change is a follow-up to IMPALA-7368 and adds support for DATE
type to the avro scanner.
Similarly to parquet, avro uses DATE logical type for dates. DATE
logical type annotates an INT32 that stores the number of days since
the unix epoch, 1 January 1970.
This representation introduces an avro interoperability issue between
Impala and older versions of Hive:
- Before version 3.1, Hive used Julian calendar to represent dates
up to 1582-10-05 and Gregorian calendar for dates starting with
1582-10-15. Dates between 1582-10-05 and 1582-10-15 were lost. - Impala uses proleptic Gregorian calendar, extending the Gregorian
calendar backward to dates preceding its official introduction in
1582-10-15.
This means that pre-1582-10-15 dates written to an avro table by Hive
will be read back incorrectly by Impala.
Note that Hive 3.1 switched to proleptic Gregorian calendar too, so
for Hive 3.1+ this is no longer an issue.
Dependency changes:
- BE uses avro 1.7.4-p5 from native-toolchain.
Change-Id: I7a9d5b93a22cf3a00244037e187f8c145cacc959
Reviewed-on: http://gerrit.cloudera.org:8080/13944
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Commit 2bce974990e19788ec359deec50f06d44ec92048 in impala's branch refs/heads/master from Attila Jeges
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=2bce974 ]
IMPALA-7370: DATE: Read/Write to parquet.
This change is a follow-up to IMPALA-7368 and adds support for DATE
type to the parquet scanner/writer. CREATE TABLE LIKE PARQUET
statements associated with data files that contain dates are also
supported.
Parquet uses DATE logical type for dates. DATE logical type annotates
an INT32 that stores the number of days from the Unix epoch, 1 January
1970.
This representation introduces a parquet interoperability issue
between Impala and older versions of Hive:
- Before version 3.1, Hive used Julian calendar to represent dates
up to 1582-10-05 and Gregorian calendar for dates starting with
1582-10-15. Dates between 1582-10-05 and 1582-10-15 were lost. - Impala uses proleptic Gregorian calendar, extending the Gregorian
calendar backward to dates preceding its official introduction in
1582-10-15.
This means that pre-1582-10-15 dates written to a parquet table by
Hive will be read back incorrectly by Impala and vice versa.
Note that Hive 3.1 switched to proleptic Gregorian calendar too, so
for Hive 3.1+ this is no longer an issue.
Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Reviewed-on: http://gerrit.cloudera.org:8080/13189
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Commit b5805de3e65fd1c7154e4169b323bb38ddc54f4f in impala's branch refs/heads/master from Attila Jeges
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=b5805de ]
IMPALA-7368: Add initial support for DATE type
DATE values describe a particular year/month/day in the form
yyyy-MM-dd. For example: DATE '2019-02-15'. DATE values do not have a
time of day component. The range of values supported for the DATE type
is 0000-01-01 to 9999-12-31.
This initial DATE type support covers TEXT and HBASE fileformats only.
'DateValue' is used as the internal type to represent DATE values.
The changes are as follows:
- Support for DATE literal syntax.
- Explicit casting between DATE and other types (note that invalid
casts will fail with an error just like invalid DECIMAL_V2 casts,
while failed casts to other types do no lead to warning or error): - from STRING to DATE. The string value must be formatted as
yyyy-MM-dd HH:mm:ss.SSSSSSSSS. The date component is mandatory,
the time component is optional. If the time component is
present, it will be truncated silently. - from DATE to STRING. The resulting string value is formatted as
yyyy-MM-dd. - from TIMESTAMP to DATE. The source timestamp's time of day
component is ignored. - from DATE to TIMESTAMP. The target timestamp's time of day
component is set to 00:00:00.
- Implicit casting between DATE and other types:
- from STRING to DATE if the source string value is used in a
context where a DATE value is expected. - from DATE to TIMESTAMP if the source date value is used in a
context where a TIMESTAMP value is expected.
- Since STRING -> DATE, STRING -> TIMESTAMP and DATE -> TIMESTAMP
implicit conversions are now all possible, the existing function
overload resolution logic is not adequate anymore.
For example, it resolves the
if(false, '2011-01-01', DATE '1499-02-02') function call to the
if(BOOLEAN, TIMESTAMP, TIMESTAMP) version of the overloaded
function, instead of the if(BOOLEAN, DATE, DATE) version.
This is clearly wrong, so the function overload resolution logic had
to be changed to resolve function calls to the best-fit overloaded
function definition if there are multiple applicable candidates.
An overloaded function definition is an applicable candidate for a
function call if each actual parameter in the function call either
matches the corresponding formal parameter's type (without casting)
or is implicitly castable to that type.
When looking for the best-fit applicable candidate, a parameter
match score (i.e. the number of actual parameters in the function
call that match their corresponding formal parameter's type without
casting) is calculated and the applicable candidate with the highest
parameter match score is chosen.
There's one more issue that the new resolution logic has to address:
if two applicable candidates have the same parameter match score and
the only difference between the two is that the first one requires a
STRING -> TIMESTAMP implicit cast for some of its parameters while
the second one requires a STRING -> DATE implicit cast for the same
parameters then the first candidate has to be chosen not to break
backward compatibility.
E.g: year('2019-02-15') function call must resolve to
year(TIMESTAMP) instead of year(DATE). Note, that year(DATE) is not
implemented yet, so this is not an issue at the moment but it will
be in the future.
When the resolution algorithm considers overloaded function
definitions, first it orders them lexicographically by the types in
their parameter lists. To ensure the backward compatible behavior
Primitivetype.DATE enum value has to come after
PrimitiveType.TIMESTAMP.
- Codegen infrastructure changes for expression evaluation.
- 'IS [NOT] NULL' and '[NOT] IN' predicates.
- Common comparison operators (including the 'BETWEEN' operator).
- Infrastructure changes for built-in functions.
- Some built-in functions: conditional, aggregate, analytical and
math functions. - C++ UDF/UDA support.
- Support partitioning and grouping by DATE.
- Beeswax, HiveServer2 support.
These items are tightly coupled and it makes sense to implement them
in one change-set.
Testing:
- A new partitioned TEXT table 'functional.date_tbl' (and the
corresponding HBASE table 'functional_hbase.date_tbl') was
introduced for DATE-related tests. - BE and FE tests were extended to cover DATE type.
- E2E tests:
- since DATE type is supported for TEXT and HBASE fileformats
only, most DATE tests were implemented separately in
tests/query_test/test_date_queries.py.
Note, that this change-set is not a complete DATE type implementation,
but it lays the foundation for future work:
- Add date support to the random query generator.
- Implement a complete set of built-in functions.
- Add Parquet support.
- Add Kudu support.
- Optionally support Avro and ORC.
For further details, seeIMPALA-6169.
Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f
Reviewed-on: http://gerrit.cloudera.org:8080/12481
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Commit 9f05cf79fa385ee6a5245ec7cb9ec1b9302c543d in impala's branch refs/heads/branch-4.3.0 from Michael Smith
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=9f05cf79f ]
IMPALA-10086: Implicit cast comparing char and varcharUntil
IMPALA-7368, Impala allowed comparing char and varchar slots as inselect * from functional.chars_tiny where cs = vc;
IMPALA-7368added DATE type support, and as part of that changedfunction call resolution to use a fit function based on the number of
arguments that match the call types. Previously the comparison above
would take the first matching function, which happened to be equality
between STRING and STRING; CHAR and VARCHAR can both be implicitly cast
to STRING, so this function worked. With the new function resolution,
the best fit is equality between VARCHAR and VARCHAR, however implicit
casting to VARCHAR from CHAR wasn't allowed.
The behavior before
IMPALA-7368was somewhat accidental; it depended onthe order that builtin EQ functions are added via
BinaryPredicate.initBuiltins -> Type.getSupportedTypes. Supported types
happened to be ordered with STRING preceding VARCHAR and CHAR. The fit
function makes sense and changing its behavior may have other
consequences; it also makes sense that CHAR should be castable to
VARCHAR.
This change allows implicit cast between matching types. Functionally it
only changes how we handle char/varchar comparison with wildcard
char/varchar, because decimals are handled before checking for matching
types and other type matching is the same as equals. It now allows
casting to a compatible type when it is a char or varchar and the target
type is a wildcard version of the same.
Does not attempt to address differences from CHAR padding (IMPALA-1652).
Testing:
Change-Id: Ib89d0a391bc8f2152ecd9151c8872a01ba19c436
Reviewed-on: http://gerrit.cloudera.org:8080/20425
Reviewed-by: Peter Rozsa <prozsa@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>