Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
Impala 3.0
Description
"Impala has no DELETE statement." and "Impala has no UPDATE statement." are not entirely true - Impala has both statements, but only for Kudu tables.
"For example, Impala does not support natural joins or anti-joins," - Impala does support anti-joins, via NOT IN/NOT EXISTS or even explicitly, like:
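A minimal sketch of what the docs could show here, assuming a Kudu-backed table. The table name kudu_demo, its columns, and the partitioning are hypothetical and only for illustration:

```sql
-- Hypothetical Kudu table; the name, columns, and partitioning are assumptions.
CREATE TABLE kudu_demo (id BIGINT PRIMARY KEY, name STRING)
  PARTITION BY HASH (id) PARTITIONS 2
  STORED AS KUDU;

INSERT INTO kudu_demo VALUES (1, 'a'), (2, 'b');

-- UPDATE and DELETE are accepted because the target is a Kudu table.
UPDATE kudu_demo SET name = 'c' WHERE id = 1;
DELETE FROM kudu_demo WHERE id = 2;
```

The same UPDATE and DELETE statements would be rejected for an HDFS-backed table, which is presumably what the original wording was trying to say.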
select * from functional.alltypes a1 left anti join functional.alltypestiny a2 on a1.id = a2.id;
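For comparison, the same anti-join written with NOT EXISTS, as a sketch against the same functional tables. A NOT IN form would also work, but it behaves differently if the subquery can return NULLs, so NOT EXISTS is the closer match:

```sql
-- Returns rows of alltypes that have no matching id in alltypestiny,
-- like the LEFT ANTI JOIN form.
select * from functional.alltypes a1
where not exists
  (select 1 from functional.alltypestiny a2 where a1.id = a2.id);
```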
"Within queries, Impala requires query aliases for any subqueries:" - this is only true for subqueries used as inline views in the FROM clause. E.g. the following works:
select * from functional.alltypes where id = (select min(id) from functional.alltypes);
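By contrast, the case where the alias requirement does apply is an inline view in the FROM clause. A sketch, where the alias v is an arbitrary name chosen for illustration:

```sql
-- Inline view in the FROM clause: this is the case that needs an alias (v).
select v.id from (select id from functional.alltypes) v;
```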
"Impala ... requires the CROSS JOIN operator for Cartesian products." - untrue, this works:
select * from functional.alltypes t1, functional.alltypes t2;
"Have you run the COMPUTE STATS statement on each table involved in join queries" - this isn't specific to queries with joins, although stats may have more impact there. We recommend that users run COMPUTE STATS on all tables.
"A CREATE TABLE statement with no PARTITIONED BY clause stores all the data files in the same physical location," - unpartitioned tables with multiple files can have files residing in different locations (and there are already 3 replicas per file by default, so the statement is a little misleading even if there's a single file). I think the later statement, "Have you partitioned at the right granularity so that there is enough data in each partition to parallelize the work for each query?", is also misleading for the same reason.
"The INSERT ... VALUES syntax is suitable for setting up toy tables with a few rows for functional testing, but because each such statement creates a separate tiny file in HDFS" - this advice only applies to HDFS tables. INSERT ... VALUES should work fine for Kudu tables, although the INSERT statements are not particularly fast.
"The number of expressions allowed in an Impala query might be smaller than for some other database systems, causing failures for very complicated queries" - this doesn't seem right. I don't know why the queries would fail. Also, the codegen time isn't really specific to expressions or WHERE clauses. There seems to be a point buried in there, but maybe it's essentially just that "Complex queries may have high codegen time".
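Something like the following could go in the docs to make the point that stats are a per-table step rather than a join-specific one. This is a sketch that uses functional.alltypes only as a stand-in for any table:

```sql
-- Gather table and column stats, whether or not the table is used in joins.
COMPUTE STATS functional.alltypes;

-- Check what was collected.
SHOW TABLE STATS functional.alltypes;
SHOW COLUMN STATS functional.alltypes;
```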