We found a case where Impala returns incorrect results for a simple query. Our data contains a nested array of structures, and those structures contain other structures.
We generated a minimal sample data set that reproduces the issue.
SQL to create a table:
Please put the attached Parquet file in the table's location and refresh the table.
The sample data contains 2 users: one with 2 devices and the other with 3. Some of the devices.device_info.model fields are NULL.
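To make the expected row count concrete, here is a minimal Python sketch of the data shape described above. All field values are invented for illustration (they are not taken from the attached file), and the `flatten` helper is hypothetical; it only assumes that a query unnesting `devices` should produce one row per device, including devices whose model is NULL.

```python
# Hypothetical stand-in for the sample data. Only the shape follows the
# report: 2 users, with 2 and 3 devices, and some NULL (None) models.
users = [
    {"id": 1, "devices": [
        {"id": 11, "device_info": {"model": "A"}},
        {"id": 12, "device_info": {"model": None}},
    ]},
    {"id": 2, "devices": [
        {"id": 21, "device_info": {"model": None}},
        {"id": 22, "device_info": {"model": "B"}},
        {"id": 23, "device_info": {"model": "C"}},
    ]},
]


def flatten(users):
    """Yield one (user_id, model) row per device, keeping NULL models."""
    for user in users:
        for device in user["devices"]:
            yield (user["id"], device["device_info"]["model"])


rows = list(flatten(users))
print(len(rows))  # 5: one row per device, NULL models included
```

Under that assumption, a NULL model should show up as a NULL value in its row and should not cause the row to be dropped.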
When I issue a query:
I expect to get 5 records in the result, but I get only one (see 1.png).
If I change the query to:
I get two records in the result, which is still not the expected five.
We found a workaround for this problem. If we add device.id to the result columns, we get all records from the Parquet file:
The result is shown in 3.png.
But we can't rely on this workaround, because we don't need device.id in every query, and when Impala optimizes it away we get unpredictable results.
I tested a Hive query on this table, and it returns the expected results:
Please advise whether this is a problem in the Impala engine or a mistake in our query.