Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
1.16.0
-
None
-
None
Description
With a Parquet produce from JSON (ANIMALS_json and ANIMALS_pqt in attachment in tar.gz format)
CREATE TABLE ....`ANIMALS_pqt` AS (SELECT * FROM ....`ANIMALS_json`);
Same request, using kvgen and flatten, applied on JSON and Parquet doesn't produce the same results
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_json`)) AS x) => 8482290 SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt`)) AS x) => 929430
Or another example:
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_json`)) AS x WHERE x.f.key='Cat') => 121368 SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt`)) AS x WHERE x.f.key='Cat') => 13470
The real result is the json one, as proved by:
cat ANIMALS_json/*.json | grep -c "Cat"
121368
Please note that, here, It's appear the particular file ANIMALS_pqt/1_0_0.parquet is not well computed but the other are correct:
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_0_0.parquet`)) AS x WHERE x.f.key='Cat'); => 107898 SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_1_0.parquet`)) AS x WHERE x.f.key='Cat'); => 2429 SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_2_0.parquet`)) AS x WHERE x.f.key='Cat'); => 5419 SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_3_0.parquet`)) AS x WHERE x.f.key='Cat'); => 5622
2429+5419+5622=13470 (result of request on ANIMALS_pqt)
107898+2429+5419+5622=121368 (result of request on ANIMALS_json)