Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-7390

kvgen/flatten doesn't produce same result from .json or .parquet

    XMLWordPrintableJSON

    Details

      Description

      With a Parquet produce from JSON (ANIMALS_json and ANIMALS_pqt in attachment in tar.gz format)

      CREATE TABLE ....`ANIMALS_pqt` AS 
      (SELECT * FROM ....`ANIMALS_json`);
      

      Same request, using kvgen and flatten, applied on JSON and Parquet doesn't produce the same results

      SELECT count(*) FROM 
       (SELECT f FROM 
        (SELECT flatten(k) AS f FROM 
         (SELECT kvgen(animals) AS k FROM ....`ANIMALS_json`)) AS x)
      =>
      8482290
      
      SELECT count(*) FROM 
       (SELECT f FROM 
        (SELECT flatten(k) AS f FROM 
         (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt`)) AS x)
      =>
      929430
      

      Or another example:

      SELECT count(*) FROM 
       (SELECT f FROM 
        (SELECT flatten(k) AS f FROM 
         (SELECT kvgen(animals) AS k FROM ....`ANIMALS_json`)) AS x WHERE x.f.key='Cat')
      =>
      121368
      
      SELECT count(*) FROM 
       (SELECT f FROM 
        (SELECT flatten(k) AS f FROM 
         (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt`)) AS x WHERE x.f.key='Cat')
      =>
      13470
      

      The real result is the json one, as proved by:

      cat ANIMALS_json/*.json | grep -c "Cat"
      121368
      

      Please note that, here, It's appear the particular file ANIMALS_pqt/1_0_0.parquet is not well computed but the other are correct:

      SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_0_0.parquet`)) AS x WHERE x.f.key='Cat');
      => 107898
      SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_1_0.parquet`)) AS x WHERE x.f.key='Cat');
      => 2429
      SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_2_0.parquet`)) AS x WHERE x.f.key='Cat');
      => 5419
      SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_3_0.parquet`)) AS x WHERE x.f.key='Cat');
      => 5622
      

                         2429+5419+5622=13470  (result of request on ANIMALS_pqt)
      107898+2429+5419+5622=121368 (result of request on ANIMALS_json)

        Attachments

        1. ANIMALS_pqt.tar.gz
          2.41 MB
          benj
        2. ANIMALS_json.tar.gz
          7.01 MB
          benj

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              benj641 benj
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated: