Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-7390

kvgen/flatten doesn't produce same result from .json or .parquet

    XMLWordPrintableJSON

Details

    Description

      With a Parquet produce from JSON (ANIMALS_json and ANIMALS_pqt in attachment in tar.gz format)

      CREATE TABLE ....`ANIMALS_pqt` AS 
      (SELECT * FROM ....`ANIMALS_json`);
      

      Same request, using kvgen and flatten, applied on JSON and Parquet doesn't produce the same results

      SELECT count(*) FROM 
       (SELECT f FROM 
        (SELECT flatten(k) AS f FROM 
         (SELECT kvgen(animals) AS k FROM ....`ANIMALS_json`)) AS x)
      =>
      8482290
      
      SELECT count(*) FROM 
       (SELECT f FROM 
        (SELECT flatten(k) AS f FROM 
         (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt`)) AS x)
      =>
      929430
      

      Or another example:

      SELECT count(*) FROM 
       (SELECT f FROM 
        (SELECT flatten(k) AS f FROM 
         (SELECT kvgen(animals) AS k FROM ....`ANIMALS_json`)) AS x WHERE x.f.key='Cat')
      =>
      121368
      
      SELECT count(*) FROM 
       (SELECT f FROM 
        (SELECT flatten(k) AS f FROM 
         (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt`)) AS x WHERE x.f.key='Cat')
      =>
      13470
      

      The real result is the json one, as proved by:

      cat ANIMALS_json/*.json | grep -c "Cat"
      121368
      

      Please note that, here, It's appear the particular file ANIMALS_pqt/1_0_0.parquet is not well computed but the other are correct:

      SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_0_0.parquet`)) AS x WHERE x.f.key='Cat');
      => 107898
      SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_1_0.parquet`)) AS x WHERE x.f.key='Cat');
      => 2429
      SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_2_0.parquet`)) AS x WHERE x.f.key='Cat');
      => 5419
      SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_3_0.parquet`)) AS x WHERE x.f.key='Cat');
      => 5622
      

                         2429+5419+5622=13470  (result of request on ANIMALS_pqt)
      107898+2429+5419+5622=121368 (result of request on ANIMALS_json)

      Attachments

        1. ANIMALS_json.tar.gz
          7.01 MB
          benj
        2. ANIMALS_pqt.tar.gz
          2.41 MB
          benj

        Activity

          People

            Unassigned Unassigned
            benj641 benj
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: