Apache Drill / DRILL-7399

Querying a Parquet file with a boolean data type returns wrong results

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.16.0
    • Fix Version/s: 1.20.1
    • Component/s: Storage - Parquet
    • Labels: None

    Description

      The following query returns a wrong value for the boolean column press_run_1:

       SELECT * FROM dfs.root.`/tmp/newrule22_3_1.parquet` WHERE cycle_id=23435119

      The query returns press_run_1 = 'false',
      but the Parquet file contains press_run_1 = 'true' for this record.

      Many more records show the same problem if you try different selects.

      ATTACHED: newrule22_3_1.parquet file.

      Attachments

        1. newrule22_3_1.parquet
          188 kB
          Fabian Barreiro

        Activity

          arina Arina Ielchiieva added a comment -

          This looks like a problem in the default Parquet reader. As a workaround, consider switching to the second reader to cover this case: set `store.parquet.use_new_reader` = true;

          fbarreiro Fabian Barreiro added a comment -

          It seems to work for my test files. Use

          alter system set `store.parquet.use_new_reader` = true;

          to change the setting at the system level.
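
          The session-level and system-level forms of this workaround can be sketched as follows. This is a minimal sketch, assuming a Drill version that exposes `store.parquet.use_new_reader`; the `sys.options` lookup and the `RESET` statements do not appear in this thread and are shown only as one way to check and undo the change.

          ```sql
          -- Session scope: applies only to the current connection.
          ALTER SESSION SET `store.parquet.use_new_reader` = true;

          -- System scope: becomes the default for all sessions.
          ALTER SYSTEM SET `store.parquet.use_new_reader` = true;

          -- Inspect the current setting (the columns of sys.options vary by Drill version).
          SELECT * FROM sys.options WHERE name = 'store.parquet.use_new_reader';

          -- Undo the change and return to the default reader.
          ALTER SESSION RESET `store.parquet.use_new_reader`;
          ALTER SYSTEM RESET `store.parquet.use_new_reader`;
          ```

          Setting the option at session scope first limits the change to one connection, which may be preferable while the effect on other queries is still unknown.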
          schlicher Nils Schlicher added a comment -

          I have the same problem in 1.17.0 with my Parquet data.

          The following query returns wrong results. Using the new reader solves the problem, but the new reader is much slower than the default one.

          SELECT COUNT( * ) FROM data WHERE measurement_point_valid= false;

          suicas Dave Challis added a comment - - edited

          Also experiencing this issue in 1.17.0.

          I've been unable to find any documentation about store.parquet.use_new_reader though, so am reluctant to switch to using it without knowing about any implications.

          In the options section of Drill, the documentation for it also reads "Not supported in this release.", which makes it sound like this isn't a good option to go with.

          The only workaround I've found is to start writing booleans as integers (0 or 1) instead, though this isn't ideal, as large numbers of existing Parquet files are unusable.

          dzamo James Turton added a comment -

          Based on my testing with the attached file using 1.19.0 (broken) and 1.20.1 (working), this was fixed in either 1.20.0 or 1.20.1. My transcript from 1.20.1 follows.

           

          Apache Drill 1.20.1
          "Say hello to my little Drill."
          apache drill> alter session set `store.parquet.use_new_reader` = true;
          ok       true
          summary  store.parquet.use_new_reader updated.
          1 row selected (0.51 seconds)
          apache drill> SELECT press_run_1, count() FROM dfs.tmp.`newrule22_3_1.parquet` group by press_run_1;
          press_run_1  false
          EXPR$1       11032
          press_run_1  true
          EXPR$1       9421
          2 rows selected (2.488 seconds)
          apache drill> alter session set `store.parquet.use_new_reader` = false;
          ok       true
          summary  store.parquet.use_new_reader updated.
          1 row selected (0.083 seconds)
          apache drill> SELECT press_run_1, count() FROM dfs.tmp.`newrule22_3_1.parquet` group by press_run_1;
          press_run_1  false
          EXPR$1       11032
          press_run_1  true
          EXPR$1       9421
          2 rows selected (0.424 seconds)
          

           

          dzamo James Turton added a comment -

          As a matter of interest, the transcript timings show that the old reader is ~6x faster for this query.

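
          For reference, the ~6x figure follows from the two elapsed times reported in the 1.20.1 transcript: 2.488 seconds with the new reader and 0.424 seconds with the default reader. These are single runs, so the ratio is only indicative.

          ```latex
          \frac{t_{\text{new}}}{t_{\text{default}}} = \frac{2.488\ \text{s}}{0.424\ \text{s}} \approx 5.87
          ```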

          People

            Assignee: dzamo James Turton
            Reporter: fbarreiro Fabian Barreiro
            Votes: 1
            Watchers: 6
