  1. Apache Drill
  2. DRILL-4349

parquet reader returns wrong results when reading a nullable column that starts with a large number of nulls (>30k)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.5.0
    • Component/s: Storage - Parquet
    • Labels:
      None

      Description

      When reading a nullable column, if a single pass reads only null values, the Parquet reader resets the value of pageReader.readPosInBytes, which causes wrong data to be read from the file.

      To reproduce the issue, create a CSV file (repro.csv) with 2 columns (id, val) and 50100 rows, where id equals the row number and val is empty for the first 50k rows and equal to id for the remaining rows.
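
      The file described above could be generated with a short script like the following. This is a minimal sketch, not the reporter's original tooling: the function name write_repro_csv is hypothetical, ids are assumed to start at 0 (inferred from the expected output, where id 50000 is the first row with a non-empty val), and the file is assumed to have no header row and "\n" line endings so that the columns[0] / columns[1] casts in the CTAS statement apply to every line.

```python
import csv

TOTAL_ROWS = 50_100   # total number of rows, per the description
NULL_ROWS = 50_000    # leading rows whose val column is empty


def write_repro_csv(path="repro.csv", total_rows=TOTAL_ROWS, null_rows=NULL_ROWS):
    """Write rows of (id, val): val is empty for the first null_rows rows,
    and equal to id for the remaining rows.

    Assumption: ids are 0-based and there is no header row.
    """
    with open(path, "w", newline="") as f:
        # Use "\n" explicitly so no stray "\r" ends up in the last column.
        writer = csv.writer(f, lineterminator="\n")
        for row_id in range(total_rows):
            val = "" if row_id < null_rows else row_id
            writer.writerow([row_id, val])


if __name__ == "__main__":
    write_repro_csv()
```

      With these assumptions, the first 50000 lines have the form "0," through "49999," and the last 100 lines run from "50000,50000" to "50099,50099".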

      Create a Parquet table from the CSV file:

      CREATE TABLE `repro_parquet` AS SELECT CAST(columns[0] AS INT) AS id, CAST(NULLIF(columns[1], '') AS DOUBLE) AS val from `repro.csv`;
      

      Now, if you query any of the non-null values, you get wrong results:

      0: jdbc:drill:zk=local> select * from `repro_parquet` where id>=50000 limit 10;
      +--------+---------------------------+
      |   id   |            val            |
      +--------+---------------------------+
      | 50000  | 9.11337776337441E-309     |
      | 50001  | 3.26044E-319              |
      | 50002  | 1.4916681476489723E-154   |
      | 50003  | 2.0000000018890676        |
      | 50004  | 2.681561588521345E154     |
      | 50005  | -2.1016574E-317           |
      | 50006  | -1.4916681476489723E-154  |
      | 50007  | -2.0000000018890676       |
      | 50008  | -2.681561588521345E154    |
      | 50009  | 2.1016574E-317            |
      +--------+---------------------------+
      10 rows selected (0.238 seconds)
      

      Here are the expected values, read directly from the CSV file:

      0: jdbc:drill:zk=local> select * from `repro.csv` where cast(columns[0] as int)>=50000 limit 10;
      +--------------------+
      |      columns       |
      +--------------------+
      | ["50000","50000"]  |
      | ["50001","50001"]  |
      | ["50002","50002"]  |
      | ["50003","50003"]  |
      | ["50004","50004"]  |
      | ["50005","50005"]  |
      | ["50006","50006"]  |
      | ["50007","50007"]  |
      | ["50008","50008"]  |
      | ["50009","50009"]  |
      +--------------------+
      

      I confirmed that the file is written correctly and that the issue is in the Parquet reader. I already have a fix for it.

        Attachments

        1. drill4349.tar.gz
          334 kB
          Abdel Hakim Deneche

            People

            • Assignee: Jason Altekruse (jaltekruse)
            • Reporter: Abdel Hakim Deneche (adeneche)
