  1. Apache Drill
  2. DRILL-8023

Empty dict page breaks the "old" Parquet reader

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Component/s: Storage - Parquet

    Description

      When the Python libraries Dask and pyarrow are used to export a DataFrame to Parquet and one of its columns is entirely null, Apache Drill fails to read the resulting file and raises "INTERNAL_ERROR ERROR: null". Dask and Spark can read the same dask+pyarrow Parquet files without error.

      Example:

      Create the Parquet files in Python, once with the pyarrow engine and once with the fastparquet engine.

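      import pandas as pd
      import dask.dataframe as dd
      
      # Column C contains only nulls; it is the column that triggers the error.
      df = pd.DataFrame(
          {
              'A': [1, 2, 3],
              'B': ['a', 'b', 'c'],
              'C': [None, None, None]
          }
      )
      
      # One partition, so each output holds a single part file.
      ddf = dd.from_pandas(df, npartitions=1)
      
      # Write the same data with both engines for comparison.
      ddf.to_parquet('data/pyarrow_test.parquet', engine='pyarrow')
      ddf.to_parquet('data/fastparquet_test.parquet', engine='fastparquet')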
      

      Read these Parquet files with Drill:

      Apache Drill 1.19.0
      "Everything is easier with Drill."
      apache drill> SELECT * FROM dfs.`data/fastparquet_test.parquet`;
      +---------------------+---+---+------+
      | __null_dask_index__ | A | B |  C   |
      +---------------------+---+---+------+
      | 0                   | 1 | a | null |
      | 1                   | 2 | b | null |
      | 2                   | 3 | c | null |
      +---------------------+---+---+------+
      3 rows selected (0.179 seconds)
      
      apache drill> SELECT * FROM dfs.`data/pyarrow_test.parquet`;
      Error: INTERNAL_ERROR ERROR: null
      
      Fragment: 0:0
      
      Please, refer to logs for more information.
      
      [Error Id: 25034075-69b0-415e-8bb2-d7aa3d834653 on 75a796902ffe:31010](state=,code=0)
      

      Narrow down to the column that is causing the issue:

      apache drill> SELECT A, B FROM dfs.`data/pyarrow_test.parquet`;
      +---+---+
      | A | B |
      +---+---+
      | 1 | a |
      | 2 | b |
      | 3 | c |
      +---+---+
      3 rows selected (0.145 seconds)
      
      apache drill> SELECT C FROM dfs.`data/pyarrow_test.parquet`;
      Error: INTERNAL_ERROR ERROR: null
      
      Fragment: 0:0
      Please, refer to logs for more information.
      [Error Id: 932ef1d1-7c56-4833-b906-0da0c7c155f9 on 75a796902ffe:31010] (state=,code=0)
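
The column-by-column narrowing above could be automated against Drill's REST API. This is a sketch under assumptions: the web server address localhost:8047 is a guess at a default local setup, the helper names are hypothetical, and how a failed query is reported (an HTTP error status or an errorMessage field in the response) may vary between Drill versions, so both are treated as a failure.

```python
import json
import urllib.error
import urllib.request

# Assumption: Drill's web server is reachable here; adjust to your setup.
DRILL_QUERY_URL = "http://localhost:8047/query.json"


def column_query(table_path, column):
    """Build a Drill query selecting a single column from a dfs path."""
    return f"SELECT `{column}` FROM dfs.`{table_path}`"


def run_query(sql, url=DRILL_QUERY_URL):
    """POST one SQL statement to Drill and return the parsed JSON reply."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    # Assumption: some versions report failures inside a successful reply.
    if isinstance(result, dict) and "errorMessage" in result:
        raise RuntimeError(result["errorMessage"])
    return result


def failing_columns(table_path, columns, run=run_query):
    """Return the columns whose single-column SELECT fails.

    Connection problems are not caught, so an unreachable server is not
    mistaken for a failing column.
    """
    failed = []
    for column in columns:
        try:
            run(column_query(table_path, column))
        except (urllib.error.HTTPError, RuntimeError):
            failed.append(column)
    return failed
```

With a running Drill instance, failing_columns('data/pyarrow_test.parquet', ['A', 'B', 'C']) would be expected to single out C for the file in this report.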
      

      Dependency versions:

      Apache Drill 1.19.0
      Python 3.9.7
      dask==2021.10.0
      pyarrow==6.0.0
      fastparquet==0.7.1
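
For anyone reproducing this, the Python side of the version list can be collected with the standard library. The sketch below is illustrative only: the package list mirrors the one above, the function name is hypothetical, and the Drill version still has to be read from the Drill shell banner.

```python
import platform
from importlib import metadata

PACKAGES = ["dask", "pyarrow", "fastparquet"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(PACKAGES)
print(f"Python {platform.python_version()}")
for name, version in report.items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```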
      

      Attached are the parquet files I tested with.

      Attachments

        1. fastparquet_test.parquet.tar.gz
          0.8 kB
          Alex Delgado
        2. pyarrow_test.parquet.tar.gz
          2 kB
          Alex Delgado

            People

              Assignee: James Turton (dzamo)
              Reporter: Alex Delgado (adelg003)
              Votes: 0
              Watchers: 3
