Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
If the Python libraries Dask and pyarrow are used to export a DataFrame to Parquet, and the resulting file has a column that is entirely null, Apache Drill raises an "INTERNAL_ERROR ERROR: null" error when querying that file. Dask and Spark are both able to read the same dask+pyarrow Parquet files.
Example:
Create the Parquet files in Python, once with the pyarrow engine and once with the fastparquet engine:
    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame(
        {
            'A': [1, 2, 3],
            'B': ['a', 'b', 'c'],
            'C': [None, None, None]
        }
    )
    ddf = dd.from_pandas(df, npartitions=1)
    ddf.to_parquet('data/pyarrow_test.parquet', engine='pyarrow')
    ddf.to_parquet('data/fastparquet_test.parquet', engine='fastparquet')
Read these Parquet files with Drill:
    Apache Drill 1.19.0
    "Everything is easier with Drill."
    apache drill> SELECT * FROM dfs.`data/fastparquet_test.parquet`;
    +---------------------+---+---+------+
    | __null_dask_index__ | A | B | C    |
    +---------------------+---+---+------+
    | 0                   | 1 | a | null |
    | 1                   | 2 | b | null |
    | 2                   | 3 | c | null |
    +---------------------+---+---+------+
    3 rows selected (0.179 seconds)
    apache drill> SELECT * FROM dfs.`data/pyarrow_test.parquet`;
    Error: INTERNAL_ERROR ERROR: null
    Fragment: 0:0
    Please, refer to logs for more information.
    [Error Id: 25034075-69b0-415e-8bb2-d7aa3d834653 on 75a796902ffe:31010] (state=,code=0)
Narrow down to the column that is causing the issue:
    apache drill> SELECT A, B FROM dfs.`data/pyarrow_test.parquet`;
    +---+---+
    | A | B |
    +---+---+
    | 1 | a |
    | 2 | b |
    | 3 | c |
    +---+---+
    3 rows selected (0.145 seconds)
    apache drill> SELECT C FROM dfs.`data/pyarrow_test.parquet`;
    Error: INTERNAL_ERROR ERROR: null
    Fragment: 0:0
    Please, refer to logs for more information.
    [Error Id: 932ef1d1-7c56-4833-b906-0da0c7c155f9 on 75a796902ffe:31010] (state=,code=0)
Dependency versions:
    Apache Drill 1.19.0
    Python 3.9.7
    dask==2021.10.0
    pyarrow==6.0.0
    fastparquet==0.7.1
Attached are the parquet files I tested with.
Attachments
Issue Links
- is duplicated by DRILL-6105 SYSTEM ERROR: NullPointerException (Closed)