[DRILL-7762] Parquet files with too many columns generated in Python (pyarrow, pandas) are not readable - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.17.0
Fix Version/s: Future
Component/s: Functions - Drill, SQL Parser, Storage - Parquet
Labels:
- bug
- documentation
- parquet
- python
- query

Description

When I run the query

SELECT * FROM s3.datascience.`./government/shape_file_snappy512.parquet`

against a Parquet file with too many columns (37, per the schema in the error below) that was generated in Python, I get the following error:

User Error Occurred: Error in drill parquet reader (complex). Message: Failure in setting up reader Parquet Metadata: ParquetMetaData{FileMetaData{schema: message schema {
  optional int64 OBJECTID_1;
  optional int64 OBJECTID;
  optional binary Cs012011 (UTF8);
  optional double Nis_012011;
  optional binary Sec012011 (UTF8);
  optional binary CS102001 (UTF8);
  optional binary CS031991 (UTF8);
  optional binary CS031981 (UTF8);
  optional binary Sector_nl (UTF8);
  optional binary Sector_fr (UTF8);
  optional binary Gemeente (UTF8);
  optional binary Commune (UTF8);
  optional binary Arrond_nl (UTF8);
  optional binary Arrond_fr (UTF8);
  optional binary Prov_nl (UTF8);
  optional binary Prov_fr (UTF8);
  optional binary Reg_nl (UTF8);
  optional binary Reg_fr (UTF8);
  optional binary Nuts1 (UTF8);
  optional binary Nuts2 (UTF8);
  optional binary Nuts3_new (UTF8);
  optional int64 Inhab;
  optional double Gis_Perime;
  optional double Gis_area_h;
  optional double Cad_area_h;
  optional double Shape_Leng;
  optional double Shape_Area;
  optional binary codesecteu (UTF8);
  optional binary CD_REFNIS (UTF8);
  optional binary CD_SECTOR (UTF8);
  optional double TOTAL;
  optional double MALES;
  optional double FEMALES;
  optional double group0_14;
  optional double group15_64;
  optional double group65ETP;
  optional binary areaofdis (UTF8);
}

The Parquet file was generated with pyarrow using the 'snappy' compression codec and a data page size of 512MB. Smaller and larger page sizes give the same error. The files reside on an on-premise S3 object store (Dell ECS).

Other queries on the same Parquet file (count, select OBJECTID_1 from ..) succeed. A 'select *' on a Parquet file with fewer columns, generated the same way, also runs without any issues.

A workaround is to export a CSV file from Python and generate the Parquet file with Drill itself from that CSV file (CREATE TABLE s3.datascience.`./government/tes3` AS SELECT * FROM s3.datascience.`./government/shape_file.csv`). Querying a Parquet file generated this way does not cause any problems, although its content is exactly the same as that of the Parquet file generated in Python.

Is there an explanation for why Drill behaves this way, and what are the specifications of the Parquet file generated by Drill itself (so that we can aim to match them when creating a Parquet file with pyarrow/pandas)?

Attachments

- error_drill_parquet.doc, 330 kB, 01/Jul/20 14:10, Maarten D'Haene
- shape_file_snappy512.parquet, 25 kB, 01/Jul/20 14:10, Maarten D'Haene

People

Assignee: Unassigned

Reporter: Maarten D'Haene

Votes: 0

Watchers: 1

Dates

Created: 01/Jul/20 14:13

Updated: 01/Jul/20 14:13