Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.0.0
- Fix Version/s: None
Description
The first table's schema becomes the common schema for the whole Dataset, which can cause problems with sparse data.
Consider the example below: when the first chunk is full of NA values, pyarrow ignores the pandas dtypes for the whole dataset:
# get dataset
!wget https://physionet.org/files/mimiciii-demo/1.4/D_ITEMS.csv

import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
import pyarrow.dataset as ds
import shutil
from pathlib import Path

def foo(input_csv='D_ITEMS.csv', output='tmp.parquet', chunksize=1000):
    if Path(output).exists():
        shutil.rmtree(output)
    # write dataset
    d_items = pd.read_csv(input_csv, index_col='row_id',
                          usecols=['row_id', 'itemid', 'label', 'dbsource',
                                   'category', 'param_type'],
                          dtype={'row_id': int, 'itemid': int, 'label': str,
                                 'dbsource': str, 'category': str,
                                 'param_type': str},
                          chunksize=chunksize)
    for i, chunk in enumerate(d_items):
        table = pa.Table.from_pandas(chunk)
        if i == 0:
            schema1 = pa.Schema.from_pandas(chunk)
            schema2 = table.schema
            # print(table.field('param_type'))
        pq.write_to_dataset(table, root_path=output)
    # read dataset
    dataset = ds.dataset(output)
    # compare schemas
    print('Schemas are equal: ', dataset.schema == schema1 == schema2)
    print(dataset.schema.types)
    print('Should be string', dataset.schema.field('param_type'))
    return dataset
dataset = foo()
dataset.to_table()

>>> Schemas are equal:  False
[DataType(int64), DataType(string), DataType(string), DataType(null), DataType(null), DataType(int64)]
Should be string pyarrow.Field<param_type: null>
---------------------------------------------------------------------------
ArrowTypeError: fields had matching names but differing types.
From: category: string
To: category: null
If you list the schemas of the individual files, you'll see that almost all of the parquet files ignored the pandas dtypes:
import os

for i in os.listdir('tmp.parquet/'):
    print(ds.dataset(os.path.join('tmp.parquet/', i)).schema.field('param_type'))

>>> pyarrow.Field<param_type: null>
pyarrow.Field<param_type: string>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: string>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: string>
pyarrow.Field<param_type: string>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: null>
But if we read a bigger chunk of data that contains non-NA values, everything is OK:
dataset = foo(chunksize=10000)
dataset.to_table()

>>> Schemas are equal:  True
[DataType(int64), DataType(string), DataType(string), DataType(string), DataType(string), DataType(int64)]
Should be string pyarrow.Field<param_type: string>
pyarrow.Table
itemid: int64
label: string
dbsource: string
category: string
param_type: string
row_id: int64
Checking for NAs in the data:
pd.read_csv('D_ITEMS.csv', nrows=1000)['param_type'].unique()
>>> array([nan])

pd.read_csv('D_ITEMS.csv', nrows=10000)['param_type'].unique()
>>> array([nan, 'Numeric', 'Text', 'Date time', 'Solution', 'Process',
       'Checkbox'], dtype=object)
PS: switching issue reporting from GitHub to Jira is an outstanding move.
Attachments
Issue Links
- is duplicated by
  - ARROW-12078 The first table schema becomes a common schema for the full Dataset (Closed)
  - ARROW-12079 The first table schema becomes a common schema for the full Dataset (Closed)