[ARROW-12080] [Python][Dataset] The first table schema becomes a common schema for the full Dataset - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.0.0
Fix Version/s: None
Component/s: Python
Labels:
- dataset
- datasets

External issue URL:
https://github.com/apache/arrow/issues/27905

Description

The schema of the first table written becomes the common schema for the whole Dataset. This causes problems with sparse data.

Consider the example below: when the first chunk is full of NA values, pyarrow ignores the pandas dtypes for the whole dataset:

# get dataset
!wget https://physionet.org/files/mimiciii-demo/1.4/D_ITEMS.csv


import pandas as pd 
import pyarrow.parquet as pq
import pyarrow as pa
import pyarrow.dataset as ds
import shutil
from pathlib import Path


def foo(input_csv='D_ITEMS.csv', output='tmp.parquet', chunksize=1000):
    if Path(output).exists():
        shutil.rmtree(output)

    # write dataset
    d_items = pd.read_csv(input_csv, index_col='row_id',
                      usecols=['row_id', 'itemid', 'label', 'dbsource', 'category', 'param_type'],
                      dtype={'row_id': int, 'itemid': int, 'label': str, 'dbsource': str,
                             'category': str, 'param_type': str}, chunksize=chunksize)
    for i, chunk in enumerate(d_items):
        table = pa.Table.from_pandas(chunk)
        if i == 0:
            schema1 = pa.Schema.from_pandas(chunk)
            schema2 = table.schema
#         print(table.field('param_type'))
        pq.write_to_dataset(table, root_path=output)
    
    # read dataset
    dataset = ds.dataset(output)
    
    # compare schemas
    print('Schemas are equal: ', dataset.schema == schema1 == schema2)
    print(dataset.schema.types)
    print('Should be string', dataset.schema.field('param_type'))    
    return dataset

dataset = foo()
dataset.to_table()

>>>Schemas are equal:  False
[DataType(int64), DataType(string), DataType(string), DataType(null), DataType(null), DataType(int64)]
Should be string pyarrow.Field<param_type: null>
---------------------------------------------------------------------------
ArrowTypeError: fields had matching names but differing types. From: category: string To: category: null

If you list the schemas of the individual files, you'll see that almost all of the parquet files ignored the pandas dtypes:

import os

for i in os.listdir('tmp.parquet/'):
    print(ds.dataset(os.path.join('tmp.parquet/', i)).schema.field('param_type'))

>>>pyarrow.Field<param_type: null>
pyarrow.Field<param_type: string>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: string>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: string>
pyarrow.Field<param_type: string>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: null>

But if we use a bigger chunk of data that contains non-NA values, everything is OK:

dataset = foo(chunksize=10000)
dataset.to_table()

>>>Schemas are equal:  True
[DataType(int64), DataType(string), DataType(string), DataType(string), DataType(string), DataType(int64)]
Should be string pyarrow.Field<param_type: string>
pyarrow.Table
itemid: int64
label: string
dbsource: string
category: string
param_type: string
row_id: int64

Check for NA values in the data:

pd.read_csv('D_ITEMS.csv', nrows=1000)['param_type'].unique()
>>>array([nan])

pd.read_csv('D_ITEMS.csv', nrows=10000)['param_type'].unique()
>>>array([nan, 'Numeric', 'Text', 'Date time', 'Solution', 'Process',
       'Checkbox'], dtype=object)

PS: switching issue reporting from GitHub to Jira is an outstanding move.

Issue Links

is duplicated by

ARROW-12078 The first table schema becomes a common schema for the full Dataset (Closed)

ARROW-12079 The first table schema becomes a common schema for the full Dataset (Closed)

People

Assignee: Unassigned

Reporter: Borys Kabakov

Votes: 0

Watchers: 3

Dates

Created: 24/Mar/21 21:31

Updated: 11/Jan/23 08:24