Apache Arrow / ARROW-12080

[Python][Dataset] The first table schema becomes a common schema for the full Dataset


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Python

    Description

      The schema of the first table written becomes the common schema for the whole Dataset. This causes problems with sparse data.

      Consider the example below: when the first chunk is entirely NA, pyarrow ignores the pandas dtypes for the whole dataset:

      # get dataset
      !wget https://physionet.org/files/mimiciii-demo/1.4/D_ITEMS.csv
      
      
      import pandas as pd 
      import pyarrow.parquet as pq
      import pyarrow as pa
      import pyarrow.dataset as ds
      import shutil
      from pathlib import Path
      
      
      def foo(input_csv='D_ITEMS.csv', output='tmp.parquet', chunksize=1000):
          if Path(output).exists():
              shutil.rmtree(output)

          # write dataset
          d_items = pd.read_csv(input_csv, index_col='row_id',
                            usecols=['row_id', 'itemid', 'label', 'dbsource', 'category', 'param_type'],
                            dtype={'row_id': int, 'itemid': int, 'label': str, 'dbsource': str,
                                   'category': str, 'param_type': str}, chunksize=chunksize)

          for i, chunk in enumerate(d_items):
              table = pa.Table.from_pandas(chunk)
              if i == 0:
                  schema1 = pa.Schema.from_pandas(chunk)
                  schema2 = table.schema
      #         print(table.field('param_type'))
              pq.write_to_dataset(table, root_path=output)
          
          # read dataset
          dataset = ds.dataset(output)
          
          # compare schemas
          print('Schemas are equal: ', dataset.schema == schema1 == schema2)
          print(dataset.schema.types)
          print('Should be string', dataset.schema.field('param_type'))    
          return dataset
      
      dataset = foo()
      dataset.to_table()
      
      >>>Schemas are equal:  False
      [DataType(int64), DataType(string), DataType(string), DataType(null), DataType(null), DataType(int64)]
      Should be string pyarrow.Field<param_type: null>
      ---------------------------------------------------------------------------
      ArrowTypeError: fields had matching names but differing types. From: category: string To: category: null

      Listing the schema of each Parquet file shows that most of them ignored the pandas dtypes:

      import os
      
      for i in os.listdir('tmp.parquet/'):
          print(ds.dataset(os.path.join('tmp.parquet/', i)).schema.field('param_type'))
      
      >>>pyarrow.Field<param_type: null>
      pyarrow.Field<param_type: string>
      pyarrow.Field<param_type: null>
      pyarrow.Field<param_type: null>
      pyarrow.Field<param_type: null>
      pyarrow.Field<param_type: null>
      pyarrow.Field<param_type: null>
      pyarrow.Field<param_type: string>
      pyarrow.Field<param_type: null>
      pyarrow.Field<param_type: string>
      pyarrow.Field<param_type: string>
      pyarrow.Field<param_type: null>
      pyarrow.Field<param_type: null>
      

      But with a bigger chunk size, so that the first chunk contains non-NA values, everything is OK:

      dataset = foo(chunksize=10000)
      dataset.to_table()
      
      >>>Schemas are equal:  True
      [DataType(int64), DataType(string), DataType(string), DataType(string), DataType(string), DataType(int64)]
      Should be string pyarrow.Field<param_type: string>
      pyarrow.Table
      itemid: int64
      label: string
      dbsource: string
      category: string
      param_type: string
      row_id: int64
      

      Checking for NA values in the data:

      pd.read_csv('D_ITEMS.csv', nrows=1000)['param_type'].unique()
      >>>array([nan])
      
      pd.read_csv('D_ITEMS.csv', nrows=10000)['param_type'].unique()
      >>>array([nan, 'Numeric', 'Text', 'Date time', 'Solution', 'Process',
             'Checkbox'], dtype=object)
      

       

       PS: switching issue reporting from GitHub to Jira is an outstanding move.

       


            People

              Assignee: Unassigned
              Reporter: Borys Kabakov (banderlog)