  Apache Arrow
  ARROW-10140

[Python][C++] Add test for map column of a parquet file created from pyarrow and pandas


Details

    Description

      Hi,

      I'm having trouble reading Parquet files with a 'map' data type that were created by pyarrow.

      I followed https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries to convert a pandas DataFrame to an Arrow table, then called write_table to output a Parquet file:

      (We also referred to https://issues.apache.org/jira/browse/ARROW-9812)

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      
      print(f'PyArrow Version = {pa.__version__}')
      print(f'Pandas Version = {pd.__version__}')
      
      df = pd.DataFrame({
               'col1': pd.Series([
               [('id', 'something'), ('value2', 'else')],
               [('id', 'something2'), ('value', 'else2')],
               ]),
               'col2': pd.Series(['foo', 'bar'])
           })
      
      udt = pa.map_(pa.string(), pa.string())
      schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
      table = pa.Table.from_pandas(df, schema)
      pq.write_table(table, './test_map.parquet')
      

      The above code (attached as test_map.py) runs without errors on my development machine:

      PyArrow Version = 1.0.1
      Pandas Version = 1.1.2
      

      It generated the test_map.parquet file (attached) successfully.

      Then I used parquet-tools (1.11.1) to read the file, but the map keys and values are missing from the output:

      $ java -jar parquet-tools-1.11.1.jar head test_map.parquet
      col1:
      .key_value:
      .key_value:
      col2 = foo
      
      col1:
      .key_value:
      .key_value:
      col2 = bar
      

      I also checked the schema of the parquet file:

      $ java -jar parquet-tools-1.11.1.jar schema test_map.parquet
      message schema {
        optional group col1 (MAP) {
          repeated group key_value {
            required binary key (STRING);
            optional binary value (STRING);
          }
        }
        optional binary col2 (STRING);
      }

      Am I doing something wrong?

      We need to output the data to Parquet files and query them later.

      Attachments

        1. pyspark.snappy.parquet
          0.9 kB
          Ming Chen
        2. test_map_2.0.0.parquet
          3 kB
          Ming Chen
        3. test_map_200.parquet
          2 kB
          Ming Chen
        4. test_map.parquet
          2 kB
          Ming Chen
        5. test_map.py
          0.6 kB
          Ming Chen


      People

        Assignee: Alenka Frim
        Reporter: Ming Chen
        Votes: 0
        Watchers: 6


      Time Tracking

        Estimated: Not Specified
        Remaining: 0h
        Logged: 1h 10m