[ARROW-10140] [Python][C++] Add test for map column of a parquet file created from pyarrow and pandas - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.0.1
Fix Version/s: 7.0.0
Component/s: Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/26151

Description

Hi,

I'm having problems reading parquet files that contain a 'map' column created by pyarrow.

I followed https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries to convert a pandas DataFrame to an Arrow table, then called write_table to output a parquet file:

(We also referred to https://issues.apache.org/jira/browse/ARROW-9812)

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print(f'PyArrow Version = {pa.__version__}')
print(f'Pandas Version = {pd.__version__}')

# Each row of col1 is a list of (key, value) tuples, to be stored as a map.
df = pd.DataFrame({
    'col1': pd.Series([
        [('id', 'something'), ('value2', 'else')],
        [('id', 'something2'), ('value', 'else2')],
    ]),
    'col2': pd.Series(['foo', 'bar']),
})

# Declare col1 explicitly as map<string, string>; it would not be inferred as a map.
udt = pa.map_(pa.string(), pa.string())
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
table = pa.Table.from_pandas(df, schema=schema)
pq.write_table(table, './test_map.parquet')

The above code (attached as test_map.py) runs without errors on my development machine:

PyArrow Version = 1.0.1
Pandas Version = 1.1.2

It generated the test_map.parquet file (attached as test_map.parquet) successfully.

I then used parquet-tools (1.11.1) to read the file, but the keys and values of the map column are missing from the output:

$ java -jar parquet-tools-1.11.1.jar head test_map.parquet
col1:
.key_value:
.key_value:
col2 = foo

col1:
.key_value:
.key_value:
col2 = bar

I also checked the schema of the parquet file:

$ java -jar parquet-tools-1.11.1.jar schema test_map.parquet
message schema {
  optional group col1 (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional binary value (STRING);
    }
  }
  optional binary col2 (STRING);
}

Am I doing something wrong?

We need to output the data to parquet files, and query them later.

Attachments


pyspark.snappy.parquet    01/Oct/20 15:21    0.9 kB    Ming Chen
test_map_2.0.0.parquet    05/Oct/20 06:48    3 kB      Ming Chen
test_map_200.parquet      03/Nov/20 05:39    2 kB      Ming Chen
test_map.parquet          30/Sep/20 08:21    2 kB      Ming Chen
test_map.py               30/Sep/20 08:20    0.6 kB    Ming Chen

Issue Links

links to

GitHub Pull Request #12176

Activity

People

Assignee: Alenka Frim

Reporter: Ming Chen

Votes: 0

Watchers: 6

Dates

Created: 30/Sep/20 08:20

Updated: 11/Jan/23 08:11

Resolved: 18/Jan/22 15:33

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1h 10m