Details
Type: Task
Status: Closed
Priority: Minor
Resolution: Duplicate
Description
Here's a proposal to create a pyarrow.Table.from_pydict() function.
Right now only pyarrow.Table.from_pandas() exists, and there are inherent problems using pandas with NULL support for integers and booleans. See the section "NaN, Integer NA values and NA type promotions" in the pandas gotchas:
http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html
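To make the concern concrete, here is a minimal plain-Python sketch of what promoting a nullable integer column to floating point does to the data. It does not call pandas. The helper name `promote_to_float` is hypothetical and only mimics the integer-to-float promotion described in the linked section, assuming missing values become NaN.

```python
import math


def promote_to_float(values):
    """Mimic NA type promotion: ints become floats, None becomes NaN.

    Hypothetical helper for illustration only; not a pandas or pyarrow API.
    """
    return [float("nan") if v is None else float(v) for v in values]


ages = [10, None, 2**53 + 1]
promoted = promote_to_float(ages)

# The missing value is now NaN rather than a true NULL.
print(math.isnan(promoted[1]))

# Integers above 2**53 are not all exactly representable as 64-bit floats,
# so the round trip back to int does not recover the original value.
print(int(promoted[2]) == ages[2])
```

A conversion path that keeps the integer type and records NULLs separately avoids both effects, which is what the proposal is after.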
Sample python code on how this would work.
import pyarrow as pa
from datetime import datetime

# Truncate microseconds to millisecond precision.
# Parquet has better support for millisecond timestamps.
today = datetime.now()
today = today.replace(microsecond=today.microsecond - today.microsecond % 1000)

test_list = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today},
]


def from_pylist(pylist, schema=None, columns=None, safe=True):
    # Column names come from the schema when one is given.
    if schema:
        columns = schema.names
    if not columns:
        return None
    # Build one array per column; keys missing from a record become NULL.
    arrow_columns = [
        pa.array([v.get(column) for v in pylist], safe=safe)
        for column in columns
    ]
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    # Cast to the requested types only when a schema was supplied.
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table


# Explicit column list; 'dummy' appears in no record, so it is all NULL.
test = from_pylist(test_list, columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms')),
])

# Schema-driven conversion; columns are cast to the schema's types.
test2 = from_pylist(test_list, schema=test_schema)
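The core of the sample above is a row-to-column pivot. Below is a minimal sketch of just that step in plain Python, without pyarrow, so the NULL-filling behaviour is easy to see. The helper name `records_to_columns` is hypothetical and not part of any library.

```python
def records_to_columns(records, columns):
    """Pivot a list of dicts into a dict of equal-length column lists.

    Hypothetical helper for illustration. Keys missing from a record
    are filled with None.
    """
    return {name: [row.get(name) for row in records] for name in columns}


records = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
]

columns = records_to_columns(records, ["name", "age", "city"])
print(columns)
```

Each resulting list has one entry per record, so it could be handed to pa.array() as in the sample, with the None entries becoming NULLs.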
Issue Links
- is duplicated by: ARROW-6001 [Python] Add from_pylist() and to_pylist() to pyarrow.Table to convert list of records (Resolved)