[KUDU-1276] Add a vectorized read/write interface for pandas DataFrame objects - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: client, python
Labels: None

Description

A pandas read/write interface would make Kudu significantly easier to use for typical Python data users.

The layering is as follows:

Writer: A "vectorized" insert that accepts a C/C++ array of values plus an array (of either bits or bytes) marking which nullable slots are null.
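A rough sketch of the input shape such a writer might take, using only the Python standard library. The names `pack_null_bitmap` and `vectorized_insert` are hypothetical and are not part of the Kudu client. The "insert" here only returns the rows it would write, and the bit order (bit i of the bitmap set means slot i is null) is an assumption made for illustration.

```python
from array import array


def pack_null_bitmap(is_null):
    """Pack booleans into a bitmap; bit i set means slot i is null (assumed layout)."""
    bitmap = bytearray((len(is_null) + 7) // 8)
    for i, null in enumerate(is_null):
        if null:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def vectorized_insert(column_name, values, null_bitmap):
    """Hypothetical stand-in for a vectorized insert.

    Takes a whole column at once (values buffer plus null bitmap) and
    returns the rows that would be written, instead of talking to Kudu.
    """
    rows = []
    for i, value in enumerate(values):
        null = (null_bitmap[i // 8] >> (i % 8)) & 1
        rows.append({column_name: None if null else value})
    return rows


# A contiguous int64 buffer; the value in a null slot is a placeholder.
values = array("q", [10, 0, 30])
bitmap = pack_null_bitmap([False, True, False])
rows = vectorized_insert("metric", values, bitmap)
```

Passing the column as one buffer plus one mask is what would let a real implementation loop over the data in C/C++ rather than crossing the Python boundary once per row.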

Reader: Converts a row batch to NumPy arrays with a missing-data representation suitable for use in pandas. Ideally it should create no more than one PyString object for each distinct string value observed. Binary can be encoded as a UTF-8 string, while Timestamp will need to be converted to nanoseconds for pandas.
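The reader conversions could look roughly like the following NumPy sketch. All function names are hypothetical, and the inputs (plain Python lists plus a boolean null list) stand in for a real row batch. It assumes nullable integers are promoted to float64 with NaN, which is one common pandas convention, and that Kudu timestamps arrive as microseconds since the Unix epoch.

```python
import numpy as np


def int_column_to_numpy(values, is_null):
    """Nullable integer column -> float64 array with NaN in null slots."""
    out = np.array(values, dtype=np.float64)  # np.array copies its input
    out[np.asarray(is_null, dtype=bool)] = np.nan
    return out


def string_column_to_numpy(raw_values, is_null):
    """UTF-8 byte strings -> object array, one str object per distinct value."""
    cache = {}  # raw bytes -> already-decoded str
    out = np.empty(len(raw_values), dtype=object)
    for i, (raw, null) in enumerate(zip(raw_values, is_null)):
        if null:
            out[i] = None
            continue
        decoded = cache.get(raw)
        if decoded is None:
            decoded = cache[raw] = raw.decode("utf-8")
        out[i] = decoded
    return out


def timestamp_column_to_numpy(micros, is_null):
    """Microseconds since the epoch -> datetime64[ns], with NaT for nulls."""
    as_us = np.array(micros, dtype=np.int64).view("datetime64[us]")
    out = as_us.astype("datetime64[ns]")
    out[np.asarray(is_null, dtype=bool)] = np.datetime64("NaT")
    return out


ints = int_column_to_numpy([1, 2, 3], [False, True, False])
strs = string_column_to_numpy([b"kudu", b"impala", b"kudu"], [False, False, False])
stamps = timestamp_column_to_numpy([1_000_000, 0], [False, True])
```

The dictionary cache means repeated values share a single Python string object, which matters for low-cardinality string columns.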

This would also give a very performant and relatively GIL-free data ingest path into Kudu (and Kudu consumers like Impala) without a great deal of Python+Cython coding.

Issue Links

is related to

KUDU-2077 Return data in Apache Arrow format (Reopened)

People

Assignee: Jordan Birdsell

Reporter: Wes McKinney

Votes: 3

Watchers: 9

Dates

Created: 08/Dec/15 04:46

Updated: 01/Jun/20 14:57