Spark / SPARK-32285

Add PySpark support for nested timestamps with arrow


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      Currently, with Arrow optimizations enabled, timestamp columns are post-processed in pandas to localize the timezone. This post-processing is not done for timestamps nested inside columns such as StructType or ArrayType.

      Adding support for this is needed for the Apache Arrow 1.0.0 upgrade, due to the use of structs with timestamps in the grouped key over a window.

      As a simple first step, timestamps with one level of nesting could be handled first; this will satisfy the immediate need.

      NOTE: with Arrow 1.0.0, it might be possible to do the timezone processing with pyarrow.Array.cast, which could be easier than doing it in pandas.
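
      To illustrate the post-processing the description refers to, here is a minimal plain-Python sketch (not the actual PySpark implementation) of localizing naive UTC timestamps, including one level of nesting for struct (dict) and array (list) values. The session timezone and the helper names `localize` and `localize_nested` are assumptions made for this example.

```python
from datetime import datetime, timezone, timedelta

# Assumed session timezone for the example (Spark uses the configured
# spark.sql.session.timeZone; UTC-7 is an arbitrary stand-in here).
SESSION_TZ = timezone(timedelta(hours=-7))

def localize(value):
    """Interpret a naive timestamp as UTC, then convert to the session timezone."""
    return value.replace(tzinfo=timezone.utc).astimezone(SESSION_TZ)

def localize_nested(value):
    """Apply localization one nesting level deep, as the first step proposes."""
    if isinstance(value, dict):   # StructType row rendered as a dict
        return {k: localize(v) if isinstance(v, datetime) else v
                for k, v in value.items()}
    if isinstance(value, list):   # ArrayType column rendered as a list
        return [localize(v) if isinstance(v, datetime) else v for v in value]
    if isinstance(value, datetime):  # top-level timestamp, already handled today
        return localize(value)
    return value

# A struct with a timestamp field, as produced for a grouped key over a window.
row = localize_nested({"ts": datetime(2020, 7, 13, 12, 0, 0)})
```

      Deeper nesting (a struct inside an array, for example) would require recursing instead of the single-level checks above, which is why the issue scopes the first step to one level.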

      People

        Assignee: Unassigned
        Reporter: bryanc (Bryan Cutler)
        Votes: 0
        Watchers: 7
