Spark / SPARK-32285

Add PySpark support for nested timestamps with arrow


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      Currently, with Arrow optimizations enabled, timestamp columns are post-processed in pandas to localize the timezone. This post-processing is not done for timestamps nested inside columns such as StructType or ArrayType.

      Adding support for this is needed for the Apache Arrow 1.0.0 upgrade, due to the use of structs with timestamps in the grouped key over a window.

      As a simple first step, timestamps with one level of nesting could be handled first; this will satisfy the immediate need.

      NOTE: with Arrow 1.0.0, it might be possible to do the timezone processing with pyarrow.Array.cast, which could be easier than doing it in pandas.
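
      To illustrate the post-processing the description refers to, here is a minimal plain-Python sketch (not the actual PySpark implementation) of localizing naive UTC timestamps, including one level of nesting for struct (dict) and array (list) values. The session timezone and the helper names `localize` and `localize_nested` are assumptions made for this example.

```python
from datetime import datetime, timezone, timedelta

# Assumed session timezone for the example (Spark uses the configured
# spark.sql.session.timeZone; UTC-7 is an arbitrary stand-in here).
SESSION_TZ = timezone(timedelta(hours=-7))

def localize(value):
    """Interpret a naive timestamp as UTC, then convert to the session timezone."""
    return value.replace(tzinfo=timezone.utc).astimezone(SESSION_TZ)

def localize_nested(value):
    """Apply localization one nesting level deep, as the first step proposes."""
    if isinstance(value, dict):   # StructType row rendered as a dict
        return {k: localize(v) if isinstance(v, datetime) else v
                for k, v in value.items()}
    if isinstance(value, list):   # ArrayType column rendered as a list
        return [localize(v) if isinstance(v, datetime) else v for v in value]
    if isinstance(value, datetime):  # top-level timestamp, already handled today
        return localize(value)
    return value

# A struct with a timestamp field, as produced for a grouped key over a window.
row = localize_nested({"ts": datetime(2020, 7, 13, 12, 0, 0)})
```

      Deeper nesting (a struct inside an array, for example) would require recursing instead of the single-level checks above, which is why the issue scopes the first step to one level.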

      People

        Assignee: Unassigned
        Reporter: bryanc (Bryan Cutler)
        Votes: 0
        Watchers: 7
