Spark / SPARK-25225

Add support for "List"-Type columns


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: PySpark, Spark Core
    • Labels: None

    Description

      At the moment, Spark DataFrame ArrayType columns require all elements of the array to be of the same data type.

      At our company, we are currently rewriting old MapReduce code with Spark. One of the frequent use cases is aggregating data into a timeseries:

      Example input:

      ID	date		data
      1	2017-01-01	data_1_1
      1	2018-02-02	data_1_2
      2	2017-03-03	data_2_1
      2	2018-04-04	data_2_2
      ...
      

      Expected output:

      ID	timeseries
      1	[[2017-01-01, data_1_1],[2018-02-02, data_1_2]]
      2	[[2017-03-03, data_2_1],[2018-04-04, data_2_2]]
      ...
      

      Here, the values in the data column of the input are, in most cases, not primitive, but are, for example, lists, dicts, nested lists, etc. Spark, however, does not support creating an array column out of a string column and a non-string column.

      We would like to kindly ask you to implement one of the following:

      1. Extend ArrayType to support elements of different data type

      2. Introduce a new container type (ListType?) which would support elements of different type

      UPDATE: The background here is that I want to be able to parse JSON arrays of differently-typed elements into Spark DataFrame columns, as well as create JSON arrays from such columns. See also SPARK-25226 and SPARK-25227.


          People

            Assignee: Unassigned
            Reporter: Yuriy Davygora (davygora)
            Votes: 0
            Watchers: 4
