Spark / SPARK-25225

Add support for "List"-Type columns


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: PySpark, Spark Core
    • Labels: None

    Description

      At the moment, Spark DataFrame ArrayType columns require all elements of the array to be of the same data type.

      At our company, we are currently rewriting old MapReduce code with Spark. One of the frequent use cases is aggregating data into a timeseries:

      Example input:

      ID	date		data
      1	2017-01-01	data_1_1
      1	2018-02-02	data_1_2
      2	2017-03-03	data_2_1
      2	2018-04-04	data_2_2
      ...
      

      Expected output:

      ID	timeseries
      1	[[2017-01-01, data_1_1],[2018-02-02, data_1_2]]
      2	[[2017-03-03, data_2_1],[2018-04-04, data_2_2]]
      ...
      

      Here, the values in the data column of the input are, in most cases, not primitive, but are, for example, lists, dicts, nested lists, etc. Spark, however, does not support creating an array column out of a string column and a non-string column.

      We would like to kindly ask you to implement one of the following:

      1. Extend ArrayType to support elements of different data type

      2. Introduce a new container type (ListType?) which would support elements of different type

      UPDATE: The background here is that I want to be able to parse JSON arrays of differently-typed elements into Spark DataFrame columns, as well as create JSON arrays from such columns. See also SPARK-25226 and SPARK-25227.


          People

            Assignee: Unassigned
            Reporter: Yuriy Davygora (davygora)
            Votes: 0
            Watchers: 4
