Details
-
Improvement
-
Status: Resolved
-
Minor
-
Resolution: Information Provided
-
2.3.1
-
None
-
None
Description
Dear Spark and PySpark maintainers,
I hope, that opening a JIRA issue is the correct way to request an improvement. If it's not, please forgive me and kindly instruct me on how to do it instead.
At our company, we are currently rewriting a lot of old MapReduce code with SPARK, and the following use-case is quite frequent: Some string-valued dataframe columns are JSON-arrays, and we want to parse them into array-typed columns.
Problem number 1: The from_json function accepts as a schema only StructType or ArrayType(StructType), but not an ArrayType of primitives. Submitting the schema in a string form like
{"containsNull":true,"elementType":"string","type":"array"}
does not work either, the error message says, among other things,
data type mismatch: Input schema array<string> must be a struct or an array of structs.
Problem number 2: Sometimes, in our JSON arrays we have elements of different types. For example, we might have some JSON array like
["string_value", 0, true, null]
which is JSON-valid with schema
{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}
(and, for instance the Python json.loads function has no problem parsing this), but such a schema is not recognized, at all. The error message gets quite unreadable after the words
ParseException: u'\nmismatched input
Here is some simple Python code to reproduce the problems (using pyspark 2.3.1 and pandas 0.23.4):
import pandas as pd from pyspark.sql import SparkSession import pyspark.sql.functions as F from pyspark.sql.types import StringType, ArrayType spark = SparkSession.builder.appName('test').getOrCreate() data = {'id' : [1,2,3], 'data' : ['["string1", true, null]', '["string2", false, null]', '["string3", true, "another_string3"]']} pdf = pd.DataFrame.from_dict(data) df = spark.createDataFrame(pdf) df.show() df = df.withColumn("parsed_data", F.from_json(F.col('data'), ArrayType(StringType()))) # Does not work, because not a struct of array of structs df = df.withColumn("parsed_data", F.from_json(F.col('data'), '{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}')) # Does not work at all
For now, we have to use a UDF function, which calls python's json.loads, but this is, for obvious reasons, suboptimal. If you could extend the functionality of the Spark from_json function in the next release, this would be really helpful. Thank you in advance!
==================
UPDATE: By the way, apparently the to_json function has the same problems: it cannot convert an array-typed column to a JSON-string. It would be nice for it to support arrays, as well. And, speaking of problem 2, an array column of different types cannot be even created in the first place.