[SPARK-25195] Extending from_json function - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Information Provided
Affects Version/s: 2.3.1
Fix Version/s: None
Component/s: PySpark, Spark Core
Labels: None

Description

Dear Spark and PySpark maintainers,

I hope that opening a JIRA issue is the correct way to request an improvement. If it is not, please forgive me and kindly tell me how to do it instead.

At our company, we are currently rewriting a lot of old MapReduce code in Spark, and the following use case comes up frequently: some string-valued DataFrame columns contain JSON arrays, and we want to parse them into array-typed columns.

Problem number 1: The from_json function accepts only a StructType or an ArrayType(StructType) as its schema, but not an ArrayType of primitives. Submitting the schema in string form, like

{"containsNull":true,"elementType":"string","type":"array"}

does not work either. The error message says, among other things:

data type mismatch: Input schema array<string> must be a struct or an array of structs.

Problem number 2: Sometimes, in our JSON arrays we have elements of different types. For example, we might have some JSON array like

["string_value", 0, true, null]

which is valid JSON and could be described by a schema like

{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}

(and, for instance, the Python json.loads function has no problem parsing this array), but such a schema is not recognized at all. The error message becomes hard to read after the words

ParseException: u'\nmismatched input
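For comparison, here is a minimal sketch of how the standard-library parser handles the example array above. Only json.loads is used; the variable names are illustrative.

```python
import json

# The mixed-type array from the example above
raw = '["string_value", 0, true, null]'

parsed = json.loads(raw)

# Each element keeps its own Python type after parsing
element_types = [type(x).__name__ for x in parsed]

print(parsed)
print(element_types)
```

The parsed list holds a str, an int, a bool and None side by side, which is exactly what a single-typed Spark array column cannot represent.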

Here is some simple Python code to reproduce the problems (using pyspark 2.3.1 and pandas 0.23.4):

import pandas as pd

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, ArrayType

spark = SparkSession.builder.appName('test').getOrCreate()

data = {'id' : [1,2,3], 'data' : ['["string1", true, null]', '["string2", false, null]', '["string3", true, "another_string3"]']}
pdf = pd.DataFrame.from_dict(data)
df = spark.createDataFrame(pdf)
df.show()

df = df.withColumn("parsed_data", F.from_json(F.col('data'),
    ArrayType(StringType()))) # Does not work, because the schema is neither a struct nor an array of structs

df = df.withColumn("parsed_data", F.from_json(F.col('data'),
    '{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}')) # Does not work at all

For now, we have to use a UDF that calls Python's json.loads, but this is suboptimal because every row has to be serialized to a Python worker and back. If you could extend the functionality of the Spark from_json function in the next release, this would be really helpful. Thank you in advance!
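The workaround could look roughly like the sketch below. It is an illustration, not our exact production code: the names parse_json_array and make_parse_udf are hypothetical, and turning non-string elements back into their JSON text is one possible way to fit mixed-type arrays into array<string>.

```python
import json


def parse_json_array(raw):
    """Parse a JSON array string into a list of strings (JSON null stays None).

    Returns None when the input is missing, malformed, or not a JSON array.
    """
    if raw is None:
        return None
    try:
        values = json.loads(raw)
    except (TypeError, ValueError):
        return None
    if not isinstance(values, list):
        return None
    # Re-serialize non-string elements so that every element fits array<string>
    return [v if v is None or isinstance(v, str) else json.dumps(v) for v in values]


def make_parse_udf():
    """Wrap parse_json_array as a Spark UDF; pyspark is imported lazily."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType
    return udf(parse_json_array, ArrayType(StringType()))


# Usage, assuming the DataFrame `df` from the reproduction code above:
# df = df.withColumn("parsed_data", make_parse_udf()(df["data"]))
```

The parsing logic is kept in a plain function so that it can be tested without a Spark session; only make_parse_udf needs pyspark.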

==================

UPDATE: By the way, the to_json function apparently has the same problem: it cannot convert an array-typed column to a JSON string. It would be nice for it to support arrays as well. And, speaking of problem 2, an array column with elements of different types cannot even be created in the first place.

People

Assignee: Unassigned

Reporter: Yuriy Davygora

Votes: 0

Watchers: 5

Dates

Created: 22/Aug/18 12:51

Updated: 24/Aug/18 10:57

Resolved: 24/Aug/18 10:36