SPARK-25195: Extending from_json function


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Information Provided
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: PySpark, Spark Core
    • Labels: None

    Description

      Dear Spark and PySpark maintainers,

      I hope that opening a JIRA issue is the correct way to request an improvement. If it's not, please forgive me and kindly instruct me on how to do it instead.

      At our company, we are currently rewriting a lot of old MapReduce code in Spark, and the following use case is quite frequent: some string-valued dataframe columns contain JSON arrays, and we want to parse them into array-typed columns.

      Problem number 1: The from_json function accepts as its schema only StructType or ArrayType(StructType), but not an ArrayType of primitives. Submitting the schema in string form, such as

      {"containsNull":true,"elementType":"string","type":"array"}

      does not work either; the error message says, among other things,

      data type mismatch: Input schema array<string> must be a struct or an array of structs.
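      One conceivable pure-Spark workaround (a sketch, not something we have battle-tested) is to wrap the array in a single-field JSON object so that the schema becomes a StructType. This assumes Spark 2.3.x and the 'data' column from the reproduction below:

      import pyspark.sql.functions as F
      from pyspark.sql.types import StructType, StructField, ArrayType, StringType

      # Wrap '["a", "b"]' into '{"arr": ["a", "b"]}' so that from_json
      # sees a struct; then extract the nested array field again.
      wrapped = F.concat(F.lit('{"arr":'), F.col('data'), F.lit('}'))
      schema = StructType([StructField('arr', ArrayType(StringType()))])
      df = df.withColumn('parsed_data', F.from_json(wrapped, schema).getField('arr'))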

      Problem number 2: Sometimes the elements of our JSON arrays have different types. For example, we might have a JSON array like

      ["string_value", 0, true, null]

      which is valid JSON and matches the schema

      {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}

      (and, for instance, Python's json.loads function has no problem parsing this, as shown below), but such a schema is not recognized at all. The error message becomes quite unreadable after the words

      ParseException: u'\nmismatched input
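      For reference, plain Python handles the heterogeneous array without complaint:

      import json

      # json.loads maps the JSON types to str, int, bool and None:
      json.loads('["string_value", 0, true, null]')  # ['string_value', 0, True, None]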

      Here is some simple Python code that reproduces both problems (using pyspark 2.3.1 and pandas 0.23.4):

      import pandas as pd

      from pyspark.sql import SparkSession
      import pyspark.sql.functions as F
      from pyspark.sql.types import StringType, ArrayType

      spark = SparkSession.builder.appName('test').getOrCreate()

      data = {'id': [1, 2, 3],
              'data': ['["string1", true, null]',
                       '["string2", false, null]',
                       '["string3", true, "another_string3"]']}
      pdf = pd.DataFrame.from_dict(data)
      df = spark.createDataFrame(pdf)
      df.show()

      # Problem 1: raises an AnalysisException, because the schema is not
      # a struct or an array of structs:
      df = df.withColumn("parsed_data", F.from_json(F.col('data'),
          ArrayType(StringType())))

      # Problem 2: raises a ParseException; the schema string is not recognized at all:
      df = df.withColumn("parsed_data", F.from_json(F.col('data'),
          '{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}'))
        

      For now, we have to use a UDF that calls Python's json.loads (see the sketch below), but this is suboptimal for obvious reasons: the data has to be serialized between the JVM and the Python workers, and Catalyst cannot optimize an opaque UDF. If you could extend the functionality of the Spark from_json function in the next release, this would be really helpful. Thank you in advance!
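      A minimal sketch of that UDF workaround (the function name is ours; elements are stringified so that the result fits a uniform array<string>, which loses the original element types):

      import json
      import pyspark.sql.functions as F
      from pyspark.sql.types import ArrayType, StringType

      # Current workaround: fall back to Python's json.loads inside a UDF.
      # Every element is converted to a string, so mixed-type arrays fit
      # into array<string>, at the cost of the original element types.
      @F.udf(returnType=ArrayType(StringType()))
      def parse_json_array(s):
          if s is None:
              return None
          return [None if e is None else str(e) for e in json.loads(s)]

      df = df.withColumn('parsed_data', parse_json_array(F.col('data')))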

      ==================

      UPDATE: By the way, the to_json function apparently has the same problem: it cannot convert an array-typed column to a JSON string. It would be nice for it to support arrays as well. And, speaking of problem 2, an array column with elements of different types cannot even be created in the first place.
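      A sketch of both to_json issues (same Spark session as above; the column names are hypothetical):

      # to_json rejects an array of primitives in 2.3.1 (AnalysisException
      # asking for a struct, an array of structs, or a map):
      df2 = spark.createDataFrame([(1, ['a', 'b'])], ['id', 'arr'])
      df2 = df2.withColumn('as_json', F.to_json(F.col('arr')))

      # And a heterogeneous array cannot even be constructed; this fails with
      # a data type mismatch, because array() requires a single element type:
      df2 = df2.withColumn('mixed', F.array(F.lit('a'), F.lit(1)))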

People

    • Assignee: Unassigned
    • Reporter: Yuriy Davygora (davygora)
    • Votes: 0
    • Watchers: 5
