SPARK-48689

Reading lengthy JSON results in a corrupted record.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.5.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22

    Description

      When reading a DataFrame from a JSON file that contains a very long string, Spark incorrectly flags the record as corrupt even though the JSON is well-formed. Here is a minimal example with PySpark:

      import json
      import tempfile

      from pyspark.sql import SparkSession

      # Create a Spark session
      spark = (
          SparkSession.builder
          .appName("PySpark JSON Example")
          .getOrCreate()
      )

      # Define the JSON content: a single field holding a 100,000,000-character string
      data = {
          "text": "a" * 100000000
      }

      # Write the JSON content to a temporary file
      with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
          tmp_file.write(json.dumps(data) + "\n")
          tmp_file_path = tmp_file.name

      # Load the JSON file into a PySpark DataFrame (after the file is closed and flushed)
      df = spark.read.json(tmp_file_path)

      # Print the DataFrame; on affected versions the inferred schema is a single
      # _corrupt_record column instead of the expected `text` field
      print(df)
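
      As a sanity check, the sketch below (reusing tmp_file_path and df from the example above; exact console output may vary by version) shows that the same file parses cleanly with Python's standard json module, isolating the misparse to Spark:

      # The file is well-formed JSON: Python's stdlib parses it without error.
      with open(tmp_file_path) as f:
          parsed = json.load(f)
      print(len(parsed["text"]))  # 100000000

      # Spark, by contrast, falls back to the corrupt-record column instead of
      # inferring the expected `text: string` schema.
      df.printSchema()
      print(df.columns)  # expected ['text'], but 3.5.1 reports ['_corrupt_record']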

          People

            Assignee: Unassigned
            Reporter: Yuxiang Wei (universefly)
            Votes: 0
            Watchers: 2
