Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 3.5.1
- Fix Version/s: None
- Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
Description
When reading a DataFrame from a JSON file that contains a very long string value, Spark incorrectly flags the row as a corrupted record even though the JSON is well-formed. Here is a minimal reproduction with PySpark:
import json
import tempfile

from pyspark.sql import SparkSession

# Create a Spark session
spark = (
    SparkSession.builder
    .appName("PySpark JSON Example")
    .getOrCreate()
)

# Define the JSON content with a very long string value (100 million characters)
data = {
    "text": "a" * 100000000
}

# Create a temporary file and write the JSON content to it
with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
    tmp_file.write(json.dumps(data) + "\n")
    tmp_file_path = tmp_file.name

# Load the JSON file into a PySpark DataFrame
df = spark.read.json(tmp_file_path)

# Print the DataFrame; with the bug, the single row is reported as corrupt
print(df)
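For context, the misparse is visible in the printed DataFrame. The exact output below is an assumption based on Spark's default PERMISSIVE parse mode, which routes rows it cannot parse into a _corrupt_record column rather than failing the read; it is not output copied from the report:

Expected schema for the well-formed input: DataFrame[text: string]
Actual schema with the bug: DataFrame[_corrupt_record: string]

That the file itself really is valid JSON can be double-checked with the standard library (a sanity check added here, not part of the original script):

import json
with open(tmp_file_path) as f:
    json.loads(f.read())  # parses without error, so the file is well-formed JSON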