SPARK-48689

Reading lengthy JSON results in a corrupted record.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.5.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22

    Description

      When reading a DataFrame from a JSON file that contains a very long string, Spark incorrectly flags the record as corrupt even though the JSON is well-formed. Here is a minimal example with PySpark:

      import json
      import tempfile

      from pyspark.sql import SparkSession

      # Create a Spark session
      spark = (
          SparkSession.builder
          .appName("PySpark JSON Example")
          .getOrCreate()
      )

      # Define the JSON content: a single field holding a 100,000,000-character string
      data = {
          "text": "a" * 100000000
      }

      # Write the JSON content to a temporary file
      with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
          tmp_file.write(json.dumps(data) + "\n")
          tmp_file_path = tmp_file.name

      # Load the JSON file into a PySpark DataFrame (after the file is closed and flushed)
      df = spark.read.json(tmp_file_path)

      # Print the DataFrame; on affected versions the inferred schema is a single
      # _corrupt_record column instead of the expected `text` field
      print(df)
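
      As a sanity check, the sketch below (reusing tmp_file_path and df from the example above; exact console output may vary by version) shows that the same file parses cleanly with Python's standard json module, isolating the misparse to Spark:

      # The file is well-formed JSON: Python's stdlib parses it without error.
      with open(tmp_file_path) as f:
          parsed = json.load(f)
      print(len(parsed["text"]))  # 100000000

      # Spark, by contrast, falls back to the corrupt-record column instead of
      # inferring the expected `text: string` schema.
      df.printSchema()
      print(df.columns)  # expected ['text'], but 3.5.1 reports ['_corrupt_record']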

          People

            Assignee: Unassigned
            Reporter: Yuxiang Wei (universefly)
            Votes: 0
            Watchers: 2
