
SPARK-25109: Spark Python should retry reading another datanode if the first one fails to connect


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: PySpark

Description

      We use this code to read Parquet files from HDFS:

      spark.read.parquet('xxx')

      and get an error as shown below:


      What we can tell is that one of the replica blocks cannot be read for some reason, but PySpark does not try another replica that can be read successfully, so the application fails with an exception.

      When I read the file with hadoop fs -text, I get the content correctly. It would be great if PySpark could retry reading another replica block instead of failing.
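      As a client-side workaround, the read can be wrapped in a simple retry loop. The sketch below is only an illustration, not part of the Spark API: the helper name, retry count, and backoff are made up, it only covers the driver-side read (failures inside executor tasks are instead retried according to spark.task.maxFailures), and since replica selection happens inside the HDFS client, a retry may still land on the same bad datanode.

      import time

      from py4j.protocol import Py4JJavaError
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # Hypothetical helper: retry the driver-side read a few times in the
      # hope that a later attempt is served from a healthy replica.
      def read_parquet_with_retry(path, attempts=3, backoff_seconds=5):
          last_error = None
          for _ in range(attempts):
              try:
                  return spark.read.parquet(path)
              except Py4JJavaError as error:  # JVM-side connection errors
                  last_error = error
                  time.sleep(backoff_seconds)
          raise last_error

      df = read_parquet_with_retry('xxx')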

       

Attachments

Activity

People

    Assignee: Unassigned
    Reporter: Yuanbo Liu (yuanbo)
    Votes: 0
    Watchers: 1

Dates

    Created:
    Updated:
    Resolved: