Spark / SPARK-25109

PySpark should retry reading from another datanode if the first one fails to connect


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels:

      Description

We use this code to read Parquet files from HDFS:

      spark.read.parquet('xxx')

and get the error shown below:

(error stack trace attached as a screenshot; see Attachments)

From the error we can tell that one of the replica blocks cannot be read for some reason, but PySpark does not try to read another replica that is readable, so the application fails with an exception.

When I read the same file with hadoop fs -text, I get its content correctly, so at least one healthy replica exists. It would be great if PySpark retried reading another replica block instead of failing; an application-side stopgap is sketched below.
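Until a retry like this lands in the Spark/HDFS layer, one crude workaround is to retry the whole read from the application. This is a minimal sketch, assuming a SparkSession named spark; the helper name, attempt count, and delay are made up for illustration, and it only helps when a later attempt happens to be served from a healthy replica:

import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def read_parquet_with_retry(path, attempts=3, delay_s=5):
    # Application-level stopgap: retry the whole read a few times.
    # The per-replica failover requested in this issue would have to
    # happen inside the HDFS client that Spark uses, not here.
    last_err = None
    for _ in range(attempts):
        try:
            return spark.read.parquet(path)
        except Exception as err:  # Py4J surfaces the underlying Java exception here
            last_err = err
            time.sleep(delay_s)
    raise last_err

df = read_parquet_with_retry('xxx')  # path elided as in the report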


Attachments

Activity

People

• Assignee: Unassigned
• Reporter: yuanbo (Yuanbo Liu)
• Votes: 0
• Watchers: 2
