Spark / SPARK-40284

Spark concurrent overwrite-mode writes to the same HDFS path: data from all requests is written successfully


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 3.0.1
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None

    Description

      We use Spark as a service: the same Spark service has to handle multiple requests, and I have run into a problem with this.

      When multiple requests overwrite the same directory at the same time, the results of both overwrite requests may be written successfully. I do not think this matches the semantics of overwrite mode.

      First I submitted write SQL1, then write SQL2, and found that the data from both was present at the end, which seems unreasonable.

      import org.apache.spark.sql.SaveMode

      // register a UDF that sleeps, returning the sleep time so the column has a usable type
      sparkSession.udf.register("sleep", (time: Long) => { Thread.sleep(time); time })

      // write sql1 (sleeps 40 seconds inside the query, so it finishes after sql2)
      sparkSession.sql("select 1 as id, sleep(40000) as time").write.mode(SaveMode.Overwrite).parquet("path")

      // write sql2
      sparkSession.sql("select 2 as id, 1 as time").write.mode(SaveMode.Overwrite).parquet("path")

      Reading the Spark source, I saw that all of this logic lives in the InsertIntoHadoopFsRelationCommand class.
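      As far as I can tell, the overwrite branch there boils down to the following (a paraphrased sketch, not the actual Spark source): if the target path exists it is deleted first, and only afterwards does the job stage its output under its own _temporary directory.

      import org.apache.hadoop.fs.Path

      // Paraphrased sketch of the SaveMode.Overwrite handling in
      // InsertIntoHadoopFsRelationCommand (simplified, not the real code).
      val outputPath = new Path("path")
      val fs = outputPath.getFileSystem(sparkSession.sparkContext.hadoopConfiguration)

      if (fs.exists(outputPath)) {
        // Overwrite: recursively delete whatever is already at the target.
        fs.delete(outputPath, true)
      }
      // ...then the write job runs and stages its task output under
      // <outputPath>/_temporary/<jobId>/ before the committer moves it into place.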

       

      When the target directory already exists, Spark simply deletes it and then writes into the _temporary directory created for its own job. However, when multiple requests write at the same time, the data from all of them can end up in the target directory together. For the write SQLs above, the following sequence occurs:

      1. Write SQL1 is executed: Spark creates the _temporary directory for SQL1 under the target path and keeps running its tasks.

      2. Write SQL2 is executed: Spark deletes the entire target directory (including SQL1's _temporary) and creates its own _temporary directory.

      3. SQL2 writes and commits its data.

      4. SQL1 finishes its computation, but its _temporary/0/attempt_id directory no longer exists, so the task fails. The task is retried, and because the _temporary directory is not deleted before the retry, SQL1's result ends up appended to the target directory next to SQL2's.
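      After both jobs finish, step 4 is visible in the target directory itself: it contains part files from both jobs, and reading it back returns rows from both queries even though each write used SaveMode.Overwrite. A quick check:

      // With the race above, both id = 1 and id = 2 show up in the result,
      // even though both writes used SaveMode.Overwrite.
      sparkSession.read.parquet("path").select("id").distinct().show()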

       

      Based on the above write process, could Spark do a directory check before the write task commits, or use some other mechanism, to avoid this kind of problem?
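      Until there is such a check inside Spark, the only mitigation I can see on the service side is to serialize overwrites to the same path before Spark deletes the directory. A rough sketch, assuming all writes to that path go through a single driver JVM (the OverwriteLocks helper is made up for illustration):

      import java.util.concurrent.ConcurrentHashMap
      import org.apache.spark.sql.{DataFrame, SaveMode}

      // Hypothetical per-path lock held only inside this service's JVM; it does
      // not protect against other drivers writing to the same path.
      object OverwriteLocks {
        private val locks = new ConcurrentHashMap[String, Object]()

        def overwriteParquet(df: DataFrame, path: String): Unit = {
          val lock = locks.computeIfAbsent(path, _ => new Object)
          lock.synchronized {
            df.write.mode(SaveMode.Overwrite).parquet(path)
          }
        }
      }

      // usage:
      // OverwriteLocks.overwriteParquet(sparkSession.sql("select 2 as id, 1 as time"), "path")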


      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: weiguang Liu
