Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-8599

Improve the failure behavior of the FileInputFormat for bad files

    XMLWordPrintableJSON

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.3.2, 1.4.0
    • Fix Version/s: None
    • Component/s: API / DataStream
    • Labels:
      None

      Description

      So we have a s3 path that flink is monitoring that path to see new files available.

      val avroInputStream_activity = env.readFile(format, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 10000)  

       
      I am doing both internal and external check pointing and let's say there is a bad file (for example, a different schema been dropped in this folder) came to the path and flink will do several retries. I want to take those bad files and let the process continue. However, since the file path persist in the checkpoint, when I try to resume from external checkpoint, it threw the following error on no file been found.
       

      java.io.IOException: Error opening the Input Split s3a://myfile [0,904]: No such file or directory: s3a://myfile

       
      As Fabian Hueske suggested, we could check if a path exists and before trying to read a file and ignore the input split instead of throwing an exception and causing a failure.
       
      Also, I am thinking about add an error output for bad files as an option to users. So if there is any bad files exist we could move them in a separated path and do further analysis. 
       
      Not sure how people feel about it, but I'd like to contribute on it if people think this can be an improvement. 

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                ChengzhiZhao Chengzhi Zhao
              • Votes:
                0 Vote for this issue
                Watchers:
                6 Start watching this issue

                Dates

                • Created:
                  Updated: