Flink / FLINK-7590

Flink failed to flush and close the file system output stream for checkpointing because of an S3 read timeout

Details

    Description

      A Flink job failed once over the weekend with the following error. It recovered on its own afterwards and has been running well since, but the issue might be worth taking a look at.

      2017-09-03 13:18:38,998 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - reduce (14/18) (c97256badc87e995d456e7a13cec5de9) switched from RUNNING to FAILED.
      AsynchronousException{java.lang.Exception: Could not materialize checkpoint 163 for operator reduce (14/18).}
      	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:970)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      Caused by: java.lang.Exception: Could not materialize checkpoint 163 for operator reduce (14/18).
      	... 6 more
      Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to s3://xxx/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa in order to obtain the stream state handle
      	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
      	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
      	at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
      	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:897)
      	... 5 more
      	Suppressed: java.lang.Exception: Could not properly cancel managed keyed state future.
      		at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:90)
      		at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:1023)
      		at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:961)
      		... 5 more
      	Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to s3://xxx/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa in order to obtain the stream state handle
      		at java.util.concurrent.FutureTask.report(FutureTask.java:122)
      		at java.util.concurrent.FutureTask.get(FutureTask.java:192)
      		at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
      		at org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:85)
      		at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:88)
      		... 7 more
      	Caused by: java.io.IOException: Could not flush and close the file system output stream to s3://xxx/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa in order to obtain the stream state handle
      		at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:336)
      		at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBFullSnapshotOperation.closeSnapshotStreamAndGetHandle(RocksDBKeyedStateBackend.java:693)
      		at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBFullSnapshotOperation.closeCheckpointStream(RocksDBKeyedStateBackend.java:531)
      		at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$3.performOperation(RocksDBKeyedStateBackend.java:420)
      		at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$3.performOperation(RocksDBKeyedStateBackend.java:399)
      		at org.apache.flink.runtime.io.async.AbstractAsyncIOCallable.call(AbstractAsyncIOCallable.java:72)
      		at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      		at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40)
      		at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:897)
      		... 5 more
      	Caused by: com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$CompleteMultipartUploadHandler). Response Code: 200, Response Text: OK
      		at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738)
      		at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399)
      		at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
      		at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
      		at com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:2524)
      		at com.amazonaws.services.s3.transfer.internal.UploadMonitor.completeMultipartUpload(UploadMonitor.java:236)
      		at com.amazonaws.services.s3.transfer.internal.UploadMonitor.poll(UploadMonitor.java:183)
      		at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:152)
      		at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:50)
      		... 4 more
      	Caused by: com.amazonaws.AmazonClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$CompleteMultipartUploadHandler
      		at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:150)
      		at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseCompleteMultipartUploadResponse(XmlResponsesSaxParser.java:425)
      		at com.amazonaws.services.s3.model.transform.Unmarshallers$CompleteMultipartUploadResultUnmarshaller.unmarshall(Unmarshallers.java:200)
      		at com.amazonaws.services.s3.model.transform.Unmarshallers$CompleteMultipartUploadResultUnmarshaller.unmarshall(Unmarshallers.java:197)
      		at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
      		at com.amazonaws.services.s3.internal.ResponseHeaderHandlerChain.handle(ResponseHeaderHandlerChain.java:44)
      		at com.amazonaws.services.s3.internal.ResponseHeaderHandlerChain.handle(ResponseHeaderHandlerChain.java:30)
      		at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:712)
      		... 12 more
      	Caused by: java.net.SocketTimeoutException: Read timed out
      		at java.net.SocketInputStream.socketRead0(Native Method)
      		at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
      		at java.net.SocketInputStream.read(SocketInputStream.java:171)
      		at java.net.SocketInputStream.read(SocketInputStream.java:141)
      		at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
      		at sun.security.ssl.InputRecord.read(InputRecord.java:503)
      		at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
      		at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
      		at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
      		at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
      		at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
      		at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
      		at org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:266)
      		at org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227)
      		at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186)
      		at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
      		at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
      		at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
      		at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
      		at java.io.InputStreamReader.read(InputStreamReader.java:184)
      		at java.io.BufferedReader.fill(BufferedReader.java:161)
      		at java.io.BufferedReader.read1(BufferedReader.java:212)
      		at java.io.BufferedReader.read(BufferedReader.java:286)
      		at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
      		at org.apache.xerces.impl.XMLEntityScanner.skipSpaces(Unknown Source)
      		at org.apache.xerces.impl.XMLDocumentScannerImpl$TrailingMiscDispatcher.dispatch(Unknown Source)
      		at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
      		at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
      		at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
      		at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
      		at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
      		at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:141)
      		... 19 more
      	[CIRCULAR REFERENCE:java.io.IOException: Could not flush and close the file system output stream to s3://xxx/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa in order to obtain the stream state handle]
      2017-09-03 13:18:39,000 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job com.offerup.stream_processing.item_view_stats.ItemViewStatsStreamingApp (aac822203a47d504ecd9b73a77c60cd5) switched from state RUNNING to FAILING.
      AsynchronousException{java.lang.Exception: Could not materialize checkpoint 163 for operator reduce (14/18).}
      	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:970)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      Caused by: java.lang.Exception: Could not materialize checkpoint 163 for operator reduce (14/18).
      	... 6 more
      Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to s3://xxx/aac822203a47d504ecd9b73a77c60cd5/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa in order to obtain the stream state handle
      	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
      	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
      	at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
      	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:897)
      	... 5 more
      	Suppressed: java.lang.Exception: Could not properly cancel managed keyed state future.
      		at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:90)
      		at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:1023)
      		at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:961)
      		... 5 more
      	Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to s3://xxxx/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa in order to obtain the stream state handle
      		at java.util.concurrent.FutureTask.report(FutureTask.java:122)
      		at java.util.concurrent.FutureTask.get(FutureTask.java:192)
      		at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
      		at org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:85)
      		at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:88)
      		... 7 more
      	Caused by: java.io.IOException: Could not flush and close the file system output stream to s3://xxx/aac822203a47d504ecd9b73a77c60cd5/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa in order to obtain the stream state handle
      		at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:336)
      		at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBFullSnapshotOperation.closeSnapshotStreamAndGetHandle(RocksDBKeyedStateBackend.java:693)
      		at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBFullSnapshotOperation.closeCheckpointStream(RocksDBKeyedStateBackend.java:531)
      		at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$3.performOperation(RocksDBKeyedStateBackend.java:420)
      		at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$3.performOperation(RocksDBKeyedStateBackend.java:399)
      		at org.apache.flink.runtime.io.async.AbstractAsyncIOCallable.call(AbstractAsyncIOCallable.java:72)
      		at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      		at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40)
      		at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:897)
      		... 5 more
      	Caused by: com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$CompleteMultipartUploadHandler). Response Code: 200, Response Text: OK
      		at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738)
      		at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399)
      		at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
      		at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
      		at com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:2524)
      		at com.amazonaws.services.s3.transfer.internal.UploadMonitor.completeMultipartUpload(UploadMonitor.java:236)
      		at com.amazonaws.services.s3.transfer.internal.UploadMonitor.poll(UploadMonitor.java:183)
      		at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:152)
      		at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:50)
      		... 4 more
      	Caused by: com.amazonaws.AmazonClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$CompleteMultipartUploadHandler
      		at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:150)
      		at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseCompleteMultipartUploadResponse(XmlResponsesSaxParser.java:425)
      		at com.amazonaws.services.s3.model.transform.Unmarshallers$CompleteMultipartUploadResultUnmarshaller.unmarshall(Unmarshallers.java:200)
      		at com.amazonaws.services.s3.model.transform.Unmarshallers$CompleteMultipartUploadResultUnmarshaller.unmarshall(Unmarshallers.java:197)
      		at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
      		at com.amazonaws.services.s3.internal.ResponseHeaderHandlerChain.handle(ResponseHeaderHandlerChain.java:44)
      		at com.amazonaws.services.s3.internal.ResponseHeaderHandlerChain.handle(ResponseHeaderHandlerChain.java:30)
      		at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:712)
      		... 12 more
      	Caused by: java.net.SocketTimeoutException: Read timed out
      		at java.net.SocketInputStream.socketRead0(Native Method)
      		at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
      		at java.net.SocketInputStream.read(SocketInputStream.java:171)
      		at java.net.SocketInputStream.read(SocketInputStream.java:141)
      		at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
      		at sun.security.ssl.InputRecord.read(InputRecord.java:503)
      		at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
      		at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
      		at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
      		at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
      		at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
      		at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
      		at org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:266)
      		at org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227)
      		at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186)
      		at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
      		at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
      		at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
      		at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
      		at java.io.InputStreamReader.read(InputStreamReader.java:184)
      		at java.io.BufferedReader.fill(BufferedReader.java:161)
      		at java.io.BufferedReader.read1(BufferedReader.java:212)
      		at java.io.BufferedReader.read(BufferedReader.java:286)
      		at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
      		at org.apache.xerces.impl.XMLEntityScanner.skipSpaces(Unknown Source)
      		at org.apache.xerces.impl.XMLDocumentScannerImpl$TrailingMiscDispatcher.dispatch(Unknown Source)
      		at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
      		at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
      		at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
      		at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
      		at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
      		at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:141)
      		... 19 more
      	[CIRCULAR REFERENCE:java.io.IOException: Could not flush and close the file system output stream to s3://xxx/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa in order to obtain the stream state handle]
      

          People

            Assignee: Unassigned
            Reporter: Bowen Li (phoenixjiangnan)
            Votes: 0
            Watchers: 2
