[SPARK-13510] Shuffle may throw FetchFailedException: Direct buffer memory - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Problem
Affects Version/s: 1.6.0
Fix Version/s: None
Component/s: Spark Core
Labels: None

Description

In our cluster, when I tested spark-1.6.0 with a SQL query, it threw an exception and failed.

16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request for 1 blocks (915.4 MB) from 10.196.134.220:7337
16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch from 10.196.134.220:7337 (executor id 122)
16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 to /10.196.134.220:7337
16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in connection from /10.196.134.220:7337
java.lang.OutOfMemoryError: Direct buffer memory
	at java.nio.Bits.reserveMemory(Bits.java:658)
	at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
	at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
	at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
	at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:212)
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
	at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
	at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
	at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
	at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
	at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
	at java.lang.Thread.run(Thread.java:744)
16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from /10.196.134.220:7337 is closed
16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block shuffle_3_81_2, and will not retry (0 retries)

The reason is that when a big block (e.g. 1 GB) is shuffled, the task allocates a buffer of the same size, so it easily throws "FetchFailedException: Direct buffer memory".
If I add -Dio.netty.noUnsafe=true to spark.executor.extraJavaOptions, it throws the following instead:

java.lang.OutOfMemoryError: Java heap space
        at io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607)
        at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237)
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:215)
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)

In MapReduce shuffle, it first checks whether the block can be cached in memory, but Spark doesn't.
If the block is larger than what we can cache in memory, we should write it to disk.
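
The check proposed above could look roughly like the following sketch. It is not Spark's or MapReduce's actual code: the class FetchSink, the method receive, and the maxInMemoryBytes threshold are hypothetical names chosen for illustration. It assumes the block size is known before the data arrives (as it is in the "Sending request for 1 blocks (915.4 MB)" log line) and streams oversized blocks to a temporary file in small chunks instead of allocating one buffer of the full block size.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Spark: decides where a fetched block goes.
final class FetchSink {

    // Where the block ended up. Exactly one of the two fields is non-null.
    static final class Result {
        final byte[] inMemory;
        final Path onDisk;

        Result(byte[] inMemory, Path onDisk) {
            this.inMemory = inMemory;
            this.onDisk = onDisk;
        }

        long size() {
            return inMemory != null ? inMemory.length : onDisk.toFile().length();
        }
    }

    // blockSize is the size announced for the block; maxInMemoryBytes is an
    // assumed, caller-chosen limit on what may be buffered in memory.
    static Result receive(InputStream in, long blockSize, long maxInMemoryBytes) {
        try {
            if (blockSize <= maxInMemoryBytes) {
                // Small block: keep it in memory.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                copy(in, buf);
                return new Result(buf.toByteArray(), null);
            }
            // Large block: stream it to a temp file, never holding more than
            // one chunk of it in memory at a time.
            Path spill = Files.createTempFile("shuffle-block-", ".data");
            try (OutputStream out = Files.newOutputStream(spill)) {
                copy(in, out);
            }
            return new Result(null, spill);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Copies the stream in fixed 64 KiB chunks.
    private static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] chunk = new byte[64 * 1024];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
    }
}
```

With a limit of, say, 64 bytes, a 16-byte block comes back in Result.inMemory and a 256-byte block comes back as a file in Result.onDisk; the caller is responsible for deleting that file once the block has been consumed.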

Attachments

spark-13510.diff
21/Jun/16 01:24
57 kB
shenh062326

Issue Links

duplicates

SPARK-19659 Fetch big blocks to disk when shuffle-read

Resolved

People

Assignee: Unassigned

Reporter: shenh062326

Votes: 0

Watchers: 10

Dates

Created: 26/Feb/16 09:09

Updated: 28/Feb/20 07:21

Resolved: 01/Mar/16 11:53