Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- 2.0.4-alpha
- None
- Centos (EC2) + short-circuit reads on
Description
When short-circuit reads are on, the HDFS client slows down when checksums are turned off.
With checksums on, the query takes 45.341 seconds; with checksums off, it takes 56.345 seconds. The checksum-off time is slower than the times observed when short-circuit reads are turned off.
The issue seems to be that FSDataInputStream.readByte() calls are directly transferred to the disk fd when the checksums are turned off.
Even though all the columns are integers, the data being read will be read via DataInputStream which does
public final int readInt() throws IOException {
    int ch1 = in.read();
    int ch2 = in.read();
    int ch3 = in.read();
    int ch4 = in.read();
To confirm, an strace of the Yarn container shows
26690 read(154, "B", 1) = 1
26690 read(154, "\250", 1) = 1
26690 read(154, ".", 1) = 1
26690 read(154, "\24", 1) = 1
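The pattern in the strace can be reproduced without HDFS. The following is a minimal sketch, not code from Hive or the HDFS client: `ReadIntCallCount` and `CountingStream` are hypothetical names, and an in-memory stream stands in for the disk fd. It counts how many read calls reach the underlying stream when integers are pulled through DataInputStream, with and without a BufferedInputStream in between. On a JDK whose readInt() matches the excerpt above, the unbuffered count is four calls per integer; other JDK versions may issue fewer. Buffering is shown only as a common mitigation and is not necessarily the fix applied in HDFS.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadIntCallCount {
    /** Hypothetical helper: counts every read call that reaches the wrapped stream. */
    static final class CountingStream extends FilterInputStream {
        int calls = 0;

        CountingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Reads `ints` integers and returns the number of read calls seen by the underlying stream. */
    static int countCalls(int ints, boolean buffered) throws IOException {
        // The in-memory byte array is a stand-in for the file behind the disk fd.
        CountingStream raw = new CountingStream(new ByteArrayInputStream(new byte[ints * 4]));
        InputStream src = buffered ? new BufferedInputStream(raw, 8192) : raw;
        DataInputStream data = new DataInputStream(src);
        for (int i = 0; i < ints; i++) {
            data.readInt();
        }
        return raw.calls;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("unbuffered calls: " + countCalls(1000, false));
        System.out.println("buffered calls:   " + countCalls(1000, true));
    }
}
```

When the underlying stream is a raw file descriptor, each counted call corresponds to a system call like the ones in the strace, which is why the unbuffered path is expensive.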
To emulate this without the entirety of Hive code, I have written a simpler test app
https://github.com/t3rmin4t0r/shortcircuit-reader
The jar reads a file in buffers of -bs <n> bytes. Running it with a 1-byte buffer size gives results similar to the Hive test run.
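For readers who do not want to build the linked project, the idea behind such a test can be sketched as follows. This is not the code from that repository: `BufferSizeReader` is a hypothetical name, and it reads a local file through an unbuffered FileInputStream rather than going through the HDFS client, so it only illustrates how the buffer size drives the number of read calls and the elapsed time.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BufferSizeReader {
    /** Reads the whole file in chunks of bufferSize bytes and returns the number of bytes read. */
    static long readAll(Path file, int bufferSize) throws IOException {
        if (bufferSize < 1) {
            throw new IllegalArgumentException("bufferSize must be at least 1");
        }
        byte[] buf = new byte[bufferSize];
        long total = 0;
        // FileInputStream is unbuffered, so each read(buf) goes to the OS.
        try (InputStream in = new FileInputStream(file.toFile())) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Assumed usage: BufferSizeReader <file> [bufferSize]; the default of 1 mimics byte-at-a-time reads.
        Path file = Paths.get(args[0]);
        int bufferSize = args.length > 1 ? Integer.parseInt(args[1]) : 1;
        long start = System.nanoTime();
        long bytes = readAll(file, bufferSize);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.println(bytes + " bytes read with " + bufferSize + "-byte buffers in " + seconds + " s");
    }
}
```

Running it once with a buffer size of 1 and once with a larger value such as 4096 on the same file gives a rough local analogue of the comparison described above.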
Attachments
Issue Links
- blocks: HDFS-4922 Improve the short-circuit document (Open)
- duplicates: HDFS-5634 allow BlockReaderLocal to switch between checksumming and not (Closed)
- is depended upon by: HDFS-4922 Improve the short-circuit document (Open)
- is related to: HDFS-5634 allow BlockReaderLocal to switch between checksumming and not (Closed)
- relates to: HDFS-4960 Unnecessary .meta seeks even when skip checksum is true (Patch Available)