Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Environment: Ubuntu x86_64/java-1.6/hadoop-2.0.3
- Summary: Rely on previous sync-points when syncing within the same RCFile and avoid unnecessary I/O
- Labels: rcfile, hive
Description
The following function performs some bad I/O:

    public synchronized void sync(long position) throws IOException {
      ...
      try {
        seek(position + 4); // skip escape
        in.readFully(syncCheck);
        int syncLen = sync.length;
        for (int i = 0; in.getPos() < end; i++) {
          int j = 0;
          for (; j < syncLen; j++) {
            if (sync[j] != syncCheck[(i + j) % syncLen]) {
              break;
            }
          }
          if (j == syncLen) {
            in.seek(in.getPos() - SYNC_SIZE); // position before sync
            return;
          }
          syncCheck[i % syncLen] = in.readByte();
        }
      }
      ...
    }
This causes a very large number of readByte() calls, each of which is forwarded to a ByteBuffer through a single-byte array.
As a result, a large amount of CPU is burnt in the linear search for the sync pattern in the input RCFile (up to 92% for a skewed example - a trivial map-join + limit 100).
Ideally this behaviour should be avoided altogether, or at least the linear search should be replaced by a rolling hash for efficient comparison, since the sync marker has a known width of 16 bytes.
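A rolling (Rabin-Karp style) hash over the 16-byte window would update the hash in O(1) per byte instead of re-running the inner byte-by-byte comparison at every offset. The following is a minimal, self-contained sketch of that idea, not the Hive code: the class name, the base constant, and the in-memory buffer are illustrative assumptions.

```java
import java.util.Arrays;

public class SyncScan {
    static final int BASE = 257;  // hash base (illustrative choice)

    // Returns the offset of the first occurrence of the pattern
    // (e.g. a 16-byte sync marker) in data, or -1 if absent.
    // The window hash is rolled forward one byte at a time; full
    // byte comparison only happens on a hash match.
    static int find(byte[] data, byte[] pattern) {
        int m = pattern.length;
        if (data.length < m) {
            return -1;
        }
        int patHash = 0, winHash = 0, pow = 1;
        for (int k = 0; k < m - 1; k++) {
            pow *= BASE;  // BASE^(m-1), wrapping mod 2^32 is fine
        }
        for (int k = 0; k < m; k++) {
            patHash = patHash * BASE + (pattern[k] & 0xff);
            winHash = winHash * BASE + (data[k] & 0xff);
        }
        for (int i = 0; ; i++) {
            // verify on hash match to rule out collisions
            if (winHash == patHash
                    && Arrays.equals(Arrays.copyOfRange(data, i, i + m), pattern)) {
                return i;
            }
            if (i + m >= data.length) {
                return -1;
            }
            // roll: drop data[i], add data[i + m]
            winHash = (winHash - (data[i] & 0xff) * pow) * BASE
                    + (data[i + m] & 0xff);
        }
    }

    public static void main(String[] args) {
        byte[] sync = new byte[16];
        for (int k = 0; k < 16; k++) {
            sync[k] = (byte) (0xA0 + k);
        }
        byte[] buf = new byte[1024];
        System.arraycopy(sync, 0, buf, 500, 16);  // plant the marker
        System.out.println(find(buf, sync));      // prints 500
    }
}
```

In the RCFile case the data would arrive from the stream in bulk reads rather than a preloaded array, but the per-position cost drops the same way: one multiply-add per byte instead of up to 16 comparisons.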
Attached is the stack trace from a YourKit profile.
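The summary's suggestion to rely on previous sync-points within the same RCFile amounts to caching the last sync point found and short-circuiting repeat requests. The toy model below illustrates only that caching idea; all names are hypothetical and the expensive scan is stubbed out, so this is not the actual resolution of this issue.

```java
public class SyncPointCache {
    long lastSync = -1;  // last sync point found (hypothetical cache field)
    int scans = 0;       // counts how many expensive scans were needed

    // Stub for the expensive linear scan: in RCFile this reads bytes
    // until the 16-byte sync marker is found. Here we just pretend
    // the marker sits a fixed distance past the requested position.
    private long expensiveScan(long position) {
        scans++;
        return position + 7;
    }

    // If a previously found sync point already covers the requested
    // position, reuse it and skip the I/O; otherwise scan and cache.
    public long sync(long position) {
        if (lastSync >= position) {
            return lastSync;
        }
        lastSync = expensiveScan(position);
        return lastSync;
    }
}
```

Repeated sync() calls landing at or before a known sync point then cost nothing, which is exactly the "avoid unnecessary I/O" part of the summary.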
Attachments
Issue Links
- relates to: HIVE-4423 Improve RCFile::sync(long) 10x (Closed)