HIVE-3992

Hive RCFile::sync(long) does a sub-sequence linear search for sync blocks

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11.0
    • Component/s: None
    • Labels: None
    • Environment: Ubuntu x86_64/java-1.6/hadoop-2.0.3

    • Rely on previous sync-points when syncing within the same RCFile and avoid unnecessary I/O
    • rcfile hive

    Description

      The following function performs inefficient I/O:

      public synchronized void sync(long position) throws IOException {
        ...
            try {
              seek(position + 4); // skip escape
              in.readFully(syncCheck);
              int syncLen = sync.length;
              for (int i = 0; in.getPos() < end; i++) {
                int j = 0;
                for (; j < syncLen; j++) {
                  if (sync[j] != syncCheck[(i + j) % syncLen]) {
                    break;
                  }
                }
                if (j == syncLen) {
                  in.seek(in.getPos() - SYNC_SIZE); // position before sync
                  return;
                }
                syncCheck[i % syncLen] = in.readByte();
              }
            }
      ...
          }
      

      This causes a large number of readByte() calls, each of which is passed on to a ByteBuffer via a single-byte array.
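
One way to picture an alternative to per-byte reads is to pull the stream in chunks and scan for the marker in memory. The following is only a minimal sketch under stated assumptions: the class `ChunkedSyncScanner`, the method `indexOfSync`, and the `chunkSize` parameter are hypothetical names for illustration and do not come from the Hive source or the attached patches. It assumes a non-empty marker and a chunk size of at least 1.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Hive: finds a marker using bulk reads
// instead of one readByte() call per candidate position.
final class ChunkedSyncScanner {

  /**
   * Returns the offset, relative to where the stream started, of the first
   * occurrence of marker, or -1 if the stream ends before a match is found.
   */
  static long indexOfSync(InputStream in, byte[] marker, int chunkSize)
      throws IOException {
    int m = marker.length;
    // Room for one chunk plus the tail carried over from the previous chunk.
    byte[] buf = new byte[chunkSize + m - 1];
    int carried = 0; // bytes kept from the previous iteration
    long base = 0;   // stream offset that buf[0] corresponds to

    while (true) {
      int n = in.read(buf, carried, buf.length - carried); // one bulk read
      if (n < 0) {
        return -1; // end of stream, no marker found
      }
      int filled = carried + n;

      // Plain in-memory comparison at every start position that fits.
      for (int i = 0; i + m <= filled; i++) {
        int j = 0;
        while (j < m && buf[i + j] == marker[j]) {
          j++;
        }
        if (j == m) {
          return base + i;
        }
      }

      // Keep the last m-1 bytes so a marker that straddles two chunks is seen.
      int keep = Math.min(m - 1, filled);
      System.arraycopy(buf, filled - keep, buf, 0, keep);
      base += filled - keep;
      carried = keep;
    }
  }
}
```

A caller would pass the already-positioned input stream and the 16-byte sync marker, then seek to the returned offset. The comparison is still linear, but the number of calls into the underlying stream drops from one per byte to one per chunk.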

      This results in a large amount of CPU being burnt in the linear search for the sync pattern in the input RCFile (up to 92% for a skewed example: a trivial map-join + limit 100).

      Ideally this search should be avoided altogether; failing that, it should at least be replaced by a rolling hash for efficient comparison, since the sync marker has a known width of 16 bytes.
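
The rolling-hash suggestion can be sketched as a Rabin-Karp style matcher: keep a hash of the last 16 bytes, update it in constant time per incoming byte, and do the full byte comparison only when the hash equals the marker's hash. This is an illustration of the suggestion, not the committed fix. The class `RollingSyncMatcher`, its methods, and the multiplier 31 are assumptions made for the example and are not taken from Hive.

```java
// Hypothetical sketch, not part of Hive: rolling-hash match of a fixed-width marker.
final class RollingSyncMatcher {
  private static final int BASE = 31; // arbitrary multiplier chosen for this sketch

  private final byte[] marker;
  private final int markerHash;
  private final int highPow;   // BASE^(len-1), weight of the oldest byte in the window
  private final byte[] window; // circular buffer holding the last len bytes
  private int windowHash = 0;
  private long count = 0;      // total bytes fed so far

  RollingSyncMatcher(byte[] marker) {
    this.marker = marker.clone();
    this.window = new byte[marker.length];
    int h = 0;
    int p = 1;
    for (int i = 0; i < marker.length; i++) {
      h = h * BASE + (marker[i] & 0xff); // int overflow acts as mod 2^32
      if (i > 0) {
        p *= BASE;
      }
    }
    this.markerHash = h;
    this.highPow = p;
  }

  /** Feeds one byte; returns true if the last marker.length bytes equal the marker. */
  boolean feed(byte b) {
    int slot = (int) (count % window.length);
    if (count >= window.length) {
      // Window is full: remove the contribution of the byte about to be overwritten.
      windowHash -= (window[slot] & 0xff) * highPow;
    }
    windowHash = windowHash * BASE + (b & 0xff);
    window[slot] = b;
    count++;
    // Verify byte-by-byte only on a hash hit, to rule out collisions.
    return count >= window.length
        && windowHash == markerHash
        && windowEqualsMarker();
  }

  private boolean windowEqualsMarker() {
    int start = (int) (count % window.length); // index of the oldest byte
    for (int j = 0; j < marker.length; j++) {
      if (window[(start + j) % window.length] != marker[j]) {
        return false;
      }
    }
    return true;
  }

  /** Convenience wrapper: start offset of the first match in data, or -1. */
  static int indexOf(byte[] data, byte[] marker) {
    RollingSyncMatcher m = new RollingSyncMatcher(marker);
    for (int i = 0; i < data.length; i++) {
      if (m.feed(data[i])) {
        return i - marker.length + 1;
      }
    }
    return -1;
  }
}
```

The hash only makes each comparison cheap; it does not by itself reduce how many bytes are read from the stream, so it would still need to be fed from a buffered read rather than from individual readByte() calls.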

      The stack trace from a YourKit profile is attached.

      Attachments

        1. HIVE-3992.3.patch
          4 kB
          Gopal Vijayaraghavan
        2. HIVE-3992.2.patch
          3 kB
          Gopal Vijayaraghavan
        3. HIVE-3992.patch
          3 kB
          Gopal Vijayaraghavan
        4. select-join-limit.html
          146 kB
          Gopal Vijayaraghavan

            People

              Assignee: Gopal Vijayaraghavan (gopalv)
              Reporter: Gopal Vijayaraghavan (gopalv)