Hadoop HDFS / HDFS-4949 Centralized cache management in HDFS / HDFS-5191

revisit zero-copy API in FSDataInputStream to make it more intuitive

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: HDFS-4949
    • Fix Version/s: HDFS-4949
    • Component/s: hdfs-client, libhdfs
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      As per the discussion on HDFS-4953, we should revisit the zero-copy API to make it more intuitive for new users.

      Attachments

      1. HDFS-5191-caching.010.patch (129 kB) - Colin Patrick McCabe
      2. HDFS-5191-caching.009.patch (128 kB) - Colin Patrick McCabe
      3. HDFS-5191-caching.008.patch (129 kB) - Colin Patrick McCabe
      4. HDFS-5191-caching.007.patch (131 kB) - Colin Patrick McCabe
      5. HDFS-5191-caching.006.patch (131 kB) - Colin Patrick McCabe
      6. HDFS-5191-caching.003.patch (94 kB) - Colin Patrick McCabe
      7. HDFS-5191-caching.001.patch (60 kB) - Colin Patrick McCabe

          Activity

          Colin Patrick McCabe added a comment -

          The current proposal looks like this:

          /**
           * Read from the current location at least 1 and up to maxLength bytes. In most situations, the returned
           * buffer will contain maxLength bytes unless either:
           *  * the read crosses a block boundary and zero copy is being used
           *  * the stream has fewer than maxLength bytes left
           * The returned buffer will either be one that was created by the factory or a MappedByteBuffer.
           */
          public ByteBuffer read(ByteBufferFactory factory, int maxLength) throws IOException;
          
          /**
           * Release a buffer that was returned from readByteBuffer. If the buffer was created by the factory,
           * it will be returned to the factory.
           */
          public void releaseByteBuffer(ByteBufferFactory factory, ByteBuffer buffer);
          
          /**
           * Allow application to manage how ByteBuffers are created for fallback buffers. Only buffers created by
           * the factory will be released to it.
           */
          public interface ByteBufferFactory {
            ByteBuffer createBuffer(int capacity);
            void releaseBuffer(ByteBuffer buffer);
          }
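
          A minimal usage sketch of the calls proposed above (illustrative only; the stream and factory variables are assumed to already exist, and EOF and error handling are omitted):

          ByteBuffer buf = stream.read(factory, 4 * 1024 * 1024);
          try {
            // buf is either a buffer created by the factory or a MappedByteBuffer
            // backed by cached block data; consume it here.
          } finally {
            stream.releaseByteBuffer(factory, buf);
          }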
          
          Colin Patrick McCabe added a comment -

          I would propose changing read to have a varargs parameter of type ReadOption:

          public ByteBuffer read(ByteBufferFactory factory, int maxLength, ReadOption...)
          

          This will let us add options in the future easily. One ReadOption that we probably want right away is the ability to use mmap even on data that is not cached. This will be useful when reading data that contains its own checksum-- as many file formats do.
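
          A sketch of what such an option enum might look like (the constant name here is illustrative, chosen to match the use case above, not necessarily the name that ends up in the patch):

          public enum ReadOption {
            /** Allow mmap even for uncached data; the caller verifies its own checksums. */
            SKIP_CHECKSUMS
          }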

          Luke Lu added a comment -

          The overhead of using varargs seems excessive (up to 30x timing-wise, and far more garbage created for the temporary arrays, using Sun JDK 1.6.0.39) for a performance API. Would just one ReadOption, or an EnumSet of such options, suffice?
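
          To illustrate the allocation cost (a sketch, not from the patch): every varargs call compiles to an array allocation, whereas an EnumSet can be built once and reused across reads.

          // Each varargs call compiles to roughly:
          //   is.read(factory, maxLength, new ReadOption[] { ReadOption.SKIP_CHECKSUMS });
          // An EnumSet, by contrast, can be allocated once up front and reused:
          EnumSet<ReadOption> opts = EnumSet.of(ReadOption.SKIP_CHECKSUMS);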

          Suresh Srinivas added a comment -

          The overhead of using varargs seems excessive (up to 30x timing wise and way more garbage created for the temp arrays using Sun JDK 1.6.0.39) for a performance API.

          While varargs might be expensive, in the context of current API, I do not think the cost you are alluding to seems exaggerated. That said, from clean API perspective, if an EnumSet can get the job done, it might be better.

          Colin Patrick McCabe added a comment -

          Good point... I think EnumSet is a better choice

          Owen O'Malley added a comment -

          +1 for EnumSet

          Colin Patrick McCabe added a comment -

          I implemented the API we've been discussing.

          Some changes: I used EnumSet rather than varargs, to avoid the varargs overhead.
          The byte buffer release method doesn't need to take a factory, because we track what factory created the ByteBuffer anyway. (We have to do tracking for the memory mapped buffers).
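
          A sketch of what the revised signatures might look like after those changes (method names are illustrative, not necessarily the ones in the patch):

          public ByteBuffer read(ByteBufferFactory factory, int maxLength,
              EnumSet<ReadOption> opts) throws IOException;

          /** No factory argument: the stream tracks whether the buffer came from the factory or from an mmap. */
          public void releaseBuffer(ByteBuffer buffer);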

          Andrew Wang added a comment -

          Could someone else take a hack at reviewing this (e.g. Owen, Suresh, Luke)? It'd be nice to get some external verification, else I'll get to it next week.

          Chris Nauroth added a comment -

          I am planning to review tomorrow.

          Chris Nauroth added a comment -

          I reviewed the current API and caught up on all prior discussion here and in HDFS-4953. Colin Patrick McCabe, thank you for incorporating the feedback. I'm going to focus on the interface here. I'll review the implementation details soon too. I like to review APIs by writing a little code against them. This is what I came up with (imports trimmed for brevity):

          class Main extends Configured implements Tool {
            private static final int BUFFER_MAX_LENGTH = 4 * 1024 * 1024; // 4 MB
            private static final Log LOG = LogFactory.getLog(Main.class);
          
            @Override
            public int run(String[] args) {
              FileSystem fs = null;
              FSDataInputStream is = null;
          
              try {
                fs = FileSystem.get(this.getConf());
                is = fs.open(new Path(args[0]));
                ByteBufferFactory bbf = new SimpleByteBufferFactory();
                EnumSet<ReadOption> readOpts = EnumSet.noneOf(ReadOption.class);
                for (;;) {
                  ByteBuffer bb = is.read(bbf, BUFFER_MAX_LENGTH, readOpts);
                  if (bb == null) break; // EOF
                  // Use data from the buffer here.
                  bbf.putBuffer(bb);
                }
              } catch (IOException e) {
                LOG.error("Unexpected IOException", e);
              } finally {
                IOUtils.cleanup(LOG, is, fs);
              }
          
              return 0;
            }
          
            public static void main(String[] args) throws Exception {
              System.exit(ToolRunner.run(new Main(), args));
            }
          }
          

          Can someone check that this code (written by someone looking at the API for the first time) is correct? If so, then that's a good sign that we're on the right track. I think this code sample shows that most of the earlier feedback has been addressed. Specifically:

          1. It's a single interface for client read code for both the cached and uncached case. I didn't need to check flags or catch a special exception or downcast to an HDFS class to handle copying vs. zero-copy.
          2. There is generally less code for the client to write, because there are fewer things that the client needs to control. (i.e. no setAllowShortReads)
          3. I did not need to preallocate a fallback buffer that wouldn't be used in the mmap'd case.
          4. I did not need to catch exceptions for main flow control.
          5. The ByteBufferFactory interface would allow the client to control ownership of direct buffers for fallback (and the SimpleByteBufferFactory ships a default implementation that does this).
          6. There is no sharing of mutable state between implicitly connected objects (streams and cursors).
          7. It looks close to the kind of code sample Owen O'Malley wanted to achieve in the comments of HDFS-4953.
          8. There had been discussion of returning array of ByteBuffer vs. returning a single ByteBuffer with possibility of short read. My preference is for the latter. The former would have required me to write another nested loop. I need to code for short read anyway in case the final read before EOF doesn't match my desired read length.

          I think the last remaining open question is around the need for a client to check if zero-copy is available and potentially run a different code path. One way we could achieve that now is by adding a RejectingByteBufferFactory that always throws, and clients that want that behavior can use it instead of SimpleByteBufferFactory. Does anyone have a concrete use case for this? Without a concrete use case, it's hard to say if the interface is sufficient.
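
          A minimal sketch of what such a factory could look like, assuming the ByteBufferFactory interface proposed earlier in this issue (illustrative, not part of the patch):

          public class RejectingByteBufferFactory implements ByteBufferFactory {
            @Override
            public ByteBuffer createBuffer(int capacity) {
              throw new UnsupportedOperationException(
                  "zero-copy read is unavailable and fallback buffers are disallowed");
            }

            @Override
            public void releaseBuffer(ByteBuffer buffer) {
              // Never reached: createBuffer never hands out a buffer.
            }
          }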

          Here are some suggestions, mostly minor adjustments on top of the current patch:

          1. Shall we add overloads of FSDataInputStream#read that don't require the caller to pass ByteBufferFactory (assumes caller wants a new SimpleByteBufferFactory) and don't require the caller to pass read opts (assumes none)?
          2. Can we rename ByteBufferFactory to ByteBufferPool? Luke Lu had made a similar suggestion. "Factory" implies per-call instance creation and "pool" communicates that it needs to control the lifecycle of instances.
          3. Can we move those classes to the .io sub-package? They aren't coupled to anything in .fs.
          4. We need to fully document the expected contract of ByteBufferPool for implementers. Some factors to consider:
            1. Thread-safety - Is it required for the implementation to guarantee thread-safety internally? (I assume thread-safety is only required if the caller intends to use the same one from multiple threads. Whatever the answer, let's document it.)
            2. Null - Is null a legal return value? Is it possible for callers to pass null arguments? (Ideally, null is forbidden, but whatever the answer, let's document it.)
            3. Bounds-checking - What should implementers do if given a negative length? (Runtime exception?)
          5. When the FSDataInputStream is not HasEnhancedByteBufferAccess, we try to fallback to a copying read. However, this failed when I tested my sample code against a local file system, because it ultimately still called FSDataInputStream#read(ByteBuffer), which throws UnsupportedOperationException in the base class. We'll need to fix this to achieve the goal of clients writing a single code path that works for any FileSystem.
          Colin Patrick McCabe added a comment -

          My feeling is that we should support passing a null ByteBufferFactory to mean "don't create fallback ByteBuffers." Using a RejectingByteBufferFactory is another reasonable choice, but it would require more typing for some people.

          I'll add an overload for the empty EnumSet case.

          ByteBufferPool is a better name, I agree.

          I suppose, given that FSDataInputStream is in org.apache.hadoop.io, ByteBufferPool/Factory should be as well.

          ByteBufferPool implementations don't need thread-safety unless multiple read calls are going to be made in parallel using the same pool. I'll add that information to the JavaDoc.

          I agree that the "fallback fallback" path is something that still needs to be done. The problem is, there isn't a very efficient way to do it, since we'd have to read into a byte array, and then copy to the direct byte buffer. We could do better, if we could ask the ByteBufferPool for a non-direct buffer. (i.e., an array-backed buffer). Will this "fallback fallback" case be common enough to motivate this kind of API?

          The disadvantage of this is that then our read function would sometimes return direct byte buffers, and sometimes not, which could lead to code working on local filesystems, and then failing on HDFS (if it tried to call ByteBuffer#array).
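
          For illustration, one shape such a pool API could take (a sketch; the names are not necessarily what gets committed):

          public interface ByteBufferPool {
            /** direct == false requests an array-backed (heap) buffer, usable by the copying fallback path. */
            ByteBuffer getBuffer(boolean direct, int length);

            /** Return a buffer obtained from getBuffer so it can be reused. */
            void putBuffer(ByteBuffer buffer);
          }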

          Colin Patrick McCabe added a comment -
          • revise libhdfs bindings
          • avoid need for wrappers around ByteBuffers when storing them
          Chris Nauroth added a comment -

          Passing null to mean "don't create fallback buffers" sounds fine.

          For "fallback fallback", it does sound useful for the ByteBufferPool to be able to provide a guaranteed array-backed ByteBuffer.

          Regarding impact of returning direct vs. non-direct buffers to clients, I think this is acceptable considering that it is part of the existing contract of ByteBuffer. Clients are expected to check hasArray before attempting array operations.
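
          For example, following that contract, a caller would do something like the following before touching the backing array (process and processDirect are hypothetical application methods; the other variable names follow the earlier sample code):

          ByteBuffer bb = is.read(pool, BUFFER_MAX_LENGTH, readOpts);
          if (bb != null) {
            if (bb.hasArray()) {
              // Heap buffer from the pool: the array path is available.
              process(bb.array(), bb.arrayOffset() + bb.position(), bb.remaining());
            } else {
              // Direct or memory-mapped buffer: stay on the ByteBuffer API.
              processDirect(bb);
            }
          }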

          Colin Patrick McCabe added a comment -
          • add the ability for the ByteBufferPool to create non-direct buffers.
          • fix JNI support and add JNI support for non-direct buffers.
          • add some unit tests
          Colin Patrick McCabe added a comment -

          fix typo in test_libhdfs_zerocopy

          Chris Nauroth added a comment -

          I think this is mostly looking really good. I still haven't reviewed the tests and the libhdfs changes, so I'll look at that on Monday. Just a couple of things I noticed in this pass:

          1. TestByteBufferUtil has a single empty test. Is this the right version of that test suite?
          2. Was there a change intended in DFSUtil? Right now, the patch shows just addition of several imports.
          Colin Patrick McCabe added a comment -

          I decided to test TestByteBufferUtil#fallbackRead inside TestEnhancedByteBufferAccess. The other file is not needed-- I will remove.

          Removed extra imports in DFSUtil.

          Chris Nauroth added a comment -

          It looks like there are still some import-only changes in DFSUtil.

              if (buffer == null) {
                throw new UnsupportedOperationException("zero-copy reads " +
                    "were not available, and the ByteBufferPool did not provide " +
                    "us with a " + (useDirect ? " direct" : "indirect") +
                    "buffer.");
              }
          

          Minor nit: this exception message would have an extraneous space in the direct case (i.e. "did not provide us with a direct"), and no space between the last two words (i.e. "did not provide us with a indirectbuffer").

                while (true) {
                  countingVisitor.reset();
                  mmapManager.visitEvictable(countingVisitor);
                  if (0 == countingVisitor.count) {
                    break;
                  }
                }
          

          This test would cause an infinite loop if a bug was introduced that left mmaps lingering in evictable state, which could be hard to diagnose. Should we use GenericTestUtils#waitFor, so that there is a timeout?
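
          Something along these lines, assuming GenericTestUtils#waitFor takes a Guava Supplier<Boolean> plus poll and timeout intervals in milliseconds (a sketch, not the final test code):

          GenericTestUtils.waitFor(new Supplier<Boolean>() {
            @Override
            public Boolean get() {
              countingVisitor.reset();
              mmapManager.visitEvictable(countingVisitor);
              return countingVisitor.count == 0;
            }
          }, 10, 60000);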

          Lastly, can you help clarify something about the data structure behind IdentityHashStore for me? In particular, I'm wondering about the math in realloc:

            private void realloc(int newCapacity) {
              Preconditions.checkArgument(newCapacity > 0);
              Object prevBuffer[] = buffer;
              this.capacity = newCapacity;
              int numElements = 1 + (2 * newCapacity);
              this.buffer = new Object[2 * numElements];
              this.numInserted = 0;
              if (prevBuffer != null) {
                for (int i = 0; i < prevBuffer.length; i += 2) {
                  if (prevBuffer[i] != null) {
                    putInternal(prevBuffer[i], prevBuffer[i + 1]);
                  }
                }
              }
            }
          

          put will call realloc to double capacity when needed. numElements is double the new capacity plus one extra. Then, the size is doubled again when allocating the new array. Therefore the growth pattern looks like:

          capacity == 2, buffer.length == 10
          capacity == 4, buffer.length == 18
          capacity == 8, buffer.length == 34

          It looks like this causes a fair amount of extra unused slack in the array, because only 2 * numInserted elements are used at a time. Is the slack required, or should realloc only double the size once instead of twice? Also, I wasn't sure why we needed 1 extra in the calculation of numElements. Is that a defensive thing so that the loop in putInternal always terminates? I would expect the check of capacity in put to be sufficient to prevent that without needing an extra sentinel element.

          On to the libhdfs changes...

          Colin Patrick McCabe added a comment -

          re: test timeouts, I like to add them at the test level using @Test(timeout=...), rather than at every part of the test.

          re: hash table sizes. Any time you insert something into IdentityHashStore, you use two array slots... one for the key, and another for the value. The other 2x factor is because the hash table should stay half empty, to avoid the risk of collisions. In other words, we have a load factor of 0.50. This is especially important since every element takes up space (we don't use separate chaining).

          After thinking about it, I don't think we need the extra +1 element. It seems that System#identityHashCode provides a well-distributed enough hash that dividing by a power of two size works well.

          IdentityHashStore was necessary because ByteBuffer#hash and ByteBuffer#equals were just very unsuitable for what I was trying to do. And there's no way to parameterize HashTable to use different functions. Another nice advantage of IdentityHashStore is that it does zero allocations, unless you are growing the hash table. This might seem like a trivial point, but given how frequently read is called, it's nice to avoid generating garbage. We learned that when dealing with the edit log code...
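
          To make the sizing concrete (illustrative arithmetic only, not the actual IdentityHashStore code): with capacity live entries, the table keeps 2 * capacity key/value pair slots so that it stays at most half full, which works out to 4 * capacity Object references, and a key's slot can be found by masking against the power-of-two slot count:

          int numPairSlots = 2 * capacity;                              // load factor 0.50
          int slot = System.identityHashCode(key) & (numPairSlots - 1); // power-of-two mask
          int index = 2 * slot;                                         // key at index, value at index + 1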

          Chris Nauroth added a comment -

          In other words, we have a load factor of 0.50.

          That clears it right up for me. Thanks! Do you mind dropping a quick comment on the second 2x multiplication?

          Colin Patrick McCabe added a comment -

          [import-only changes in DFSUtil.java]

          fixed

          ...exception message would have an extraneous space

          fixed

          Should we use GenericTestUtils#waitFor, so that there is a timeout?

          Fixed to use waitFor here. Even if there is an overall test timeout, it's nicer to use GenericTestUtils#waitFor.

          I also added a comment to IdentityHashStore about the 0.50 load factor, and got rid of the extra slot, so hash table size is now always a power of 2.

          Chris Nauroth added a comment -

          Great, thank you for incorporating the feedback.

          even if there is an overall test timeout, it's nicer to use GenericTestUtils#waitFor.

          Agreed, it helps narrow the problem down to a specific point in the test code. Thanks for making that change.

          libhdfs looks good too. Just a couple of minor things:

               * You must free all options structures allocated with this function using
               * readZeroOptionsFree.
          

          Change readZeroOptionsFree to hadoopRzOptionsFree.

               *                   This buffer will continue to be valid and readable
               *                   until it is released by readZeroBufferFree.  Failure to
          

          Change readZeroBufferFree to hadoopRzBufferFree.

              hadoopRzOptionsClearCached(env, opts);
              opts->cachedEnumSet = NULL;
          

          Setting cachedEnumSet to NULL isn't required here, because hadoopRzOptionsClearCached already does it.

          I'll be +1 for the whole patch after this. I'm also interested in getting the zero-copy API and client-side mmap'ing changes merged to trunk and branch-2 (but not the additional datanode caching/mlock'ing that's in the HDFS-4949 branch). This will make it easier for clients like ORC to start coding against the APIs. If necessary, I can volunteer to put together a merge patch with just the zero-copy changes.

          Chris Nauroth added a comment -

          I forgot one more thing. There is still an UnsupportedOperationException thrown in the "fallback fallback" case. The reason is that in ByteBufferUtil#fallbackRead, the stream argument can be an FSDataInputStream. This class implements ByteBufferReadable, so useDirect always comes out true, even if the underlying stream inside the FSDataInputStream doesn't support it.

          Here is a potential change that fixes it. With this in place, we fall through to the array-copying code path, and I don't see the UnsupportedOperationException. Colin, do you want to incorporate this into the patch (or something like it)?

              final boolean useDirect;
              if (stream instanceof FSDataInputStream) {
                FSDataInputStream fsdis = (FSDataInputStream)stream;
                useDirect = fsdis.getWrappedStream() instanceof ByteBufferReadable;
              } else {
                useDirect = stream instanceof ByteBufferReadable;
              }
          
          Colin Patrick McCabe added a comment -

          Change readZeroOptionsFree to hadoopRzOptionsFree.

          fixed this and others

          Setting cachedEnumSet to NULL isn't required here, because hadoopRzOptionsClearCached already does it.

          removed

          There is still an UnsupportedOperationException thrown in the "fallback fallback" case...

          Fixed, based on your suggestion. Thanks.

          Colin Patrick McCabe added a comment -

          I'm also interested in getting the zero-copy API and client-side mmap'ing changes merged to trunk and branch-2 (but not the additional datanode caching/mlock'ing that's in the HDFS-4949 branch). This will make it easier for clients like ORC to start coding against the APIs. If necessary, I can volunteer to put together a merge patch with just the zero-copy changes.

          Based on some earlier discussions, we're planning on merging HDFS-4949 to trunk in a relatively quick timeframe (a few weeks at most).

          However, the zero-copy stuff is easy to separate out, and if you want to put together a merge patch this week (or even later today), Andrew and I will certainly review it.

          Chris Nauroth added a comment -

          +1 for version 10 of the patch. Colin, thank you very much for addressing all of the feedback.

          Colin Patrick McCabe added a comment -

          committed to HDFS-4949 branch. thanks all

          Suresh Srinivas added a comment -

          Colin Patrick McCabe, given that this is an enabler for HDFS-4949, but an independent change, can this be committed to trunk and 2.3 release? This helps in this functionality getting more testing, especially from Hive/ORC side. I or Chris Nauroth can do the merges.

          Colin Patrick McCabe added a comment -

          can this be committed to trunk and 2.3 release?

          Sure. If you prepare a patch which adds zero-copy support to 2.3, I'll review it. (In addition to this patch, you will also have to backport HDFS-4953, either separately or as part of the same backport.)

          Chris Nauroth added a comment -

          I created HDFS-5260 to track the merge to trunk and branch-2.


            People

            • Assignee: Colin Patrick McCabe
            • Reporter: Colin Patrick McCabe
            • Votes: 0
            • Watchers: 13
