* Motivation

A pipelined scan API is introduced to speed up applications that combine massive data traversal with compute-intensive processing. Traditional HBase scans save network round trips by prefetching data into the client-side cache. However, they prefetch synchronously: the fetch request to the regionserver is issued only when the entire cache has been consumed. This leads to a stop-and-wait access pattern, in which the client stalls until the next chunk of data is fetched. Applications that do significant processing can benefit from background data prefetching, which eliminates this bottleneck.

The pipelined scan implementation overlaps cache population on the client side with application processing. Namely, it issues a new scan RPC once the iteration has consumed 50% of the cache. If the application processing time (that is, the time between invocations of next()) is substantial, the new chunk of data will be available before the previous one is exhausted, and the client will not experience any delay. Ideally, the prefetch and processing times should be balanced.

* API and Configuration

Asynchronous scanning can be configured either globally for all tables and scans, or on a per-scan basis via a new Scan class API.

1. Configuration in hbase-site.xml
- hbase.client.scanner.async.prefetch, default false. To enable it globally, set the property hbase.client.scanner.async.prefetch to true.

2. API
- Scan#setAsyncPrefetch(boolean)

{code}
Scan scan = new Scan();
scan.setCaching(1000);
scan.setMaxResultSize(BIG_SIZE);
// Enable background prefetching for this scan only
scan.setAsyncPrefetch(true);
...
ResultScanner scanner = table.getScanner(scan);
{code}

* Implementation Notes

Pipelined scan is implemented by a new ClientAsyncPrefetchScanner class, which is fully API-compatible with the synchronous ClientSimpleScanner. ClientAsyncPrefetchScanner is not instantiated for small (Scan#setSmall) or reversed (Scan#setReversed) scanners.
The application is responsible for setting the prefetch size so that the prefetch time and the processing time are balanced. Note that, because of double buffering, the client-side cache can use up to twice as much memory as with the synchronous scanner.
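
The prefetch-at-50% behavior described above can be illustrated with a small, self-contained sketch. This is not the ClientAsyncPrefetchScanner code: the class name HalfCachePrefetcher and the Supplier that stands in for a scan RPC are hypothetical, and the sketch assumes that an empty chunk signals the end of the data. It only shows how a background fetch started at the halfway point overlaps with the caller's processing, and why up to two chunks may be held in memory at once.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Illustrative sketch only; a hypothetical class, not part of HBase. */
class HalfCachePrefetcher<T> {
    // Stands in for one scan RPC; assumed to return an empty list at end of data.
    private final Supplier<List<T>> fetchChunk;
    private final Queue<T> cache = new ArrayDeque<>();
    // In-flight background fetch, or null if none has been started.
    private CompletableFuture<List<T>> pending;
    private int chunkSize;     // size of the chunk most recently loaded into the cache
    private int consumed;      // rows handed out since that chunk was loaded
    private boolean exhausted; // set once an empty chunk has been received

    HalfCachePrefetcher(Supplier<List<T>> fetchChunk) {
        this.fetchChunk = fetchChunk;
    }

    /** Returns the next row, or null when the source is exhausted. */
    T next() {
        if (cache.isEmpty()) {
            if (exhausted) {
                return null;
            }
            // Cache ran dry: wait for the background fetch if one is in flight,
            // otherwise fetch synchronously (this happens on the very first call).
            List<T> chunk = (pending != null) ? pending.join() : fetchChunk.get();
            pending = null;
            if (chunk.isEmpty()) {
                exhausted = true;
                return null;
            }
            cache.addAll(chunk);
            chunkSize = chunk.size();
            consumed = 0;
        }
        T row = cache.poll();
        consumed++;
        // Once half of the current chunk has been consumed, start fetching the
        // next chunk in the background so it overlaps with caller processing.
        if (pending == null && consumed * 2 >= chunkSize) {
            pending = CompletableFuture.supplyAsync(fetchChunk);
        }
        return row;
    }
}
```

If the caller spends enough time between next() calls, pending.join() returns immediately and no stall is observed. If processing is fast relative to the fetch, the caller still waits in join(), which is the imbalance the note above asks the application to tune away.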