Accumulo currently provides mechanisms to initiate bulk imports and to list bulk imports in progress. Scheduling of bulk import requests is not entirely deterministic, and most of the execution of a bulk import request is done in a non-preemptable manner. As such, a single long-running bulk import can block bulk imports with higher operational priority for significant periods.
To better support bulk-import-heavy applications, it would be nice if Accumulo offered additional mechanisms for controlling the scheduling and execution of bulk imports, such as the ability to:
- Pause/resume a bulk import in progress.
- Prioritize/reprioritize bulk import requests.
- Cancel a bulk import in progress. If possible, cancelling a partially completed bulk import request should roll back its changes. That is, a bulk import should either succeed or make no changes.
Additionally, for multitenant situations, it would be nice if Accumulo would:
- Provide multiple queues for bulk import requests. Each queue would have its requests worked serially in priority order. Requests in separate queues should be worked in parallel, or have time distributed among the queues in a manner that makes the work appear roughly parallel.
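To make the queueing behavior above concrete, here is a minimal in-memory sketch: one priority queue per named queue, with requests handed out round-robin across queues. Everything in it (the `BulkImportQueues` class, its `submit` and `next` methods, the integer priorities) is hypothetical and not part of Accumulo. It only illustrates the scheduling order, not parallel execution or persistence.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch; none of these types exist in Accumulo.
final class BulkImportQueues {
    // One queued request. Higher priority is served first; seq keeps FIFO order among equals.
    static final class Request {
        final String id;
        final int priority;
        final long seq;

        Request(String id, int priority, long seq) {
            this.id = id;
            this.priority = priority;
            this.seq = seq;
        }
    }

    private static final Comparator<Request> ORDER =
        Comparator.<Request>comparingInt(r -> r.priority).reversed()
                  .thenComparingLong(r -> r.seq);

    private final Map<String, PriorityQueue<Request>> queues = new HashMap<>();
    private final List<String> names = new ArrayList<>(); // queue names in creation order
    private int cursor = 0;   // index of the queue to try first on the next call
    private long nextSeq = 0;

    // Adds a request to the named queue, creating the queue on first use.
    void submit(String queue, String id, int priority) {
        PriorityQueue<Request> q = queues.get(queue);
        if (q == null) {
            q = new PriorityQueue<>(ORDER);
            queues.put(queue, q);
            names.add(queue);
        }
        q.add(new Request(id, priority, nextSeq++));
    }

    // Returns the id of the next request to work, visiting queues round-robin,
    // or null when every queue is empty.
    String next() {
        for (int i = 0; i < names.size(); i++) {
            int idx = (cursor + i) % names.size();
            Request r = queues.get(names.get(idx)).poll();
            if (r != null) {
                cursor = (idx + 1) % names.size();
                return r.id;
            }
        }
        return null;
    }
}
```

Reprioritizing a request would amount to removing it from its queue and re-adding it with a new priority. A real implementation would also need to persist this state so that it survives a manager restart.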
Implementation-wise, I'm thinking of rewriting much of the current bulk-loading logic. While the current logic depends upon multiple threads executing (potentially long-duration) blocking RPC calls, I'd like to move to a more event-driven/message-passing model backed by a persistent state machine.
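As a rough illustration of what a persistent state machine for a bulk import might look like, the sketch below drives each import through explicit states in response to events, writing every transition to a store. The state and event names, the `BulkImportStateMachine` class, and the plain `Map` standing in for durable storage are all assumptions made for the example, not existing Accumulo code.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch; states, events, and the store are illustrative only.
final class BulkImportStateMachine {
    enum State { QUEUED, RUNNING, PAUSED, ROLLING_BACK, COMPLETED, CANCELLED }
    enum Event { START, PAUSE, RESUME, CANCEL, FINISHED, ROLLED_BACK }

    // Allowed transitions: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TRANSITIONS = new EnumMap<>(State.class);

    static {
        allow(State.QUEUED, Event.START, State.RUNNING);
        allow(State.QUEUED, Event.CANCEL, State.CANCELLED);
        allow(State.RUNNING, Event.PAUSE, State.PAUSED);
        allow(State.RUNNING, Event.FINISHED, State.COMPLETED);
        allow(State.RUNNING, Event.CANCEL, State.ROLLING_BACK);
        allow(State.PAUSED, Event.RESUME, State.RUNNING);
        allow(State.PAUSED, Event.CANCEL, State.ROLLING_BACK);
        allow(State.ROLLING_BACK, Event.ROLLED_BACK, State.CANCELLED);
    }

    private static void allow(State from, Event on, State to) {
        TRANSITIONS.computeIfAbsent(from, k -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    // Stand-in for durable storage (assumption: a real version would write to a table or ZooKeeper).
    private final Map<String, State> store;

    BulkImportStateMachine(Map<String, State> store) {
        this.store = store;
    }

    // Records a new import in the QUEUED state.
    void register(String importId) {
        store.put(importId, State.QUEUED);
    }

    // Applies an event, persists the resulting state, and returns it.
    // Throws IllegalStateException if the event is not valid in the current state.
    State apply(String importId, Event event) {
        State current = store.get(importId);
        if (current == null) {
            throw new IllegalArgumentException("unknown bulk import: " + importId);
        }
        Map<Event, State> allowed = TRANSITIONS.get(current);
        State next = (allowed == null) ? null : allowed.get(event);
        if (next == null) {
            throw new IllegalStateException(event + " is not valid in state " + current);
        }
        store.put(importId, next);
        return next;
    }
}
```

Because the current state lives in the store rather than in a blocked thread, a new `BulkImportStateMachine` built over the same store can pick up a paused or half-cancelled import where the previous one left off.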
Current ideas I'm playing around with (very tentative):
- Creating a new table accumulo.bulk_load_queues to keep track of bulk load progress.
- Distributing bulk load orchestration via a mechanism similar to tablet assignment instead of the current blocking RPC calls (LoadFiles.java:156).
- Implementing something akin to a two-phase commit to achieve rollback behavior on failure.
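The two-phase-commit idea in the last item could look roughly like the sketch below: every participant first stages the import without making it visible, and the files only become visible once all participants have staged successfully. If any participant fails to stage, the ones that already staged are rolled back. The `Participant` interface, the `InMemoryTablet` class, and the `run` method are hypothetical names chosen for illustration, and the sketch leaves out the durable decision log and retry handling that a real protocol would need.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a two-phase bulk load; not Accumulo code.
final class TwoPhaseBulkLoad {
    interface Participant {
        boolean prepare(String importId); // stage files; return false to vote "abort"
        void commit(String importId);     // make staged files visible
        void rollback(String importId);   // discard staged files
    }

    // Returns true if the import was committed everywhere, false if it was rolled back.
    static boolean run(String importId, List<Participant> participants) {
        List<Participant> prepared = new ArrayList<>();
        // Phase 1: ask every participant to stage the import.
        for (Participant p : participants) {
            boolean ok;
            try {
                ok = p.prepare(importId);
            } catch (RuntimeException e) {
                ok = false;
            }
            if (!ok) {
                // Undo the participants that already staged. The failing participant
                // is assumed to have cleaned up after itself.
                for (Participant done : prepared) {
                    done.rollback(importId);
                }
                return false;
            }
            prepared.add(p);
        }
        // Phase 2: everyone voted yes, so make the import visible.
        for (Participant p : participants) {
            p.commit(importId);
        }
        return true;
    }

    // Toy participant that keeps staged and visible import ids in memory.
    static final class InMemoryTablet implements Participant {
        private final boolean failPrepare;
        final Set<String> staged = new HashSet<>();
        final Set<String> visible = new HashSet<>();

        InMemoryTablet(boolean failPrepare) {
            this.failPrepare = failPrepare;
        }

        public boolean prepare(String importId) {
            if (failPrepare) {
                return false;
            }
            staged.add(importId);
            return true;
        }

        public void commit(String importId) {
            if (staged.remove(importId)) {
                visible.add(importId);
            }
        }

        public void rollback(String importId) {
            staged.remove(importId);
        }
    }
}
```

The same rollback path could serve user-initiated cancellation: a cancel request received before phase 2 starts would simply trigger `rollback` on every participant that had already staged.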