I have two distinct use-cases where running TaskTrackers alongside Cassandra nodes does not accomplish our goals:
1. Joining data. We have a large data set in cassandra, true, but we have a much larger data set held in Hadoop itself (around 4 orders of magnitude larger in hadoop than in cassandra). We need to join the two datasets together, and use the output from that join to feed multiple systems, none of which are cassandra. Since the data in Hadoop is so much larger than that in Cassandra, we have to bring the Cassandra data to hadoop, not the other way around. Because of security concerns, we can't spread our hadoop data onto our cassandra nodes (even if that didn't screw with our capacity planning), so we have no other choice but to move the Cassandra data (in small chunks) onto Hadoop. Why not use HBase, you say? We needed Cassandra for its write performance for other problems than this one.
1. Offline, incremental backups. We have a large volume of time-series data held in Cassandra, and taking nightly snapshots and moving them to our archival center is prohibitively slow-
it turns out that moving RF copies of our entire dataset over a leased line every night is a pretty bad idea. Instead, I use MapReduce to take an incremental backup of a much smaller subset of the data, then move that. That way, we not only are not moving the entire data set, but we are also using Cassandra's consistency mechanisms to resolve all the replicas. The only efficient way I've found to do this is via MapReduce (we use the Random Partitioner), and since it's an offline backup, we need to move it over the network anyway-may as well use the optimized network connecting Hadoop and Cassandra instead of the tiny pipe connecting cassandra to our archival center.
Both of these reasons dictate that we not run a TT alongside our Cassandra nodes, no matter what the recommended approach is. In this case, we need a strong, fault-tolerant CFIF to serve our purposes.