Details
Type: Improvement
Status: Resolved
Priority: Low
Resolution: Won't Fix
Description
Currently CFIF (ColumnFamilyInputFormat) divides the wrap-around split into two non-wrap-around splits. While this simplifies the CFRR (ColumnFamilyRecordReader) implementation, the approach has several minor downsides:
- One of the splits can be extremely small. One of our (picky) customers suspected a bug because one of his map tasks executed in 1 second, while all the rest took minutes. A very small task also wastes resources: more resources go to launching the task than to doing any real work.
- The number of map tasks is always one more than (expected rows / cassandra.input.split.size), so it is always >= 2. This confuses customers.
- Progress reporting for the two parts of the divided split is inaccurate. Even if the splits are similar in size, the progress bar goes to about 50% and then immediately to 100%, because the size of each part cannot be estimated properly (the size estimation is done before the wrap-around is removed).
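
To make the first two points concrete, here is a minimal sketch of the behaviour described above. It is not the actual CFIF code: the class `WrapAroundSplitSketch`, the `Range` type, the signed 64-bit token bounds, and the rounding-up in `mapTasks` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical model of the split handling, not Cassandra code.
class WrapAroundSplitSketch {
    // Assumed ring bounds, modeled on a signed 64-bit token space.
    static final long MIN_TOKEN = Long.MIN_VALUE;
    static final long MAX_TOKEN = Long.MAX_VALUE;

    /** A token range (start, end]; treated as wrap-around when start > end. */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        boolean wraps() {
            return start > end;
        }

        @Override
        public String toString() {
            return "(" + start + ", " + end + "]";
        }
    }

    /** Divides a wrap-around range into two non-wrap-around ranges. */
    static List<Range> unwrap(Range r) {
        List<Range> parts = new ArrayList<>();
        if (!r.wraps()) {
            parts.add(r); // nothing to divide
            return parts;
        }
        parts.add(new Range(r.start, MAX_TOKEN)); // tail of the ring
        parts.add(new Range(MIN_TOKEN, r.end));   // head of the ring
        return parts;
    }

    /** Task count as described in the ticket; rounding up is an assumption. */
    static long mapTasks(long expectedRows, long splitSize) {
        long expectedSplits = (expectedRows + splitSize - 1) / splitSize;
        return expectedSplits + 1; // the divided wrap-around split adds one task
    }

    public static void main(String[] args) {
        // A wrap-around range that starts only 10 tokens before the end of the ring:
        // the first part covers 10 tokens, the second covers 1,000,000.
        Range wrapping = new Range(MAX_TOKEN - 10, MIN_TOKEN + 1_000_000);
        for (Range part : unwrap(wrapping)) {
            System.out.println(part);
        }
        // Even a data set smaller than one split yields two map tasks.
        System.out.println(mapTasks(100, 65536));
    }
}
```

In this model the first part of the divided range covers only 10 tokens, which is the kind of tiny split that produces a 1-second map task, and any data set that fits in a single split still produces two tasks.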