+1 from this casual observer over from Mahout-land (nobody ever seems to believe me that this would make Hadoop programmers soooooo much more efficient).
I've written a half-baked, bug-ridden, inefficient version of this several times in the past, and it would be so useful to have done right.
An API which essentially wrapped a SequenceFile&lt;K,V&gt; and allowed you to do things like:
Path dataPath = new Path("hdfs://foo/bar");
PTable<K,V> data = new PTable<K,V>(dataPath);
LightWeightMap<K,V,KOUT,VOUT> map = new MyMapper();
PTable<KOUT,VOUT> transformedData = data.parallelDo(map);
etc. would be awesome.
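To make the sketch above concrete, here's a minimal, in-memory version of what such a wrapper could look like. All of the names (PTable, LightWeightMap) come from the snippet, but the backing List standing in for the SequenceFile, the records() accessor, and the WrapperSketch driver class are illustrative assumptions, not any real Hadoop API:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

// Hypothetical, in-memory sketch of the wrapper API. A List of entries
// stands in for the SequenceFile<K,V>; a real version would read/write HDFS.
interface LightWeightMap<K, V, KOUT, VOUT> {
    Entry<KOUT, VOUT> map(K key, V value);
}

class PTable<K, V> {
    private final List<Entry<K, V>> records;

    PTable(List<Entry<K, V>> records) { this.records = records; }

    // Apply the user's map function to every record, yielding a new table.
    <KOUT, VOUT> PTable<KOUT, VOUT> parallelDo(LightWeightMap<K, V, KOUT, VOUT> fn) {
        List<Entry<KOUT, VOUT>> out = new ArrayList<>();
        for (Entry<K, V> e : records) {
            out.add(fn.map(e.getKey(), e.getValue()));
        }
        return new PTable<>(out);
    }

    List<Entry<K, V>> records() { return records; }
}

public class WrapperSketch {
    public static void main(String[] args) {
        PTable<String, Integer> data = new PTable<>(
                List.of(new SimpleEntry<>("a", 1), new SimpleEntry<>("b", 2)));
        // Double each value, keeping the key.
        PTable<String, Integer> doubled = data.parallelDo(
                (k, v) -> new SimpleEntry<>(k, v * 2));
        System.out.println(doubled.records()); // [a=2, b=4]
    }
}
```

Since LightWeightMap has a single abstract method, user mappers can just be lambdas, which keeps the call sites about as short as the snippet above.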
Of course, the real trick is writing a good optimizer which can figure out how to squish together separate M/R steps into one. For example, parallelDo() returns a PCollection, which you might then call groupByKey() on; those two operations could often be combined into the Map and Reduce steps of a single job.
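The fusion idea can be sketched with a deferred-execution plan: each API call just records a logical operation, and the planner walks the recorded ops and emits one MapReduce job wherever a parallelDo is immediately followed by a groupByKey. The Op enum, the job labels, and the PlanSketch class are all illustrative assumptions, not a real planner:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: calls record logical operations instead of launching
// jobs, and a trivial optimizer fuses adjacent parallelDo + groupByKey pairs
// into a single map/reduce job.
public class PlanSketch {
    enum Op { PARALLEL_DO, GROUP_BY_KEY }

    static List<String> optimize(List<Op> plan) {
        List<String> jobs = new ArrayList<>();
        for (int i = 0; i < plan.size(); i++) {
            if (plan.get(i) == Op.PARALLEL_DO
                    && i + 1 < plan.size()
                    && plan.get(i + 1) == Op.GROUP_BY_KEY) {
                jobs.add("map+reduce"); // fused into one job
                i++;                    // consumed both ops
            } else if (plan.get(i) == Op.PARALLEL_DO) {
                jobs.add("map-only");
            } else {
                jobs.add("identity-map+reduce"); // bare shuffle still costs a job
            }
        }
        return jobs;
    }

    public static void main(String[] args) {
        List<Op> plan = List.of(Op.PARALLEL_DO, Op.GROUP_BY_KEY, Op.PARALLEL_DO);
        System.out.println(optimize(plan)); // [map+reduce, map-only]
    }
}
```

A real optimizer (à la FlumeJava) does much more, but even this greedy one-pass fusion turns the common parallelDo-then-groupByKey pipeline into a single job instead of two.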