I think there's still something here that I'm not totally getting, or I'm looking at this from the wrong angle.
Taking two different cases of the same kind of operation, I think we need to be able to distinguish how we want each of them to be handled, as follows.
In this example, we want the DoFns leading up to the toBundle() call to be run in a separate job:
PTable<K,V> hugeTable = pipeline.read(...);
PTable<K,V> muchSmallerTable = hugeTable.parallelDo(myFilterFn);
PTable<K, Pair<U, V>> joined = new MapsideJoinStrategy().join(left, muchSmallerTable);
and in this example, we want the DoFns to be run in memory, reading smallTable in directly from the Source:
PTable<K,V> smallTable = pipeline.read(...);
PTable<K,V> filteredSmallTable = smallTable.parallelDo(myFilterFn);
PTable<K, Pair<U, V>> joined = new MapsideJoinStrategy().join(left, filteredSmallTable);
I think the point is that the API needs a way to mark a spot where we can say "everything from here on will be run in memory". We could do that with something on ParallelDoOptions, but I think we run into the same problem again: it's hard to define what will (or even what should) happen when you want to write a PCollection to storage if it has in-memory operations defined somewhere further upstream in the pipeline.
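For concreteness, here's a rough sketch of what the ParallelDoOptions version might look like. The runInMemory() builder method is hypothetical (nothing like it exists in the API today); the builder itself and the parallelDo overload that takes a ParallelDoOptions argument are real:

// Hypothetical sketch: a flag on ParallelDoOptions that marks this DoFn,
// and everything downstream of it, as eligible to run in memory at
// planning time instead of inside a MapReduce job.
ParallelDoOptions inMemoryOpts = ParallelDoOptions.builder()
    .runInMemory()   // hypothetical method, not in the current API
    .build();

PTable<K,V> smallTable = pipeline.read(...);
// myFilterFn, tableType, and left are placeholders, as in the examples above
PTable<K,V> filteredSmallTable = smallTable.parallelDo(
    "filter", myFilterFn, tableType, inMemoryOpts);
PTable<K, Pair<U, V>> joined = new MapsideJoinStrategy().join(left, filteredSmallTable);

Even with that, the planner still has to decide what pipeline.write(joined, someTarget) means when part of joined's upstream graph is flagged as in-memory, which is exactly the ambiguity above.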
FWIW, I'm pretty much convinced that the MaterializedPCollection approach isn't the way to go for this.