The approach that most of the projects in Mahout are taking is that non-parallel implementations are just as welcome as parallel ones.
For most algorithms, it appears that only part of the problem really needs parallelism. Even in the few cases where there is a significant computation that needs to be parallelized, it is still very helpful to have a good non-parallel implementation of basic matrix operations, because block decomposition is generally the best method for these problems and each block is then handled by ordinary sequential code.
Thus, it seems to be a good idea to get basic sequential matrix operations in order before jumping into parallel versions.
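To make the block decomposition point concrete, here is a rough sketch in plain Python. It is not Mahout code; the names `matmul`, `block` and `blocked_matmul` are made up for illustration, and the matrices are assumed to be dense lists of rows. The thing to notice is that the blocked version does nothing but call the sequential multiply on small pieces, and each of those calls is the unit of work you would hand to a parallel framework.

```python
def matmul(a, b):
    """Plain sequential multiply of two dense matrices stored as lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            aip = a[i][p]
            for j in range(m):
                c[i][j] += aip * b[p][j]
    return c


def block(mat, r0, r1, c0, c1):
    """Copy out the sub-matrix mat[r0:r1, c0:c1]; slices past the edge are clipped."""
    return [row[c0:c1] for row in mat[r0:r1]]


def blocked_matmul(a, b, bs):
    """Multiply a and b by splitting them into blocks of size at most bs x bs.

    Every (i0, j0, p0) triple below is an independent task. Here the tasks run
    one after another, but each one only needs the sequential matmul above.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for p0 in range(0, k, bs):
                partial = matmul(block(a, i0, i0 + bs, p0, p0 + bs),
                                 block(b, p0, p0 + bs, j0, j0 + bs))
                # Add the partial product into the matching block of the result.
                for di, row in enumerate(partial):
                    for dj, v in enumerate(row):
                        c[i0 + di][j0 + dj] += v
    return c
```

In a real system the three outer loops would become the distributed part and the summing of partial blocks would become a reduce step, but the inner kernel stays the same sequential code.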
Even where parallelism has been necessary, it is common that the operations required are not exactly the same as normal matrix operations. For instance, in recommendation systems, cooccurrence between items needs to be computed. This looks a lot like a sparse matrix multiply, but it is very handy to be able to inject functionality into the inner accumulation loop. Similarly, large scale sequence comparison looks a lot like matrix multiply on the surface, but the details in the inner loop don't work that way.
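As a sketch of what "inject functionality into the inner accumulation loop" could look like, here is a small sparse multiply in plain Python where the multiply and the accumulate steps are both passed in as functions. Everything here is hypothetical (the names `sparse_multiply`, `transpose`, `combine`, `accumulate`, and the dict-of-dicts layout are my assumptions for illustration, not any Mahout interface). With the defaults it is an ordinary sparse product, so multiplying the transposed preference matrix by the preference matrix gives item cooccurrence counts.

```python
def transpose(m):
    """Transpose a sparse matrix stored as {row: {col: value}}."""
    t = {}
    for i, row in m.items():
        for j, v in row.items():
            t.setdefault(j, {})[i] = v
    return t


def sparse_multiply(a, b,
                    combine=lambda x, y: x * y,
                    accumulate=lambda acc, v: acc + v,
                    init=0):
    """Sparse product of a ({i: {k: x}}) and b ({k: {j: y}}).

    combine replaces the scalar multiply and accumulate replaces the running
    sum, so the caller controls what happens in the inner loop.
    """
    c = {}
    for i, row in a.items():
        out = {}
        for k, x in row.items():
            for j, y in b.get(k, {}).items():
                out[j] = accumulate(out.get(j, init), combine(x, y))
        if out:
            c[i] = out
    return c


def cooccurrence(prefs, **inner):
    """Item-by-item cooccurrence from prefs = {user: {item: 1}}.

    Any keyword arguments are passed through to sparse_multiply.
    """
    return sparse_multiply(transpose(prefs), prefs, **inner)
```

For example, calling `cooccurrence(prefs, accumulate=lambda acc, v: min(acc + v, 2))` would cap every count at 2 without touching the loop structure. A different `accumulate` could just as well keep a running maximum or some other statistic.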
In my own work, I have found that it is most useful to use something like Pig to do parallel joins (aka matrix multiplication), inject my code into the inner loop, and then use simpler methods to process the results. Your mileage will vary, of course.