Adding L1 and L2 penalties to an SGD neural net code comes almost for free, and I would recommend it.
The trick is that you can't depend on the gradient being sparse, so you can't use lazy regularization. Léon Bottou describes a stochastic full regularization with an adjusted learning rate which should perform comparably. He mostly talks about weight decay (which is L2 regularization); that can be handled cleverly by keeping a scalar multiplier alongside the weight vector, since the decay rescales every weight uniformly. I think L1 is important, but it requires something like truncated constant decay (shrink each weight toward zero by a constant, truncating at zero), which is sign-dependent per coordinate and so can't be done with a multiplier.
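To make the distinction concrete, here is a minimal sketch of the two update styles (the names and structure are my own illustration, not Bottou's code): L2 decay folds into one scalar multiply, while the L1 truncation has to touch every coordinate.

```python
import numpy as np

class L2ScaledWeights:
    """Represent w = scale * v so L2 weight decay costs one scalar multiply."""
    def __init__(self, dim):
        self.scale = 1.0
        self.v = np.zeros(dim)

    def weights(self):
        return self.scale * self.v

    def sgd_step(self, grad, eta, lam2):
        # Weight decay w <- (1 - eta*lam2) * w touches only the scalar.
        self.scale *= (1.0 - eta * lam2)
        # A gradient step on w translates to v <- v - (eta / scale) * grad,
        # so that scale * v equals (1 - eta*lam2) * w_old - eta * grad.
        self.v -= (eta / self.scale) * grad


def l1_truncated_step(w, grad, eta, lam1):
    """Gradient step, then shrink each weight toward zero, truncating at zero.

    The shrinkage is sign-dependent and per-coordinate (soft thresholding),
    which is why it cannot be folded into a single scalar multiplier.
    """
    w = w - eta * grad
    return np.sign(w) * np.maximum(np.abs(w) - eta * lam1, 0.0)
```

One practical note on the multiplier trick: after many steps the scalar can underflow toward zero, so a real implementation would occasionally fold the scale back into the vector and reset it to 1.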