Thanks Andrew for the nice thoughts.
I saw in a comment that the raw coders are stateless ...
Sorry I think this should be clarified a little bit and it's partly true: 1) When I said it's stateless, I meant given a coder instance, the encode/decode call is stateless, that is to say, for a certain group of data to encode/decode, we can have concurrent threads to call encode/decode so to speed up. 2) The coder instance has states, specific to schema, configuration (and erasure parameters for decoder).
The biggest concern to merge encoder and decoder together is, the states can be mixed and incur some runtime overhead. For example, both encoder and decoder can maintain big coding matrix in core cache; when merged, the memory consumption can be double (yes it can be avoided by different code path but then complicated). In HDFS side, a coder instance just serves one role, either encode or decode, not both the same time.
Maybe this is what Colin wanted, since the factory classes look trivial by themselves.
I guess you're right, that's most likely what Colin thought. The factory classes are trivial right now, but I think it can evolve to make more sense: 1) In
HADOOP-13665 I think we can do the fallback thing here elegantly, where encoder/decoder creating can fail and it can try next one configured in a list; 2) Quite some time ago when I played with micor benchmarking of the coders, I found cache coder instance can help in performance, and it's good to do it here than elsewhere like in CodecUtil.
I experimented by putting the NativeRS raw encoder and raw decoder into their Factory class, and it looks okay since they're pretty small.
It's a very interesting try. Yes the native RS encoder/decoder are small, for other coders they may not. I thought coders and coder factories may evolve to be bigger in future, for coders if we want to support incremental encoding/decoding, then more codes will be added. As HH codec indicates, if any party supports complex codec, the encode/decode logic can be much complex.
We also should rename RSRawEncoderLegacy to RSLegacyRawEncoder ...
it seems like we should be creating via the appropriate factory whenever possible.
Can't agree any more.
Overall though I think the current system is okay. The factory is the single entry point to configuring a RawCoder.
I'm glad the current system works for you. Do you think it's good to fix above problems for this issue?