Wrote a new implementation that gives ~5x speedup and better scalability, while the algorithm doesn't change.
1. Use `(Int, Int, Float)` as input, saving 4 bytes on rating.
2. Out link blocks are stored as `out: Array[Array[Int]]`, where out(dstBlockId) contains the src indices (local to the src block) associated with the dstBlockId.
3. In link blocks are stored in a CSC format:
`dstEncodedIndices` contains encoded `dstBlockId` (high bits) and `dstLocalIndex` (low bits). Using this data structure, the subproblems can be solved one after another without allocating many AtA/Atb buffers.
4. The input ratings are stored in small batches to avoid ser/de overhead.
5. LAPACK's dppsv is used instead of dposv. The former only needs the triangular part. Double is used for constructing the normal equation for accuracy.
6. Use `TimSort` to create the `InBlock`.
I will share the code soon, which is a little messy at this time.