Glad to see it works! FWIW, I think we could get another slight bump by making cases 1-8 do only a single read each, and then adjusting the right-shift value accordingly to filter out the extra bytes read. That would save on the bounds checks there. In my previous encoding work, we found memory access was so fast that it was better to read more than to have any conditionals. The only caveat is that the encoder would need to ensure there are always 2 extra bytes at the end (so cases 3, 5 and 7 would read an extra byte, and case 6 would read 2 extra bytes).
Case 9 always requires an extra read. But really, it seems like the encoder should never use a value that could cause that? If my math is correct, I believe it can only happen when bpv is 57-63, and the space savings at those widths would be mostly negligible compared to 64.