Currently, Oak's Lucene support copies index files to the local file system as part of the CopyOnRead feature. On one setup, the index logic was observed to fail with the following error:
Here the size of _2ala.cfs differed from the remote copy, and other index files may have had the same size but different content. Comparing the modified times of the local files with those in Oak shows that the file on the file system was older than the one in Oak.
On the same setup, the system also saw a rollback in the segment node store.
So one possible cause is the following sequence:
- At some time before 17:17 the Lucene index got updated and _2ala.cfs was created.
- After the update, the head revision in the segment store was updated, but the revision had yet to make it to the journal log.
- The Lucene CopyOnRead logic got an event for the change and copied the file.
- The system crashed, and hence the journal did not get updated.
- The system restarted and, per the last entry in the journal, suffered some "data loss"; hence the index checkpoint also moved back.
- As the checkpoint got reverted, indexing started from an earlier state and hence created a new file with the same name, _2ala.cfs.
- CopyOnRead detected the file length change and logged a warning, routing the call to the remote copy.
- However, other files such as _2ala.si and _2ala.cfe, which were created in the same commit, had the same size but likely different content, which later caused Lucene queries to start failing.
In such a case, a restart after cleaning the existing local index content would have brought the system back to a normal state.
So as a fix, we need to come up with some sanity check at system startup.
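
One possible shape for such a check, as a minimal sketch only: compare each locally copied file with its remote counterpart by both length and checksum, so that files like _2ala.si, which have the same size but different content, are also caught. The class, the FileMeta type and the method names below are hypothetical and not part of Oak's API. CRC32 is an assumption chosen for illustration, and how the metadata of the remote files would be obtained is left open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** Hypothetical sketch of a startup sanity check; not Oak API. */
final class LocalIndexSanityCheck {

    /** Length and checksum of a single index file. */
    static final class FileMeta {
        final long length;
        final long checksum;

        FileMeta(long length, long checksum) {
            this.length = length;
            this.checksum = checksum;
        }

        /** Builds the metadata from file content (CRC32 is an assumed choice). */
        static FileMeta of(byte[] content) {
            CRC32 crc = new CRC32();
            crc.update(content, 0, content.length);
            return new FileMeta(content.length, crc.getValue());
        }
    }

    /**
     * Returns the names (sorted) of local files whose length or checksum
     * differs from the remote file with the same name. Local files without
     * a remote counterpart are skipped, as there is nothing to compare with.
     */
    static List<String> findStaleFiles(Map<String, FileMeta> local,
                                       Map<String, FileMeta> remote) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, FileMeta> e : new TreeMap<>(local).entrySet()) {
            FileMeta r = remote.get(e.getKey());
            if (r == null) {
                continue;
            }
            FileMeta l = e.getValue();
            // A length-only comparison would miss same-size files with different content.
            if (l.length != r.length || l.checksum != r.checksum) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }
}
```

If the returned list is non-empty at startup, the local copy could be discarded and fetched again from the remote, which corresponds to the manual cleanup described above.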