The first half of the sentence seems to be incomplete.
I think I understand what you mean - such KVs would have been written by a previous writer.
Missing author, date, and JIRA pointer?
I think you need to say stripe == sub-range of the region key range. You almost do. Just do it explicitly.
What does this mean "and old boundaries rarely, if ever, moving."? Give doc an edit?
Say in doc that you mean storefile metadata else it is ambiguous.
Not sure I follow here: "This compaction is performed when the number of L0 files exceeds some threshold and produces the number of files equivalent to the number of stripes, with enforced existing boundaries."
An interesting comment by LarsH recently was that maybe we should ship w/ major compactions off; most folks don't delete
Hmm... in general I agree but we'll have to insert really good warnings everywhere. Can we detect if they delete?
Missing is at least a pointer to how it currently works (could just point at the src file, I'd say, with its description of 'sigma' compactions) and a sentence on what's wrong w/ it
Later I suppose we could have a combination of count-based and size-based.... if an edge stripe is N times bigger than any other, add a new stripe?
Yeah, it's mentioned in code comment somewhere.
I was wondering if you could make use of Liang Xie's bit of code for making keys for the block cache, where he chooses a byte sequence that falls between the last key in the former block and the first in the next block but is shorter than either..... but it doesn't make sense here, I believe; your boundaries have to be hard actual keys given inserts are always coming in.... so nevermind this suggestion.
For boundary determination it does make sense; can you point at the code? After cursory look I cannot find it.
You write the stripe info to the storefile. I suppose it is up to the hosting region whether or not it chooses to respect those boundaries. It could ignore them and just respect the seqnum and we'd have the old-style storefile handling, right? (Oh, I see you allow for this – good)
Thinking on L0 again, as has been discussed, we could have flushes skip L0 and flush instead to stripes (one flush turns into N files, one per stripe) but even if we had this optimization, it looks like we'd still want the L0 option if only for bulk loaded files or for files whose metadata makes no sense to the current region context. "• The aggregate range of files going in must be contiguous..." Not sure I follow. Hmm... could do with ".... going into a compaction"
Yes, that was my thinking too.
"If the stripe boundaries are changed by compaction, the entire stripes with old boundaries must be replaced" ...What would bring this on? And then how would old boundaries get redone? This one is a bit confusing.
Clarified. Basically one cannot have 3 files in (-inf, 3) and 3 in [3, inf), then take 3 and 2 respectively, and rewrite them with boundary 4, because then there will be a file with [3, inf) remaining that overlaps.
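This invariant can be sketched in a few lines (an illustrative Python toy, not HBase code; keys are ints and file ranges are half-open, which is an assumption of the sketch):

```python
NEG_INF, POS_INF = float("-inf"), float("inf")

def overlap(a, b):
    """True if two half-open [start, end) key ranges overlap."""
    return max(a[0], b[0]) < min(a[1], b[1])

# Two stripes, three files each (each file spans its stripe's range).
left  = [(NEG_INF, 3)] * 3
right = [(3, POS_INF)] * 3

# Rewrite all left files but only two right files with new boundary 4.
remaining = right[2:]                         # one [3, inf) file left behind
new_stripes = [(NEG_INF, 4), (4, POS_INF)]

# The leftover file overlaps the new (-inf, 4) stripe -- invalid layout,
# which is why whole stripes must be replaced when boundaries change.
assert overlap(remaining[0], new_stripes[0])
```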
I was going to suggest an optimization for later, for the case where an L0 file fits fully inside a stripe: I was thinking you could just 'move' it into its respective stripe... but I suppose you can't do that because you need to write the metadata to put a file into a stripe...
Yeah. Also wouldn't expect it to be a common case.
Would it help naming files for the stripe they belong to? Would that help? In other words, do NOT write stripe data to the storefiles and just let the region in memory figure out which stripe a file belongs to. When we write, we write with, say, an L0 suffix. When compacting we add S1, S2, etc. suffixes for stripe1, etc. To figure out what the boundaries of an S0 are, it'd be something the region knew. On open of the store files, it could use the start and end keys that are currently in the file metadata to figure out which stripe they fit in.
Would be a bit looser. Would allow moving a file between stripes with a rename only. The delete dropping section looks right. I like the major compaction along a stripe only option.
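The "figure stripes from file metadata on open" step could look roughly like this (a hypothetical sketch, not HBase code; `assign_stripe`, integer keys, and half-open stripe ranges are all assumptions of the illustration):

```python
import bisect

def assign_stripe(first_key, last_key, boundaries):
    """Return the stripe index if the file's [first_key, last_key] range
    fits entirely inside one stripe, else None (i.e. keep it in L0).
    boundaries are the interior split keys: [b] defines stripes
    (-inf, b) and [b, +inf)."""
    i = bisect.bisect_right(boundaries, first_key)
    j = bisect.bisect_right(boundaries, last_key)
    return i if i == j else None

# boundaries [10, 20] => stripes (-inf,10), [10,20), [20,+inf)
assert assign_stripe(3, 9, [10, 20]) == 0
assert assign_stripe(10, 19, [10, 20]) == 1
assert assign_stripe(5, 15, [10, 20]) is None  # straddles a boundary -> L0
```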
This could be done as a future improvement. The implications of a change of naming scheme for other parts of the system would need to be determined.
Also, for all I know it might break snapshots (moving files does). And code to figure out stripes on the fly would be more complex.
"For empty ranges, empty files are created." Is this necessary? Would be good to avoid doing this.
Let me think about this...
The total i/o in terms of i/o bandwidth consumed is the same. But the disk iops are much, much worse. And disk iops are at a premium, and "bg activity" like compactions should consume as few as possible.
Let's say we split a region into a 100 sub-regions, such that each sub-region is in the few 10's of MB. If the data is written uniformly randomly, each sub-region will write out a store at approx the same time. That is, a RS will write 100x more files into HDFS (100x more random i/o on the local file-system). Next, all sub-regions will do a compaction at almost the same time, which is again 100x more read iops to read the old stores for merging.
The memstore for the region is preserved as unified... it may indeed be written out to multiple files in future.
One can try to stagger the compactions to avoid the sudden burst by incorporating, say, a queue of to-be-compacted-subregions. But while the sub-regions at the head of the queue will compact "in time", the ones at the end of the queue will have many more store files to merge, and will use much more than their "fair-share" of iops (not to mention that the read-amplification in these sub-regions will be higher too). The iops profile will be worse than just 100x.
In the current implementation the region is limited to one compaction at a time, mostly for simplicity's sake. Yes, if all stripes compact at the same time under the uniform scheme, all improvement will disappear; this will have to be controlled if the ability to do so is added.
You have a point that we will be making more files in the fs.
Yeah, that is inevitable.
I heard from someone on Accumulo that they have tons of files open without any problems... it may make sense to investigate whether we'd have problems.