Apache Cassandra
CASSANDRA-5677

Performance improvements of RangeTombstones/IntervalTree

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Low
    • Resolution: Fixed
    • Fix Version/s: 1.2.7, 2.0 beta 2
    • Component/s: None
    • Labels: None

    Description

      Using range tombstones heavily (i.e. 100-500 range tombstones per row) leads to bad response times.

      After investigation, the culprit appears to be how DeletionInfo objects are merged. Each time a RangeTombstone is added to the DeletionInfo, the whole IntervalTree is rebuilt (so with 100 tombstones in one row, 100 IntervalTree instances are created: the first with one interval, the second with two, ..., the 100th with 100 intervals), making the merge quadratic overall.

      Once the IntervalTree is built, it is not possible to add a new interval. The idea is therefore to replace the implementation with one that supports inserting intervals.
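
      To make the cost concrete, here is a minimal, self-contained sketch of the rebuild-per-insert pattern. It is illustrative only: a sorted array ("FrozenTree") stands in for the real build-once IntervalTree, and all names are invented for the example.

      import java.util.ArrayList;
      import java.util.Comparator;
      import java.util.List;

      // Sketch only: FrozenTree stands in for the build-once IntervalTree.
      // The point is the cost shape of rebuild-per-insert, not the structure.
      class RebuildPerInsertSketch {
          record Interval(int min, int max) {}

          static final class FrozenTree {
              final List<Interval> sorted;
              FrozenTree(List<Interval> intervals) {            // O(i log i) build
                  sorted = new ArrayList<>(intervals);
                  sorted.sort(Comparator.comparingInt(Interval::min));
              }
              // No insert method: adding an interval means rebuilding from scratch.
          }

          public static void main(String[] args) {
              List<Interval> accumulated = new ArrayList<>();
              FrozenTree tree = new FrozenTree(accumulated);
              for (int i = 1; i <= 100; i++) {
                  accumulated.add(new Interval(i, i + 10));
                  tree = new FrozenTree(accumulated);           // rebuilt on every add:
              }                                                 // 100 adds, 100 builds,
              System.out.println(tree.sorted.size());           // O(n^2 log n) overall
          }
      }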

      Attached is a proposed patch which:

      • renames the IntervalTree implementation to IntervalTreeCentered (the renaming is inspired by http://en.wikipedia.org/wiki/Interval_tree)
      • adds a new implementation, IntervalTreeAvl (described at http://en.wikipedia.org/wiki/Interval_tree#Augmented_tree and http://en.wikipedia.org/wiki/AVL_tree)
      • adds a new interface, IIntervalTree, to abstract over the implementations (see the interface sketch after this list)
      • adds a new configuration option (interval_tree_provider) to choose between the two implementations (defaults to the previous IntervalTreeCentered)
      • updates the IntervalTreeTest unit tests to cover both implementations
      • adds a mini benchmark comparing the two implementations (tree creation, point lookup, interval lookup)
      • adds a mini benchmark comparing the two implementations when merging DeletionInfo (which shows a big performance improvement with 500 tombstones per row)
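
      A plausible shape for that IIntervalTree abstraction is sketched below; the method names are guesses for illustration, not necessarily the attached patch's actual API.

      import java.util.List;

      // Hypothetical sketch of an IIntervalTree abstraction; C is the bound
      // type, D the payload. The actual interface in the patch may differ.
      public interface IIntervalTree<C extends Comparable<? super C>, D> {
          /** Payloads of all intervals intersecting [min, max]. */
          List<D> search(C min, C max);

          /** Payloads of all intervals containing the given point. */
          List<D> search(C point);

          /** Number of intervals currently stored. */
          int intervalCount();

          // Insertion presumably lives on the AVL implementation (or a builder),
          // since the centered implementation only supports build-once construction.
      }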

      This patch applies to the 1.2 branch.

      Attachments

        1. 5677-new-IntervalTree-implementation.patch
          87 kB
          Fabien Rousseau
        2. 5677-1.2.txt
          73 kB
          Sylvain Lebresne
        3. 5677-1.2.overlappingfix.txt
          2 kB
          Fabien Rousseau

        Issue Links

          Activity

            jbellis Jonathan Ellis added a comment -

            Why would we want to retain the Centered implementation?

            slebresne Sylvain Lebresne added a comment -

            Why would we want to retain the Centered implementation?

            The centered implementation is generally faster at lookups when ranges overlap, which is exactly what we care about for sstable lookups in DataTracker, so we probably want to keep it for that.

            In fact, I think we should specialize things further for range tombstones: in practice there is no real reason to keep overlapping range tombstones, so all we need to keep is a list of non-overlapping ranges. Furthermore, when we read, we read range tombstones in sorted order (of their start), so we can apply the same kind of optimization we use in ArrayBackedSortedColumns. Also, more than once we do something like:

            for (Column c : cf)
              if (cf.deletionInfo().isDeleted(c) && ...)
                 ...
            

            which for every column ends up doing a lookup against the whole deletionInfo, even though we know isDeleted() will be called in sorted order, so we could optimize that (a sketch follows).
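
            A minimal sketch of that forward-only cursor (hypothetical names; integer column names and parallel sorted arrays stand in for the real DeletionInfo internals):

            // Because isDeleted() is called in sorted column order, a cursor that
            // only moves forward gives amortized O(1) per call over a full scan,
            // instead of a fresh lookup per column.
            final class InOrderDeletionTester {
                private final int[] starts, ends;   // sorted, disjoint tombstone ranges
                private final long[] markedAts;     // deletion timestamp per range
                private int idx = 0;                // cursor, never moves backwards

                InOrderDeletionTester(int[] starts, int[] ends, long[] markedAts) {
                    this.starts = starts;
                    this.ends = ends;
                    this.markedAts = markedAts;
                }

                /** True if a column named `name` written at `timestamp` is deleted.
                 *  Callers must pass names in non-decreasing order. */
                boolean isDeleted(int name, long timestamp) {
                    while (idx < starts.length && ends[idx] < name)
                        idx++;                      // skip ranges entirely before `name`
                    return idx < starts.length
                        && starts[idx] <= name
                        && timestamp <= markedAts[idx];
                }
            }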
            Lastly, our current way of handling range tombstones isn't particularly cheap in terms of allocated objects, and I think we can do better.

            But anyway, I'll post what I have in mind tomorrow (need to clean up the patch).


            slebresne Sylvain Lebresne added a comment -

            So first, let's note how inefficient our current use of the IntervalTree is. I wrote a small benchmark test (1 node, locally, nothing fancy) that does the following:

            • Creates the following table: CREATE TABLE test (k int, v int, PRIMARY KEY (k, v))
            • Inserts N (CQL3) rows for a given (fixed) partition key (so: INSERT INTO test(k, v) VALUES (0, <n>)).
            • Deletes those N (CQL3) rows (DELETE FROM test WHERE k=0 AND v=<n>). This involves inserting a range tombstone (because it's not a compact table).
            • Queries all rows for that partition key (SELECT * FROM test WHERE k=0), thus getting no results. I also ran the same query in reversed order to exercise that code path too.

            I ran that 10 times (with a different partition key for each run) and timed all operations. For N=2K (so pretty small), the results on trunk on my machine are:
                      |         Insertions |          Deletions |              Query |     Reversed query
              --------------------------------------------------------------------------------------------
               Run 0  |           3418.0ms |          36950.6ms |          26100.5ms |          26147.3ms
               Run 1  |           2295.7ms |          36073.0ms |          28388.8ms |          28127.0ms
               Run 2  |           1641.2ms |          36119.4ms |          26953.1ms |          26177.8ms
               Run 3  |           1647.0ms |          30383.9ms |          28118.1ms |          27737.7ms
               Run 4  |           1472.9ms |          35913.1ms |          28172.3ms |          28046.6ms
               Run 5  |            679.8ms |          30472.8ms |          28197.5ms |          27756.0ms
               Run 6  |           1417.5ms |          30428.8ms |          28022.0ms |          27826.3ms
               Run 7  |            657.7ms |          30366.9ms |          28047.5ms |          28081.4ms
               Run 8  |            662.8ms |          30369.6ms |          28123.5ms |          27768.7ms
               Run 9  |            667.2ms |          30459.5ms |          32821.0ms |          32430.0ms
               Avg    |           1456.0ms |          32753.8ms |          28294.4ms |          28009.9ms
               8 last |           1105.8ms |          31814.3ms |          28556.9ms |          28228.1ms
              

            Even ignoring the first 2 runs (to let the JVM warm up), both deletion and query take about 30 seconds each! That's obviously very broken.

            Now, Fabien's patch does fix the brokenness. After rebasing it to trunk (for fairness, since my tests are on trunk), and for N=10K (so 5x more than the previous test; the reason I only used 2K on bare trunk is that it takes too long with 10K), I get:

                    |         Insertions |          Deletions |              Query |     Reversed query
            --------------------------------------------------------------------------------------------
             Run 0  |           3460.4ms |           2575.7ms |             69.7ms |             93.7ms
             Run 1  |           1223.7ms |           1772.9ms |             64.3ms |             57.4ms
             Run 2  |           1416.7ms |            744.3ms |             25.8ms |             27.9ms
             Run 3  |            673.0ms |            298.5ms |             39.3ms |             29.4ms
             Run 4  |            470.5ms |            666.8ms |             31.7ms |             25.4ms
             Run 5  |            303.0ms |            591.8ms |             34.9ms |             26.4ms
             Run 6  |            512.9ms |            293.0ms |             26.3ms |             28.1ms
             Run 7  |            437.2ms |            595.0ms |             39.0ms |             24.8ms
             Run 8  |            295.6ms |            494.2ms |             32.5ms |             23.7ms
             Run 9  |            533.8ms |            258.7ms |             32.7ms |             25.6ms
             Avg    |            932.7ms |            829.1ms |             39.6ms |             36.2ms
             8 last |            580.3ms |            492.8ms |             32.8ms |             26.4ms
            

            So, it's sane again (the query is a lot faster than the writes because my test does the inserts/deletes sequentially, one at a time; I was mostly interested in read time anyway). It's worth noting that our current "centered" interval tree implementation isn't bad in itself; it's just that you can't add new intervals once it's built, which makes it ill-suited for range tombstones (but it's fine for our other use case of storing sstables).

            However, as hinted in my previous comment, we can do better and generally improve our handling of range tombstones by using the following properties:

            1. We don't care about overlapping range tombstones. If we have, say, the range tombstones [0, 10]@3, [5, 8]@1, [8, 15]@4 (which we currently all store as-is), then we'd be fine just storing [0, 8]@3, [8, 15]@4: [5, 8]@1 is fully shadowed by the newer [0, 10]@3, and [8, 15]@4, being newer still, truncates [0, 10]@3 on the right. In fact, storing the latter form is more efficient (we have fewer ranges) and would simplify some things slightly (for the ColumnIndexer, for instance, by knowing it can only have one "open" range tombstone at any time). See the sketch after this list.
            2. During reads, we read range tombstones in sorted order (of their start), so we can use that fact to speed up their insertion into the DeletionInfo, the same way we do for columns in ArrayBackedSortedColumns.
            3. If we have a lot of range tombstones for a column family (which we can), the DeletionInfo can come to represent quite a lot of memory/objects, because each range tombstone is a separate object holding yet another DeletionTime object, plus the IntervalTree structure. We could do something along the lines of CASSANDRA-5019, but it's much easier in this case because our use of range tombstones is much more controlled.
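
            To make property 1 concrete, here is a small self-contained sketch of the "keep only non-overlapping ranges, newest timestamp wins" semantics. It is illustrative only: toy closed integer intervals rather than Cassandra's column-name bounds (which is why the split lands at 7/8 here where the example above writes [0, 8]@3), and a deliberately naive re-derivation on every add rather than the optimized sorted insert.

            import java.util.ArrayList;
            import java.util.List;
            import java.util.TreeSet;

            // Toy model: ranges are kept sorted and pairwise disjoint, and where
            // tombstones overlap only the most recent timestamp survives. Not
            // Cassandra's RangeTombstoneList; names are invented for the example.
            final class ToyRangeTombstoneList {
                static final class Range {
                    final int start, end;     // inclusive bounds
                    final long markedAt;      // deletion timestamp
                    Range(int start, int end, long markedAt) {
                        this.start = start; this.end = end; this.markedAt = markedAt;
                    }
                    public String toString() { return "[" + start + ", " + end + "]@" + markedAt; }
                }

                private List<Range> ranges = new ArrayList<>();  // sorted, disjoint

                // Naive O(n log n) re-derivation per add, for clarity only.
                void add(Range r) {
                    TreeSet<Integer> bounds = new TreeSet<>();
                    bounds.add(r.start); bounds.add(r.end + 1);
                    for (Range cur : ranges) { bounds.add(cur.start); bounds.add(cur.end + 1); }

                    List<Range> next = new ArrayList<>();
                    int lo = bounds.first();
                    for (int hi : bounds.tailSet(lo, false)) {
                        long ts = tsAt(lo, r);                   // max timestamp over [lo, hi-1]
                        if (ts != Long.MIN_VALUE) {
                            Range last = next.isEmpty() ? null : next.get(next.size() - 1);
                            if (last != null && last.end + 1 == lo && last.markedAt == ts)
                                next.set(next.size() - 1, new Range(last.start, hi - 1, ts));
                            else
                                next.add(new Range(lo, hi - 1, ts));
                        }
                        lo = hi;
                    }
                    ranges = next;
                }

                private long tsAt(int point, Range extra) {
                    long ts = (extra.start <= point && point <= extra.end) ? extra.markedAt : Long.MIN_VALUE;
                    for (Range cur : ranges)
                        if (cur.start <= point && point <= cur.end) ts = Math.max(ts, cur.markedAt);
                    return ts;
                }

                public static void main(String[] args) {
                    ToyRangeTombstoneList l = new ToyRangeTombstoneList();
                    l.add(new Range(0, 10, 3));
                    l.add(new Range(5, 8, 1));    // fully shadowed by [0, 10]@3
                    l.add(new Range(8, 15, 4));   // truncates [0, 10]@3 on the right
                    System.out.println(l.ranges); // [[0, 7]@3, [8, 15]@4]
                }
            }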

            So, I've pushed a patch at https://github.com/pcmanus/cassandra/commits/5677 with what I have in mind. Instead of providing a generic IntervalTree implementation, it adds a specialized RangeTombstoneList structure (better name suggestions welcome) just for range tombstones. That structure keeps range tombstones as a sorted list, and when adding a new range, it only adds the relevant part (it stores only [0, 8]@3, [8, 15]@4 if the 3 tombstones of my example above are added). It also tries to be reasonably memory efficient (which makes the implementation slightly more verbose than it could be, but it's well contained in the RangeTombstoneList class, so I think it's worth it overall) and optimizes for the "insert tombstones in sorted order" case. The results of the test above with that patch (N=10K, to compare with Fabien's patch):

                    |         Insertions |          Deletions |              Query |     Reversed query
            --------------------------------------------------------------------------------------------
             Run 0  |           3567.9ms |           2766.4ms |             42.8ms |             42.4ms
             Run 1  |           1718.5ms |           1723.9ms |             62.8ms |             33.0ms
             Run 2  |           1288.7ms |            722.4ms |              6.1ms |             21.9ms
             Run 3  |            720.0ms |            363.6ms |             10.3ms |             27.4ms
             Run 4  |            602.3ms |            642.6ms |             14.0ms |             13.4ms
             Run 5  |            272.8ms |            610.8ms |              9.3ms |             12.3ms
             Run 6  |            492.2ms |            278.1ms |             12.5ms |             26.2ms
             Run 7  |            550.8ms |            621.5ms |              5.5ms |             14.1ms
             Run 8  |            278.5ms |            586.0ms |             10.3ms |             19.9ms
             Run 9  |            534.1ms |            282.8ms |             10.7ms |             26.0ms
             Avg    |           1002.6ms |            859.8ms |             18.4ms |             23.7ms
             8 last |            592.4ms |            513.5ms |              9.8ms |             20.2ms
            

            Deletions are about as fast (maybe a few percent slower, but even that could be benchmark noise, since the benchmark isn't optimized for that part), but reads are more than 3x faster. Note that I did not optimize for reverse queries: RangeTombstoneList always keeps tombstones in comparator order, so reverse queries hit the worst possible case for that structure. It wouldn't be very hard to optimize for them the same way we do in ArrayBackedSortedColumns, but I'd rather keep that for a follow-up ticket because, as can be seen above, even in the reverse case RangeTombstoneList is faster, so there is probably no big rush.

            I'll note that my patch is against trunk. I'm not sure what to do for 1.2. Neither my patch nor Fabien's is completely trivial, though at the same time the current performance is fairly bad if you have more than a few range tombstones.


            frousseau Fabien Rousseau added a comment -

            I had a quick look at the patch (I'll take more time to review it next week), and being faster is really great. Keeping only the latest range tombstone (in fact, in our use case, we often overwrite range tombstones) was something I also had in mind but kept for a later optimization: I wrongly assumed the overlapping tombstones were kept for a real reason (like repair, for example).

            I definitely think your approach is better, and the performance numbers confirm it.
            I really think this patch should be available for 1.2 (either as a patch to apply, or directly in 1.2.X).
            If in 1.2.X, an option could be added to the cassandra.yaml file to switch implementations (I quickly checked, and the on-disk format seems compatible...).
            In 1.2.X: make the current implementation the default, to avoid introducing too many changes (users having performance trouble could still switch after doing some tests; also, we could log a warning advising to switch implementations when more than X range tombstones are read...).
            In 2.0: make RangeTombstoneList the default.
            (These are just raw ideas...)

            I can try to rebase your patch for 1.2 next week if you're interested...

            jbellis Jonathan Ellis added a comment -

            Adding switches doesn't really reduce the risk, it just adds complexity.


            slebresne Sylvain Lebresne added a comment -

            I agree; I don't like the idea of adding a switch much either. We just have to decide whether the performance problem is worth the risk. I'm personally slightly leaning towards yes (that it's probably worth the risk) because:

            1. I've seen at least 3 or 4 recent reports of people having problems with this on the mailing list (sometimes in the context of collections, sometimes not). CASSANDRA-5466 is also almost surely a duplicate of this. I'd rather the "avoid collections, they are too inefficient" idea didn't stick too much.
            2. The chance of breaking something for people who don't use range tombstones is fairly small (it's never 0, of course, but the main change is really the backing implementation for range tombstones). And if you do use range tombstones, the current performance is so bad as soon as you have a bit too many of them that the chance is high it'll become a blocker for you.

            But I do admit that this kind of change is a bit bigger than what I'd like to do in a 1.2.7 release in a perfect world.

            In any case, I'll attach a rebased version for 1.2 soonish. Even if we decide against committing it, it can't hurt to have it around.


            slebresne Sylvain Lebresne added a comment -

            Attaching the patch rebased to 1.2. Note that the trunk patch contains 2 commits, but the rebased one is just the first of them. The 2nd is a small read optimization, but it's a bit involved, so it's not worth bothering with in 1.2.

            jbellis Jonathan Ellis added a comment -

            The chance of breaking something for people who don't use range tombstones is fairly small.

            I think this is the most important point. I'm on board for 1.2.


            mkjellman Michael Kjellman added a comment -

            slebresne +1 on the latest patch


            frousseau Fabien Rousseau added a comment -

            I just had a deeper look at the patch for 1.2 (I did not have time to test it yet).

            It modifies more files/classes than I initially thought...

            RangeTombstoneList.minMarkedAt() will always return Long.MIN_VALUE (and likewise RangeTombstoneList.maxMarkedAt() will always return Long.MAX_VALUE).

            Otherwise, LGTM.

            I'll try to do some tests in the coming days.


            slebresne Sylvain Lebresne added a comment -

            RangeTombstoneList.minMarkedAt() will always return Long.MIN_VALUE

            Arf, I even told myself those were probably not worth a unit test. Anyway, patch updated with that fixed (and a unit test for min/max).
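
            For illustration, the symptom is consistent with the classic accumulator-seeding mistake; this is a guess from the description above, not taken from the actual diff:

            class MarkedAtBounds {
                // Guessed bug shape: a min-accumulator seeded with Long.MIN_VALUE
                // can never be lowered, so it is returned unchanged.
                static long buggyMinMarkedAt(long[] markedAts) {
                    long min = Long.MIN_VALUE;        // bug: nothing is < MIN_VALUE,
                    for (long m : markedAts)          // so Math.min(min, m) == min
                        min = Math.min(min, m);
                    return min;
                }

                static long fixedMinMarkedAt(long[] markedAts) {
                    long min = Long.MAX_VALUE;        // fix: seed with min()'s identity
                    for (long m : markedAts)
                        min = Math.min(min, m);
                    return min;                       // MAX_VALUE for an empty list
                }
            }

            Seeding with the identity element also explains the behaviour discussed next: an empty list yields Long.MAX_VALUE from minMarkedAt().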


            frousseau Fabien Rousseau added a comment -

            About minMarkedAt(): is it normal that MAX_VALUE is returned if the list is empty? (Even though a RangeTombstoneList should probably never be empty, so this is not a big deal.)

            Otherwise, I did some tests (generating random intervals, building both an IntervalTree and a RangeTombstoneList, and comparing the results) and found a difference in one particular case (when adding two overlapping intervals in a specific order). A sketch of that testing setup appears at the end of this comment.

            I attach a patch which:

            • adds a new unit test reproducing the problem
            • modifies RangeTombstoneList to fix the problem

            (Note that my patch needs your patch to be applied first, to make the changes easy to see.)
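
            For reference, a bare-bones sketch of that differential-testing setup (placeholder types; the actual test drives the real IntervalTree and RangeTombstoneList):

            import java.util.Random;

            // Feed identical random intervals to a trusted reference and to the
            // structure under test, then compare point queries. DeletionOracle
            // implementations are placeholders to be backed by the real
            // IntervalTree and RangeTombstoneList respectively.
            final class DifferentialTombstoneTest {
                interface DeletionOracle {
                    void add(int start, int end, long markedAt);
                    long markedAtCovering(int point);   // sentinel if nothing covers it
                }

                static void check(DeletionOracle reference, DeletionOracle underTest, long seed) {
                    Random rnd = new Random(seed);
                    for (int i = 0; i < 10_000; i++) {
                        int a = rnd.nextInt(1000), b = rnd.nextInt(1000);
                        int start = Math.min(a, b), end = Math.max(a, b);
                        long ts = rnd.nextInt(100);
                        reference.add(start, end, ts);
                        underTest.add(start, end, ts);
                    }
                    for (int p = 0; p < 1000; p++)      // compare every point
                        if (reference.markedAtCovering(p) != underTest.markedAtCovering(p))
                            throw new AssertionError("mismatch at " + p + " (seed " + seed + ")");
                }
            }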


            slebresne Sylvain Lebresne added a comment -

            is it normal that MAX_VALUE is returned if the list is empty?

            Well, it's not abnormal, and it's consistent with what we do elsewhere for such min/max calculations (minTimestamp in SSTableWriter.appendFromStream, for instance).

            modifies RangeTombstoneList to fix the problem

            Good catch indeed. Are we good with that fixed?


            frousseau Fabien Rousseau added a comment -

            Well, it's not abnormal, and it's consistent with what we do elsewhere for such min/max calculations (minTimestamp in SSTableWriter.appendFromStream, for instance).

            Ok

            Are we good with that fixed?

            Yep, it's ok for me


            slebresne Sylvain Lebresne added a comment -

            Alright, good. Committed then, thanks.


            People

              Assignee: slebresne Sylvain Lebresne
              Reporter: frousseau Fabien Rousseau
              Votes: 2
              Watchers: 10
