Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels:
      None

      Description

      currently if a row contains > 1000 columns, the run time becomes considerably slow (my test of
      a row with 30 00 columns (standard, regular) each with 8 bytes in name, and 40 bytes in value, is about 16ms.
      this is all running in memory, no disk read is involved.

      through debugging we can find
      most of this time is spent on
      [Wall Time] org.apache.cassandra.db.Table.getRow(QueryFilter)
      [Wall Time] org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(QueryFilter, ColumnFamily)
      [Wall Time] org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(QueryFilter, int, ColumnFamily)
      [Wall Time] org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(QueryFilter, int, ColumnFamily)
      [Wall Time] org.apache.cassandra.db.filter.QueryFilter.collectCollatedColumns(ColumnFamily, Iterator, int)
      [Wall Time] org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(IColumnContainer, Iterator, int)
      [Wall Time] org.apache.cassandra.db.ColumnFamily.addColumn(IColumn)

      ColumnFamily.addColumn() is slow because it inserts into an internal concurrentSkipListMap() that maps column names to values.
      this structure is slow for two reasons: it needs to do synchronization; it needs to maintain a more complex structure of map.

      but if we look at the whole read path, thrift already defines the read output to be List<ColumnOrSuperColumn> so it does not make sense to use a luxury map data structure in the interium and finally convert it to a list. on the synchronization side, since the return CF is never going to be shared/modified by other threads, we know the access is always single thread, so no synchronization is needed.

      but these 2 features are indeed needed for ColumnFamily in other cases, particularly write. so we can provide a different ColumnFamily to CFS.getTopLevelColumnFamily(), so getTopLevelColumnFamily no longer always creates the standard ColumnFamily, but take a provided returnCF, whose cost is much cheaper.

      the provided patch is for demonstration now, will work further once we agree on the general direction.
      CFS, ColumnFamily, and Table are changed; a new FastColumnFamily is provided. the main work is to let the FastColumnFamily use an array for internal storage. at first I used binary search to insert new columns in addColumn(), but later I found that even this is not necessary, since all calling scenarios of ColumnFamily.addColumn() has an invariant that the inserted columns come in sorted order (I still have an issue to resolve descending or ascending now, but ascending works). so the current logic is simply to compare the new column against the end column in the array, if names not equal, append, if equal, reconcile.

      slight temporary hacks are made on getTopLevelColumnFamily so we have 2 flavors of the method, one accepting a returnCF. but we could definitely think about what is the better way to provide this returnCF.

      this patch compiles fine, no tests are provided yet. but I tested it in my application, and the performance improvement is dramatic: it offers about 50% reduction in read time in the 3000-column case.

      thanks
      Yang

      1. 2843.patch
        47 kB
        Sylvain Lebresne
      2. microBenchmark.patch
        9 kB
        Sylvain Lebresne
      3. std_timing
        8 kB
        Yang Yang
      4. patch_timing
        8 kB
        Yang Yang
      5. 2843_d.patch
        53 kB
        Yang Yang
      6. fix.diff
        1 kB
        Yang Yang
      7. 2843_g.patch
        71 kB
        Sylvain Lebresne
      8. 2843_h.patch
        69 kB
        Sylvain Lebresne

        Activity

        Gavin made changes -
        Workflow patch-available, re-open possible [ 12751769 ] reopen-resolved, no closed status, patch-avail, testing [ 12755052 ]
        Gavin made changes -
        Workflow no-reopen-closed, patch-avail [ 12618786 ] patch-available, re-open possible [ 12751769 ]
        Sylvain Lebresne made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Assignee Sylvain Lebresne [ slebresne ]
        Reviewer jbellis
        Resolution Fixed [ 1 ]
        Sylvain Lebresne made changes -
        Attachment 2843_h.patch [ 12489824 ]
        Sylvain Lebresne made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Fix Version/s 1.0 [ 12316349 ]
        Sylvain Lebresne made changes -
        Attachment 2843_g.patch [ 12489479 ]
        Sylvain Lebresne made changes -
        Attachment 2843_f.patch [ 12487724 ]
        Yang Yang made changes -
        Attachment fix.diff [ 12489263 ]
        Sylvain Lebresne made changes -
        Attachment 2843_f.patch [ 12487724 ]
        Sylvain Lebresne made changes -
        Attachment 2843_e.patch [ 12487717 ]
        Sylvain Lebresne made changes -
        Attachment 2843_e.patch [ 12487717 ]
        Yang Yang made changes -
        Comment [ the DeletionInfo private=>protected change is moved to https://issues.apache.org/jira/browse/CASSANDRA-2937

        new patch uploaded here ]
        Yang Yang made changes -
        Comment [ the DeletionInfo private=>protected change is moved to https://issues.apache.org/jira/browse/CASSANDRA-2937

        new patch uploaded here ]
        Yang Yang made changes -
        Comment [ the DeletionInfo private=>protected change is moved to https://issues.apache.org/jira/browse/CASSANDRA-2937

        new patch uploaded here ]
        Yang Yang made changes -
        Attachment incremental.diff [ 12486140 ]
        Yang Yang made changes -
        Attachment fast_cf_081_trunk.diff [ 12484900 ]
        Yang Yang made changes -
        Attachment 2843_d.patch [ 12487377 ]
        Yang Yang made changes -
        Attachment 2843_d.patch [ 12487378 ]
        Yang Yang made changes -
        Attachment 2843_d.patch [ 12487379 ]
        Yang Yang made changes -
        Attachment 2843_d.patch [ 12487380 ]
        Yang Yang made changes -
        Attachment 2843_c.patch [ 12486507 ]
        Yang Yang made changes -
        Attachment 2843_d.patch [ 12487379 ]
        Yang Yang made changes -
        Attachment 2843_d.patch [ 12487378 ]
        Yang Yang made changes -
        Attachment 2843_d.patch [ 12487377 ]
        Yang Yang made changes -
        Attachment patch_timing [ 12487341 ]
        Yang Yang made changes -
        Attachment std_timing [ 12487340 ]
        Yang Yang made changes -
        Attachment 2843_c.patch [ 12486507 ]
        Yang Yang made changes -
        Attachment 2843_c.patch [ 12486493 ]
        Yang Yang made changes -
        Comment [

        rebased against against HEAD of trunk (4629648899e637e8e03938935f126689cce5ad48)


        also fixed a bug in my newly added test; also the DeletionInfo class in AbstractColumnContainer somehow gives compile error in eclipse, had to change that into protected. ]
        Yang Yang made changes -
        Comment [ rebased , against HEAD of trunk (4629648899e637e8e03938935f126689cce5ad48) ]
        Yang Yang made changes -
        Attachment 2843_c.patch [ 12486493 ]
        Yang Yang made changes -
        Attachment 2843_c.patch [ 12486486 ]
        Yang Yang made changes -
        Attachment 2843_c.patch [ 12486486 ]
        Yang Yang made changes -
        Attachment 2843_b.patch [ 12486139 ]
        Yang Yang made changes -
        Attachment incremental.diff [ 12486140 ]
        Yang Yang made changes -
        Attachment 2843_b.patch [ 12486139 ]
        Yang Yang made changes -
        Comment [ i just got down the patch and transfered it to a computer to read
        Sylvain's approach to compare the last element is quite clean i see no
        problems

        the only problem was due to me: the bin search high=mid-1 should be changed
        to high=mid
        also with this error fixed , you dont need to special case 1 2 in the end of
        binsearch


        ]
        Yang Yang made changes -
        Comment [ My comment before last was not quite correct

        The array implementation with binary search also works without special
        assumptions. Just that with assumption alot futher gains
        https://issues.apache.org/jira/browse/CASSANDRA-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
        ColumnFamily during is a good idea. It is clear that avoiding
        synchronization will be faster, and given the type of operations we do
        during reads (insertion in sorted order and iteration), an ArrayList backed
        solution is sure to be faster too. I will also be much gentle on the GC that
        the linked list ConcurrentSkipListMap uses. I think that all those will help
        even with relatively small reads. So let's focus on that for this ticket and
        let other potential improvement to other ticket, especially if it is unclear
        they bear any noticeable speedup.
        is quite frankly ugly and will be a maintenance nightmare (you'll have to
        check you did overwrite every function that touch the map (which is not the
        case in the patch) and every update to ColumnFamily have to be aware that it
        should update FastColumnFamily as well).
        functionnal ColumnFamily implementation (albeit not synchronized). That is,
        we can't assume that addition will always be in strict increasing order,
        otherwise again this will be too hard to use.
        Granted, I don't think it is used in the read path, but I think that the new
        ColumnFamily implementation could advantageously be used during compaction
        (by preCompactedRow typically, and possibly other places where concurrent
        access is not an issue) where this would matter.
        the remarks above. The patch is against trunk (not 0.8 branch), because it
        build on the recently committed refactor of ColumnFamily. It refactors
        ColumnFamily (AbstractColumnContainer actually) to allow for a pluggable
        backing column map. The ConcurrentSkipListMap implemn is name
        ThreadSafeColumnMap and the new one is called ArrayBackedColumnMap (which I
        prefer to FastSomething since it's not a very helpful name).
        getTopLevelColumns, I pass along a factory (that each backing implementation
        provides). The main goal was to avoid creating a columnFamily when it's
        useless (if row cache is enabled on the CF -- btw, this ticket only improve
        on read for column family with no cache).
        (addition of column + iteration), the ArrayBacked implementation is faster
        than the ConcurrentSkipListMap based one. Interestingly though, this is
        mainly true when some reconciliation of columns happens. That is, if you
        only add columns with different names, the ArrayBacked implementation is
        faster, but not dramatically so. If you start adding column that have to be
        resolved, the ArrayBacked implementation becomes much faster, even with a
        reasonably small number of columns (inserting 100 columns with only 10
        unique column names, the ArrayBacked is already >30% faster). And this
        mostly due to the overhead of synchronization (of replace()): a TreeMap
        based implementation is slightly slower than the ArrayBacked one but not by
        a lot and thus is much faster than the ConcurrentSkipListMap implementation.
        use a few unit test for the new ArrayBacked implementation).
        considerably slow (my test of
        and 40 bytes in value, is about 16ms.
        org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(QueryFilter,
        ColumnFamily)
        org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(QueryFilter, int,
        ColumnFamily)
        org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(QueryFilter,
        int, ColumnFamily)
        org.apache.cassandra.db.filter.QueryFilter.collectCollatedColumns(ColumnFamily,
        Iterator, int)
        org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(IColumnContainer,
        Iterator, int)
        concurrentSkipListMap() that maps column names to values.
        it needs to maintain a more complex structure of map.
        output to be List<ColumnOrSuperColumn> so it does not make sense to use a
        luxury map data structure in the interium and finally convert it to a list.
        on the synchronization side, since the return CF is never going to be
        shared/modified by other threads, we know the access is always single
        thread, so no synchronization is needed.
        particularly write. so we can provide a different ColumnFamily to
        CFS.getTopLevelColumnFamily(), so getTopLevelColumnFamily no longer always
        creates the standard ColumnFamily, but take a provided returnCF, whose cost
        is much cheaper.
        agree on the general direction.
        provided. the main work is to let the FastColumnFamily use an array for
        internal storage. at first I used binary search to insert new columns in
        addColumn(), but later I found that even this is not necessary, since all
        calling scenarios of ColumnFamily.addColumn() has an invariant that the
        inserted columns come in sorted order (I still have an issue to resolve
        descending or ascending now, but ascending works). so the current logic is
        simply to compare the new column against the end column in the array, if
        names not equal, append, if equal, reconcile.
        flavors of the method, one accepting a returnCF. but we could definitely
        think about what is the better way to provide this returnCF.
        my application, and the performance improvement is dramatic: it offers about
        50% reduction in read time in the 3000-column case.
        ]
        Yang Yang made changes -
        Comment [ Thanks Sylvain

        Its great to see this taking shape

        Im on vacation with a pbone so very cursory comments

        " this implementation should not expect the input always come in already
        sorted order" then probably it would be difficult to achieve a lot of per
        gain easily. Mainly there are two areas of attack. Synch and cheaper data
        structure. Actually during my tests I found that sync is not the main issue:
        going from CSLM to treemap is actually slower. Your tests show different so
        we'd better confirm. If I were correct in tests. Then that means not
        assuming special circumstances would be like finding a generic silver bullet
        which is very difficult
        https://issues.apache.org/jira/browse/CASSANDRA-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
        ColumnFamily during is a good idea. It is clear that avoiding
        synchronization will be faster, and given the type of operations we do
        during reads (insertion in sorted order and iteration), an ArrayList backed
        solution is sure to be faster too. I will also be much gentle on the GC that
        the linked list ConcurrentSkipListMap uses. I think that all those will help
        even with relatively small reads. So let's focus on that for this ticket and
        let other potential improvement to other ticket, especially if it is unclear
        they bear any noticeable speedup.
        is quite frankly ugly and will be a maintenance nightmare (you'll have to
        check you did overwrite every function that touch the map (which is not the
        case in the patch) and every update to ColumnFamily have to be aware that it
        should update FastColumnFamily as well).
        functionnal ColumnFamily implementation (albeit not synchronized). That is,
        we can't assume that addition will always be in strict increasing order,
        otherwise again this will be too hard to use.
        Granted, I don't think it is used in the read path, but I think that the new
        ColumnFamily implementation could advantageously be used during compaction
        (by preCompactedRow typically, and possibly other places where concurrent
        access is not an issue) where this would matter.
        the remarks above. The patch is against trunk (not 0.8 branch), because it
        build on the recently committed refactor of ColumnFamily. It refactors
        ColumnFamily (AbstractColumnContainer actually) to allow for a pluggable
        backing column map. The ConcurrentSkipListMap implemn is name
        ThreadSafeColumnMap and the new one is called ArrayBackedColumnMap (which I
        prefer to FastSomething since it's not a very helpful name).
        getTopLevelColumns, I pass along a factory (that each backing implementation
        provides). The main goal was to avoid creating a columnFamily when it's
        useless (if row cache is enabled on the CF -- btw, this ticket only improve
        on read for column family with no cache).
        (addition of column + iteration), the ArrayBacked implementation is faster
        than the ConcurrentSkipListMap based one. Interestingly though, this is
        mainly true when some reconciliation of columns happens. That is, if you
        only add columns with different names, the ArrayBacked implementation is
        faster, but not dramatically so. If you start adding column that have to be
        resolved, the ArrayBacked implementation becomes much faster, even with a
        reasonably small number of columns (inserting 100 columns with only 10
        unique column names, the ArrayBacked is already >30% faster). And this
        mostly due to the overhead of synchronization (of replace()): a TreeMap
        based implementation is slightly slower than the ArrayBacked one but not by
        a lot and thus is much faster than the ConcurrentSkipListMap implementation.
        use a few unit test for the new ArrayBacked implementation).
        considerably slow (my test of
        and 40 bytes in value, is about 16ms.
        org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(QueryFilter,
        ColumnFamily)
        org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(QueryFilter, int,
        ColumnFamily)
        org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(QueryFilter,
        int, ColumnFamily)
        org.apache.cassandra.db.filter.QueryFilter.collectCollatedColumns(ColumnFamily,
        Iterator, int)
        org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(IColumnContainer,
        Iterator, int)
        concurrentSkipListMap() that maps column names to values.
        it needs to maintain a more complex structure of map.
        output to be List<ColumnOrSuperColumn> so it does not make sense to use a
        luxury map data structure in the interium and finally convert it to a list.
        on the synchronization side, since the return CF is never going to be
        shared/modified by other threads, we know the access is always single
        thread, so no synchronization is needed.
        particularly write. so we can provide a different ColumnFamily to
        CFS.getTopLevelColumnFamily(), so getTopLevelColumnFamily no longer always
        creates the standard ColumnFamily, but take a provided returnCF, whose cost
        is much cheaper.
        agree on the general direction.
        provided. the main work is to let the FastColumnFamily use an array for
        internal storage. at first I used binary search to insert new columns in
        addColumn(), but later I found that even this is not necessary, since all
        calling scenarios of ColumnFamily.addColumn() has an invariant that the
        inserted columns come in sorted order (I still have an issue to resolve
        descending or ascending now, but ascending works). so the current logic is
        simply to compare the new column against the end column in the array, if
        names not equal, append, if equal, reconcile.
        flavors of the method, one accepting a returnCF. but we could definitely
        think about what is the better way to provide this returnCF.
        my application, and the performance improvement is dramatic: it offers about
        50% reduction in read time in the 3000-column case.
        ]
        Sylvain Lebresne made changes -
        Attachment microBenchmark.patch [ 12485396 ]
        Sylvain Lebresne made changes -
        Attachment 2843.patch [ 12485290 ]
        Yang Yang made changes -
        Attachment fast_cf_081_trunk.diff [ 12484900 ]
        Yang Yang made changes -
        Attachment fast_cf.diff [ 12484852 ]
        Yang Yang made changes -
        Attachment b.tar.gz [ 12484853 ]
        Yang Yang made changes -
        Attachment b.tar.gz [ 12484853 ]
        Yang Yang made changes -
        Field Original Value New Value
        Attachment fast_cf.diff [ 12484852 ]
        Yang Yang created issue -

          People

          • Assignee:
            Sylvain Lebresne
            Reporter:
            Yang Yang
            Reviewer:
            Jonathan Ellis
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development