Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Kudu_Impala
    • Fix Version/s: Impala 2.8.0
    • Component/s: Backend
    • Labels:

      Description

      Kudu supports writing with AUTO_FLUSH_BACKGROUND mode as of Kudu 1.0 for flushing buffered write operations:

      See https://github.com/apache/kudu/blob/branch-1.0.x/src/kudu/client/client.h#L1157

      This may improve performance in some cases, so we should test this and consider switching.

      From Alexey Serbin:

      I did my testing with simple 'push-as-much-as-the-client-can-do' scenarios, and the results look good (will share a link to the performance summary soon): the new code performs comparably with the old one when both run in MANUAL_FLUSH mode. And of course, a session in AUTO_FLUSH_BACKGROUND mode performs much better than a session in AUTO_FLUSH_SYNC mode. Also, a session in AUTO_FLUSH_BACKGROUND mode performs better than a session in MANUAL_FLUSH mode if the buffer of the former can accommodate more operations than the latter flushes at a time.
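The comparison above can be illustrated with a toy model that only counts flush round trips. This is a rough sketch, not the Kudu client API: `ToySession`, `round_trips`, `buffer_capacity`, and `manual_batch` are hypothetical names introduced here, and the model ignores the fact that a background flush overlaps with the caller applying further operations, so it likely understates the benefit of AUTO_FLUSH_BACKGROUND.

```python
from dataclasses import dataclass


@dataclass
class ToySession:
    """Counts simulated flush round trips. Hypothetical; not the Kudu client."""
    mode: str                 # "sync", "manual", or "background"
    buffer_capacity: int = 1  # ops the buffer can hold (used by "background")
    pending: int = 0          # ops applied but not yet flushed
    round_trips: int = 0      # number of simulated flushes sent

    def apply(self, op_count: int = 1) -> None:
        for _ in range(op_count):
            self.pending += 1
            if self.mode == "sync":
                # Every operation is flushed on its own.
                self.flush()
            elif self.mode == "background" and self.pending >= self.buffer_capacity:
                # The buffer is full, so it is flushed automatically.
                self.flush()

    def flush(self) -> None:
        if self.pending:
            self.round_trips += 1
            self.pending = 0


def round_trips(mode, total_ops, buffer_capacity=1, manual_batch=1):
    """Apply total_ops operations in the given mode and return the flush count."""
    session = ToySession(mode=mode, buffer_capacity=buffer_capacity)
    if mode == "manual":
        done = 0
        while done < total_ops:
            batch = min(manual_batch, total_ops - done)
            session.apply(batch)
            session.flush()  # the caller decides when to flush
            done += batch
    else:
        session.apply(total_ops)
        session.flush()  # send whatever is still buffered at the end
    return session.round_trips
```

In this model, a background session needs fewer flushes than a manual one only when `buffer_capacity` exceeds `manual_batch`, which mirrors the condition stated in the last sentence of the quote.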

        Activity

        mjacobs Matthew Jacobs added a comment - some notes from Alexey: https://gist.github.com/alexeyserbin/35b9eac889c6f2586d58c1fe0c2b3afd
        mjacobs Matthew Jacobs added a comment -

        commit 99ed6dc67ae889eb2a45b10c97cb23f52bc83e5d
        Author: Matthew Jacobs <mj@cloudera.com>
        Date: Wed Oct 19 15:30:58 2016 -0700

        IMPALA-4134,IMPALA-3704: Kudu INSERT improvements

        1.) IMPALA-4134: Use Kudu AUTO FLUSH
        Improves performance of writes to Kudu up to 4.2x in
        bulk data loading tests (load 200 million rows from
        lineitem).

        2.) IMPALA-3704: Improve errors on PK conflicts
        The Kudu client reports an error for every PK conflict,
        and all errors were being returned in the error status.
        As a result, inserts/updates/deletes could return errors
        with thousands of errors reported. This changes the error
        handling to log all reported errors as warnings and
        return only the first error in the query error status.

        3.) Improve the DataSink reporting of the insert stats.
        The per-partition stats returned by the data sink weren't
        useful for Kudu sinks. Firstly, the number of appended rows
        was not being displayed in the profile. Secondly, the
        'stats' field isn't populated for Kudu tables and thus was
        confusing in the profile, so it is no longer printed if it
        is not set in the thrift struct.

        Testing: Ran local tests, including new tests to verify
        the query profile insert stats. Manual cluster testing was
        conducted of the AUTO FLUSH functionality, and that testing
        informed the default mutation buffer value of 100MB which
        was found to provide good results.

        Change-Id: I5542b9a061b01c543a139e8722560b1365f06595
        Reviewed-on: http://gerrit.cloudera.org:8080/4728
        Reviewed-by: Matthew Jacobs <mj@cloudera.com>
        Tested-by: Internal Jenkins
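
The error-handling change in item 2 of the commit message (log every reported error as a warning, return only the first in the query status) might look roughly like the following. This is an illustrative sketch only: the function name `summarize_write_errors`, the logger name, and the use of plain strings for errors are assumptions made here, not Impala's actual implementation.

```python
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("toy_kudu_sink")


def summarize_write_errors(errors):
    """Log every error as a warning and return only the first one.

    Returns None when there are no errors, standing in for an OK status.
    """
    for err in errors:
        logger.warning("Kudu write error: %s", err)
    return errors[0] if errors else None
```

With thousands of primary key conflicts, the caller would still see a single error in the status while the full list remains available in the logs.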


          People

          • Assignee:
            mjacobs Matthew Jacobs
            Reporter:
            mjacobs Matthew Jacobs