Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-3788

Add Kudu support for read-your-writes scan consistency

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Kudu_Impala
    • Fix Version/s: Impala 2.8.0
    • Component/s: Backend, Frontend
    • Labels:

      Description

      Impala currently distributes Kudu scan load across Kudu replicas. However, this means that it will read from replicas who may be out of date with the leader replica. This results in observable consistency abberations where the read history may be delayed from the write history.

      It would be useful to provide options or keywords in Impala so that read-your-writes consistency can be achieved.

      It is already possible to get this behavior with the Kudu C++ client API, so it's achievable today in Impala. See https://issues.apache.org/jira/browse/KUDU-1497 for the Kudu feature being tracked to make this API consistent across client implementations and more user friendly.

        Activity

        Hide
        mjacobs Matthew Jacobs added a comment -

        commit 0d4bdc1b70464e71cd3dc44f6fbaf0aa619932e0
        Author: Matthew Jacobs <mj@cloudera.com>
        Date: Tue Nov 29 15:25:40 2016 -0800

        IMPALA-3788: Add flag for Kudu read-your-writes

        The previous attempt to support for Kudu 'read-your-writes'
        consistency successfully captured the latest observed ts
        from the Kudu client after a write, and to propagate it to
        future Kudu clients within the same session. That alone made
        writes within a session linearizable, but it did not fully
        address 'read-your-writes' semantics because the Kudu client
        in the KuduScanner needed further configuration.

        The Kudu client exposes an option to set the 'ReadMode',
        which can be either READ_LATEST or READ_AT_SNAPSHOT. The
        former is the default and allows the client to read the
        latest known value for every row, and there is no
        consistency among the version of the rows read within that
        scan. When READ_AT_SNAPSHOT is enabled, the client will pick a
        ts that is after the latest observed session ts (propagated
        and set with SetLatestObservedTimestamp() by the previous
        commit for IMPALA-3788) and perform a snapshot read at that
        time. This timestamp is still determined per-client, so that
        does not mean that the entire query performs a snapshot read
        at the same timestamp-- doing that requires further work
        in Kudu and will require another change in Impala as well.

        That said, this behavior is sufficient to satisfy
        'read-your-writes' consistency in all cases except when a
        DML statement is reading and writing the same table, e.g.
        INSERT INTO foo SELECT ... from foo
        This case may result in reading rows that were inserted by a
        different node of the same query. This case will be handled
        when a global snapshot timestamp is supported and configured
        by Impala.

        Because this is performing a snapshot read, some rows may be
        read from lagging replicas and thus those replicas will have
        to wait before returning rows. This has implications for
        the query execution behavior (e.g. queries may be more
        likely to time out, may affect number of queries that can be
        run), so the behavior is not yet enabled by default. It can
        be enabled with the flag --kudu_read_mode READ_AT_SNAPSHOT
        The goal is to make this the default behavior after
        sufficient testing.

        Change-Id: I003aba410548bc9158d1e11abbdcf710c31a82ff
        Reviewed-on: http://gerrit.cloudera.org:8080/5288
        Reviewed-by: Matthew Jacobs <mj@cloudera.com>
        Tested-by: Internal Jenkins

        Show
        mjacobs Matthew Jacobs added a comment - commit 0d4bdc1b70464e71cd3dc44f6fbaf0aa619932e0 Author: Matthew Jacobs <mj@cloudera.com> Date: Tue Nov 29 15:25:40 2016 -0800 IMPALA-3788 : Add flag for Kudu read-your-writes The previous attempt to support for Kudu 'read-your-writes' consistency successfully captured the latest observed ts from the Kudu client after a write, and to propagate it to future Kudu clients within the same session. That alone made writes within a session linearizable, but it did not fully address 'read-your-writes' semantics because the Kudu client in the KuduScanner needed further configuration. The Kudu client exposes an option to set the 'ReadMode', which can be either READ_LATEST or READ_AT_SNAPSHOT. The former is the default and allows the client to read the latest known value for every row, and there is no consistency among the version of the rows read within that scan. When READ_AT_SNAPSHOT is enabled, the client will pick a ts that is after the latest observed session ts (propagated and set with SetLatestObservedTimestamp() by the previous commit for IMPALA-3788 ) and perform a snapshot read at that time. This timestamp is still determined per-client, so that does not mean that the entire query performs a snapshot read at the same timestamp-- doing that requires further work in Kudu and will require another change in Impala as well. That said, this behavior is sufficient to satisfy 'read-your-writes' consistency in all cases except when a DML statement is reading and writing the same table, e.g. INSERT INTO foo SELECT ... from foo This case may result in reading rows that were inserted by a different node of the same query. This case will be handled when a global snapshot timestamp is supported and configured by Impala. Because this is performing a snapshot read, some rows may be read from lagging replicas and thus those replicas will have to wait before returning rows. This has implications for the query execution behavior (e.g. queries may be more likely to time out, may affect number of queries that can be run), so the behavior is not yet enabled by default. It can be enabled with the flag --kudu_read_mode READ_AT_SNAPSHOT The goal is to make this the default behavior after sufficient testing. Change-Id: I003aba410548bc9158d1e11abbdcf710c31a82ff Reviewed-on: http://gerrit.cloudera.org:8080/5288 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins
        Hide
        mjacobs Matthew Jacobs added a comment -

        Re-opening after discussing with David Alves. The first commit is most of the work, but there's a bit more to do here to really get 'read-your-writes' consistency from Kudu:
        1) Impala needs to set the read consistency mode to READ SNAPSHOT
        2) have the coordinator take the max of a the session's last observed kudu timestamp on the coordinator (added in the previous commit) and the timestamp on the coordinator (which needs some bits masked to be used as a kudu hybridtimestamp).

        Without doing #1 above, the last observed kudu timestamp (from the first commit) only affects write consistency (which makes writes linearizable). When the mode is changed to SNAPSHOT, it will read at that snapshot but the time may be far in the past. #2 will ensure a more recent, globally consistent snapshot is read.

        Show
        mjacobs Matthew Jacobs added a comment - Re-opening after discussing with David Alves . The first commit is most of the work, but there's a bit more to do here to really get 'read-your-writes' consistency from Kudu: 1) Impala needs to set the read consistency mode to READ SNAPSHOT 2) have the coordinator take the max of a the session's last observed kudu timestamp on the coordinator (added in the previous commit) and the timestamp on the coordinator (which needs some bits masked to be used as a kudu hybridtimestamp). Without doing #1 above, the last observed kudu timestamp (from the first commit) only affects write consistency (which makes writes linearizable). When the mode is changed to SNAPSHOT, it will read at that snapshot but the time may be far in the past. #2 will ensure a more recent, globally consistent snapshot is read.
        Hide
        mjacobs Matthew Jacobs added a comment -

        commit c01644bcb9746c440cc6fd425a564ec40ea6d27c
        Author: Matthew Jacobs <mj@cloudera.com>
        Date: Thu Oct 20 15:21:53 2016 -0700

        IMPALA-3788: Support for Kudu 'read-your-writes' consistency

        Kudu provides an API to get/set a 'latest observed
        timestamp' on clients to allow a client which inserts to
        capture and send this timestamp to another client before a
        read to ensure that data as of that timestamp is visible.
        This adds support for this feature _for reads within a
        session_ by capturing the latest observed timestamp when the
        KuduTableSink is sending its last update to the coordinator.
        The timestamp is sent with other post-write information, and
        is aggregated (i.e. taking the max) at the coordinator. The
        max is stored in the session, and that value is then set in
        the Kudu client on future scans.

        This is being tested by running the Kudu tests after
        removing delays that were introduced to work around the
        issue that reads might not be visible after a write. Before
        this change, if there were no delay, inconsistent results
        could be returned.

        Change-Id: I6bcb5fc218ad4ab935343a55b2188441d8c7dfbd
        Reviewed-on: http://gerrit.cloudera.org:8080/4779
        Reviewed-by: Matthew Jacobs <mj@cloudera.com>
        Tested-by: Internal Jenkins

        Show
        mjacobs Matthew Jacobs added a comment - commit c01644bcb9746c440cc6fd425a564ec40ea6d27c Author: Matthew Jacobs <mj@cloudera.com> Date: Thu Oct 20 15:21:53 2016 -0700 IMPALA-3788 : Support for Kudu 'read-your-writes' consistency Kudu provides an API to get/set a 'latest observed timestamp' on clients to allow a client which inserts to capture and send this timestamp to another client before a read to ensure that data as of that timestamp is visible. This adds support for this feature _for reads within a session_ by capturing the latest observed timestamp when the KuduTableSink is sending its last update to the coordinator. The timestamp is sent with other post-write information, and is aggregated (i.e. taking the max) at the coordinator. The max is stored in the session, and that value is then set in the Kudu client on future scans. This is being tested by running the Kudu tests after removing delays that were introduced to work around the issue that reads might not be visible after a write. Before this change, if there were no delay, inconsistent results could be returned. Change-Id: I6bcb5fc218ad4ab935343a55b2188441d8c7dfbd Reviewed-on: http://gerrit.cloudera.org:8080/4779 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins

          People

          • Assignee:
            mjacobs Matthew Jacobs
            Reporter:
            mpercy Mike Percy
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development