[HDFS-12943] Consistent Reads from Standby Node - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.10.0, 3.3.0, 3.1.4, 3.2.2
Component/s: hdfs
Labels:
None

Target Version/s:

3.3.0
Hadoop Flags:

Reviewed
Release Note:

Observer is a new type of NameNode, in addition to the Active and Standby NameNodes in HA settings. An Observer Node maintains a replica of the namespace, the same as a Standby Node. It additionally allows execution of client read requests.

To ensure read-after-write consistency within a single client, a state ID is introduced in RPC headers. The Observer responds to the client request only after its own state has caught up with the client’s state ID, which it previously received from the Active NameNode.

Clients can explicitly invoke a new client protocol call msync(), which ensures that subsequent reads by this client from an Observer are consistent.

A new client-side ObserverReadProxyProvider is introduced to provide automatic switching between Active and Observer NameNodes for submitting respectively write and read requests.
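
The state ID mechanism described above can be illustrated with a small, self-contained model. This is a sketch under stated assumptions, not the HDFS implementation: ToyNameNode, ToyClient, and all of their method names are hypothetical, the state ID is modeled as a plain transaction counter, and a boolean is returned where a real Observer would hold the request until it has caught up.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy model (hypothetical): a NameNode whose state ID is the last transaction it applied. */
final class ToyNameNode {
    private final AtomicLong lastAppliedTxId = new AtomicLong();

    /** Active path: apply one write and return the resulting state ID. */
    long write() {
        return lastAppliedTxId.incrementAndGet();
    }

    /** Observer path: tailing edits advances the state up to the given transaction. */
    void tailEditsUpTo(long txId) {
        lastAppliedTxId.accumulateAndGet(txId, Math::max);
    }

    /** A read may be answered only once this node has caught up with the caller's state ID. */
    boolean canServeRead(long clientLastSeenStateId) {
        return lastAppliedTxId.get() >= clientLastSeenStateId;
    }

    long stateId() {
        return lastAppliedTxId.get();
    }
}

/** Toy client (hypothetical) that remembers the highest state ID it has seen. */
final class ToyClient {
    private long lastSeenStateId;

    /** Writes go to the Active; the response carries the Active's new state ID. */
    void write(ToyNameNode active) {
        lastSeenStateId = Math.max(lastSeenStateId, active.write());
    }

    /** Analogue of msync(): refresh the client's state ID from the Active. */
    void msync(ToyNameNode active) {
        lastSeenStateId = Math.max(lastSeenStateId, active.stateId());
    }

    /** Returns true if the Observer is far enough along to serve this client. */
    boolean tryRead(ToyNameNode observer) {
        return observer.canServeRead(lastSeenStateId);
    }
}
```

In this model, a client that has just written cannot be served by an Observer that has not yet tailed the corresponding edits, and a second client only becomes subject to that constraint after calling the msync analogue.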

Description

StandbyNode in HDFS is a replica of the active NameNode. The states of the NameNodes are coordinated via the journal. It is natural to consider StandbyNode as a read-only replica. As with any replicated distributed system, the problem of stale reads must be resolved. Our main goal is to provide reads from standby in a consistent way in order to enable a wide range of existing applications running on top of HDFS.

Attachments

ConsistentReadsFromStandbyNode.pdf
19/Dec/17 20:17
394 kB
Konstantin Shvachko
ConsistentReadsFromStandbyNode.pdf
23/Mar/18 21:36
396 kB
Konstantin Shvachko
TestPlan-ConsistentReadsFromStandbyNode.pdf
04/Sep/18 23:36
79 kB
Konstantin Shvachko
HDFS-12943-001.patch
06/Dec/18 00:33
328 kB
Konstantin Shvachko
HDFS-12943-002.patch
15/Dec/18 02:07
354 kB
Konstantin Shvachko
HDFS-12943-003.patch
22/Dec/18 00:33
353 kB
Konstantin Shvachko
HDFS-12943-004.patch
24/Dec/18 18:12
353 kB
Konstantin Shvachko

Issue Links

breaks

HDFS-14435 ObserverReadProxyProvider is unable to properly fetch HAState from Standby NNs

Resolved

incorporates

HDFS-15751 Add documentation for msync() API to filesystem.md

Resolved

is related to

HDFS-14245 Class cast error in GetGroups with ObserverReadProxyProvider

Resolved

HDFS-13664 Refactor ConfiguredFailoverProxyProvider to make inheritance easier

Resolved

HDFS-10702 Add a Client API and Proxy Provider to enable stale read from Standby

Open

HDFS-6440 Support more than 2 NameNodes

Resolved

HADOOP-17477 [SBN read] Implement msync() for ViewFS

Open

HDFS-14205 Backport HDFS-6440 to branch-2

Resolved

HDFS-10519 Add a configuration option to enable in-progress edit log tailing

Resolved

HDFS-13735 Make QJM HTTP URL connection timeout configurable

Resolved

HDFS-13814 Remove super user privilege requirement for NameNode.getServiceStatus

Resolved

relates to

HDFS-14272 [SBN read] ObserverReadProxyProvider should sync with active txnID on startup

Resolved

HDFS-14279 [SBN Read] Race condition in ObserverReadProxyProvider

Resolved

HDFS-14347 Restore a comment line mistakenly removed in ProtobufRpcEngine

Resolved

HDFS-14204 Backport HDFS-12943 to branch-2

Resolved

HDFS-14211 [Consistent Observer Reads] Allow for configurable "always msync" mode

Resolved

HDFS-14250 [SBN read] msync should sync with active NameNode to fetch the latest stateID

Resolved

HDFS-14271 [SBN read] StandbyException is logged if Observer is the first NameNode

Patch Available

HDFS-14573 Backport Standby Read to branch-3

Resolved

Sub-Tasks

1.

Tailing edits should not update quota counts on ObserverNode

Resolved

Erik Krogen

2.

Changes to the NameNode to support reads from standby

Resolved

Chao Sun

3.

Introduce ObserverReadProxyProvider

Resolved

Chao Sun

4.

[Edit Tail Fast Path] Allow SbNN to tail in-progress edits from JN via RPC

Resolved

Erik Krogen

5.

Make Client field AlignmentContext non-static.

Resolved

Plamen Jeliazkov

6.

Add stateId to RPC headers.

Resolved

Plamen Jeliazkov

7.

Fine-grained locking while consuming journal stream.

Resolved

Konstantin Shvachko

8.

StandbyNode should upload FsImage to ObserverNode after checkpointing.

Resolved

Chen Liang

9.

Add haadmin commands to transition between standby and observer

Resolved

Chao Sun

10.

Support observer reads for WebHDFS

Open

Chao Sun

11.

Allow Observer to participate in NameNode failover

Open

Unassigned

12.

Standby NameNode should roll active edit log when checkpointing

Resolved

Unassigned

13.

Add lastSeenStateId to RpcRequestHeader.

Resolved

Plamen Jeliazkov

14.

HDFS-13522: Add federated nameservices states to client protocol and propagate it between routers and clients.

Resolved

Simbarashe Dzinamarira

15.

Support observer nodes in MiniDFSCluster

Resolved

Konstantin Shvachko

16.

Add ReadOnly annotation to methods in ClientProtocol

Resolved

Chao Sun

17.

[Edit Tail Fast Path Pt 1] Enhance JournalNode with an in-memory cache of recent edit transactions

Resolved

Erik Krogen

18.

[Edit Tail Fast Path Pt 2] Add ability for JournalNode to serve edits via RPC

Resolved

Erik Krogen

19.

[Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via RPC

Resolved

Erik Krogen

20.

[Edit Tail Fast Path Pt 4] Cleanup: integration test, documentation, remove unnecessary dummy sync

Resolved

Erik Krogen

21.

Move RPC response serialization into Server.doResponse

Resolved

Plamen Jeliazkov

22.

Introduce msync API call

Resolved

Chen Liang

23.

NameNodeRpcServer getEditsFromTxid assumes it is run on active NameNode

Open

Unassigned

24.

ClientGCIContext should be correctly named ClientGSIContext

Resolved

Konstantin Shvachko

25.

Use getServiceStatus to discover observer namenodes

Resolved

Chao Sun

26.

Add msync server implementation.

Resolved

Chen Liang

27.

TestStateAlignmentContextWithHA should use real ObserverReadProxyProvider instead of AlignmentContextProxyProvider.

Resolved

Plamen Jeliazkov

28.

Implement performFailover logic for ObserverReadProxyProvider.

Resolved

Erik Krogen

29.

Postpone NameNode state discovery in ObserverReadProxyProvider until the first real RPC call.

Resolved

Chen Liang

30.

Unit tests for standby reads.

Resolved

Unassigned

31.

ObserverReadProxyProvider should work with IPFailoverProxyProvider

Resolved

Konstantin Shvachko

32.

Reduce logging frequency of QuorumJournalManager#selectInputStreams

Resolved

Erik Krogen

33.

Limit logging frequency of edit tail related statements

Resolved

Erik Krogen

34.

Refactor NameNode failover proxy providers

Resolved

Konstantin Shvachko

35.

Remove AlignmentContext from AbstractNNFailoverProxyProvider

Resolved

Konstantin Shvachko

36.

Only some protocol methods should perform msync wait

Resolved

Erik Krogen

37.

ObserverNode should reject read requests when it is too far behind.

Resolved

Konstantin Shvachko

38.

Add mechanism to allow certain RPC calls to bypass sync

Resolved

Chen Liang

39.

Throw retriable exception for getBlockLocations when ObserverNameNode is in safemode

Resolved

Chao Sun

40.

Add a configuration to turn on/off observer reads

Open

Shweta

41.

Handle BlockMissingException when reading from observer

Resolved

Chao Sun

42.

Unit Test for transitioning between different states

Resolved

Sherwood Zheng

43.

Fix crlf line endings in HDFS-12943 branch

Resolved

Konstantin Shvachko

44.

Test reads from standby on a secure cluster with IP failover

Resolved

Chen Liang

45.

TestObserverNode refactoring

Resolved

Konstantin Shvachko

46.

Introduce the single Observer failure

Resolved

Sherwood Zheng

47.

ObserverReadProxyProvider should enable observer read by default

Resolved

Chen Liang

48.

ObserverReadProxyProviderWithIPFailover should work with HA configuration

Resolved

Chen Liang

49.

Emulate Observer node falling far behind the Active

Resolved

Sherwood Zheng

50.

NN status discovery does not leverage delegation token

Resolved

Chen Liang

51.

Test reads from standby on a secure cluster with Configured failover

Resolved

Plamen Jeliazkov

52.

Allow manual failover between standby and observer

Resolved

Chao Sun

53.

Allow manual transition from Standby to Observer

Resolved

Unassigned

54.

Fix the order of logging arguments in ObserverReadProxyProvider.

Resolved

Ayush Saxena

55.

Fix class cast error in NNThroughputBenchmark with ObserverReadProxyProvider.

Resolved

Chao Sun

56.

ORFPP should also clone DT for the virtual IP

Resolved

Chen Liang

57.

Make ZKFC ObserverNode aware

Resolved

xiangheng

58.

Create user guide for "Consistent reads from Observer" feature.

Resolved

Chao Sun

59.

Move ipfailover config key out of HdfsClientConfigKeys

Resolved

Chen Liang

60.

Handle exception from internalQueueCall

Resolved

Chao Sun

61.

Adjust annotations on new interfaces/classes for SBN reads.

Resolved

Chao Sun

62.

Description errors in the comparison logic of transaction ID

Resolved

xiangheng

63.

Update "Consistent Read from Observer" User Guide with Edit Tailing Frequency

Resolved

Erik Krogen

64.

Document dfs.ha.tail-edits.period in user guide.

Resolved

Chao Sun

65.

ObserverReadInvocationHandler should implement RpcInvocationHandler

Resolved

Konstantin Shvachko

66.

Balancer should work with ObserverNode

Resolved

Erik Krogen

67.

Fix white spaces related to SBN reads.

Resolved

Konstantin Shvachko

68.

[SBN read] Unclear Log.WARN message in GlobalStateIdContext

Resolved

Shweta

69.

[SBN Read] StateId and TrasactionId not present in Trace level logging

Resolved

Shweta

70.

Throwing RemoteException in the time of Read Operation

Resolved

Unassigned

71.

[SBN Read] Add the document link to the top page

Resolved

Takanobu Asanuma

72.

[SBN read] Got an unexpected txid when tail editlog

Resolved

Zhaohui Wang

73.

Fix logging error in TestEditLog#testMultiStreamsLoadEditWithConfMaxTxns

Resolved

Jonathan Hung

74.

[SBN read] Change client logging to be less aggressive

Resolved

Chen Liang

75.

[SBN read] StanbyNode does not come out of safemode while adding new blocks.

Resolved

Unassigned

76.

[SBN read] reportBadBlock is rejected by Observer.

Open

Unassigned

77.

[SBN read] Revisit GlobalStateIdContext locking when getting server state id

Resolved

Chen Liang

78.

[SBN read] Allow configurably enable/disable AlignmentContext on NameNode

Resolved

Chen Liang

79.

Prevent Observer NameNode from becoming StandBy NameNode

Resolved

Aihua Xu

80.

RBF: Support observer node from Router-Based Federation

Resolved

Simbarashe Dzinamarira

Activity

Konstantin Shvachko added a comment - 19/Dec/17 20:18

The design document covers motivation, main requirements, and potential solutions. It describes the consistency model, gives examples and use cases, introduces the new API, and discusses implementation details. The roadmap lists four major stages and sets HDFS-10702 as the initial stage.

Erik Krogen added a comment - 19/Dec/17 21:49

We have been running some performance experiments (using Dynamometer) to try to determine just how large the potential benefits to be gained by this feature are. Using the tool, we replayed a few hours of traces from a production cluster against a simulated NameNode, filtering out different % of read requests to mimic the ANN's point-of-view of requests going to the standby. We tried filtering out 0%, 20%, 50%, and 100% of read requests, and also tried replaying our write workload only at 2x and 4x speed to get an estimate of throughput under the ideal (all reads offloaded) conditions.

Metric                          0% Skip  20% Skip  50% Skip  100% Skip  100% Skip (2x)  100% Skip (4x)
Average Write Latency (ms)      52.8     28.5      18.0      14.0       27.0            73.2
Average Read Latency (ms)       34.3     20.0      11.5      N/A        N/A             N/A
RPC Queue AvgTime (ms)          23.0     11.9      7.4       1.7        4.3             20.7
RPC Queue 50th Percentile (ms)  2.81     0.52      0.47      0.05       0.05            0.04
RPC Queue 90th Percentile (ms)  24.42    12.51     9.98      0.12       1.49            12.96
RPC Queue NumOps (k)            31.0     25.2      16.3      1.5        3.0             6.0
LockQueueLength Average         45.3     24.9      18.9      7.0        12.5            30.6
GC Time (ms)                    9.62     7.94      6.13      1.94       3.03            5.49

The results above indicate that, if we were able to offload all read requests, we should expect up to 4x throughput improvement for the write workload.
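
One way to read the table is to compare each offloaded configuration against the unfiltered (0% Skip) baseline. The sketch below only restates arithmetic on values copied from the table; the class and method names are hypothetical, and which rows best support the 4x estimate is an interpretation rather than something stated in the comment. Note that at 4x write-only replay the average RPC queue time (20.7 ms) stays below the baseline (23.0 ms), while the average write latency (73.2 ms) is above the baseline (52.8 ms).

```java
import java.util.Locale;

/** Hypothetical helper that compares table values against the 0% Skip baseline. */
final class OffloadRatios {
    /** How many times smaller the measured value is than the baseline. */
    static double reduction(double baseline, double measured) {
        return baseline / measured;
    }

    public static void main(String[] args) {
        // 0% Skip vs. 100% Skip at normal replay speed.
        double writeLatency = reduction(52.8, 14.0);
        double queueAvg = reduction(23.0, 1.7);
        // 0% Skip vs. 100% Skip with the write workload replayed at 4x speed.
        double queueAvg4x = reduction(23.0, 20.7);
        double writeLatency4x = reduction(52.8, 73.2);

        System.out.printf(Locale.ROOT, "write latency, 100%% skip: %.2fx%n", writeLatency);
        System.out.printf(Locale.ROOT, "queue avg time, 100%% skip: %.2fx%n", queueAvg);
        System.out.printf(Locale.ROOT, "queue avg time, 4x writes: %.2fx%n", queueAvg4x);
        System.out.printf(Locale.ROOT, "write latency, 4x writes: %.2fx%n", writeLatency4x);
    }
}
```

A ratio above 1 means the offloaded configuration is better than the baseline for that metric; a ratio below 1 means it is worse.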

Christopher Douglas added a comment - 19/Dec/17 22:10

Thanks for the document and benchmarking. This is really cool.

Right now, writes are effectively throttled by blocking reads e.g., conditional checks before doing a rename. So if the NN is under heavy load, most applications will appear to back off because all these operations are blocking. If StandbyNodes serve many of these reads, then the write rate to the primary NameNode will increase. Have you tried running workloads against the PoC to get a sense for the "natural" increase in write traffic? In some deployments, would it make sense to disallow reads from the primary to prevent clients from harming overall cluster throughput?

Konstantin Shvachko added a comment - 20/Dec/17 06:17

Chris, I do not have POC numbers. I believe csun can elaborate on this.
I agree reads are blocking writes on NN.
Disallowing reads on active NN is an interesting twist. The design proposes a new client-side config variable to enable reads from SBN. I think we can have another one to disable reads from ANN:

dfs.client.standby.reads.enabled = true - enables reads from standby
dfs.client.active.reads.enabled = false - disables reads on active and directs them exclusively to standby
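
How a client might combine the two proposed flags can be sketched as follows. This is an illustration only: the ReadRouting class, its method names, and the fallback and error behavior are assumptions that are not specified in this comment; only the two configuration key names come from the proposal above.

```java
/** Hypothetical routing decision driven by the two proposed client-side flags. */
final class ReadRouting {
    enum Target { ACTIVE, STANDBY }

    /**
     * @param standbyReadsEnabled value of dfs.client.standby.reads.enabled
     * @param activeReadsEnabled  value of dfs.client.active.reads.enabled
     * @param standbyReachable    whether a standby can currently serve the read (assumed input)
     */
    static Target routeRead(boolean standbyReadsEnabled,
                            boolean activeReadsEnabled,
                            boolean standbyReachable) {
        if (!standbyReadsEnabled && !activeReadsEnabled) {
            throw new IllegalArgumentException("reads are disabled on every NameNode");
        }
        if (standbyReadsEnabled && standbyReachable) {
            return Target.STANDBY;
        }
        if (activeReadsEnabled) {
            // Assumed fallback: use the active when the standby cannot serve the read.
            return Target.ACTIVE;
        }
        throw new IllegalStateException("standby unreachable and reads on active are disabled");
    }

    /** Writes always go to the active NameNode. */
    static Target routeWrite() {
        return Target.ACTIVE;
    }
}
```

With both flags at the values shown in the comment (standby reads on, active reads off), every read goes to the standby and, in this sketch, fails rather than falling back when no standby is reachable.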

Chao Sun added a comment - 20/Dec/17 07:23

chris.douglas, I ran some experiments with the POC patch on 2.8.3. The test uses 5000 containers to issue read/write requests that mimic production workloads (~95% reads, ~5% writes).
With stale reads enabled, I observed around 60-80K throughput on the SBN, and around 20K write throughput on the ANN. Without stale reads, the total throughput on the ANN was around 35-40K.
Also, with stale reads, the write throughput on the ANN was 2-2.5X higher, while the GC time dropped from around 6s/min to 2s/min.

Hope this helps, and let me know if you need more data.

Zhe Zhang added a comment - 20/Dec/17 08:49 - edited

Thanks csun, interesting results! You used only 1 SBN to serve reads, right? In both configurations (with and without stale reads), I assume you were saturating the system? It's interesting to see that with two NNs serving RPCs (1 ANN + 1 SBN), the throughput actually more than doubled compared to 1 ANN. Did you use Namesystem unfair locking?

If I understand correctly, both your test and the Dynamometer test are more like trace-driven micro benchmarks, where a container issues a certain type of RPC at a given timestamp. Chris was probably referring to a test job with "real code" like if !file_exists(path) then create_file(path), where the blocking relationship between calls is mimicked.

chris.douglas: the "natural" increase of write traffic is an interesting question. I don't think the feature will increase the total number of write RPCs (a given job will still issue that many writes overall). Writes within a job could become more bursty, but the job itself will run for a shorter time. Statistically, the 1000s of jobs on the cluster would probably smooth out this increased burstiness.

Chao Sun added a comment - 20/Dec/17 18:31

Thanks Chao Sun, interesting results! You used only 1 SBN to serve reads, right?

Yes, I used 1 ANN + 1 SBN + 1 ONN (observer NN).

In both configurations (with and without stale reads), I assume you were saturating the system?

In the stale read case, the RPC queue time on the ANN was less than 5ms, while on the ONN it was between 0 and 30ms. In the non-stale read case, the RPC queue time on the ANN was around 130-140ms. So I guess the ANN was not saturated when stale reads were enabled?

Did you use Namesystem unfair locking?

The ANN didn't use unfair locking. The ONN used unfair locking + async audit logging (we have an internal patch to use log4j 2.x) + async edit logging. Do you think it will make a difference if unfair locking is used on ANN?

If I understand correctly, both your test and the Dynamometer test are more like trace-driven micro benchmarks, where a container issues a certain type of RPC at a given timestamp. Chris was probably referring to a test job with "real code" like if !file_exists(path) then create_file(path), where the blocking relationship between calls is mimicked.

Yes the test was pretty simple. It is basically:

loop {
  x = randInt(0, 100)
  if (x < 6) {
    fs.createNewFile(..)
    fs.rename(..)
    fs.delete(..)
  } else if (x < 10) {
    fs.listStatus(..)
  } else if (x < 40) {
    fs.getFileBlockLocations(..)
  } else {
    fs.getFileStatus(..)
  }
}

The file listing was done on a directory with 2K files.
Let me know if you have any suggestion on improving this. It's pretty easy to change the code and re-run the benchmark.
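
The operation mix implied by the loop above can be checked with a small simulation that counts branches instead of calling a file system. This is a sketch under one assumption that the pseudocode leaves open: randInt(0, 100) is treated as uniform over the integers 0 to 99. The WorkloadMix class and the Op names are hypothetical.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical stand-in for the benchmark loop that only counts which branch is taken. */
final class WorkloadMix {
    enum Op { WRITE_SEQUENCE, LIST_STATUS, GET_BLOCK_LOCATIONS, GET_FILE_STATUS }

    /** Maps a draw x in [0, 100) to the branch taken by the benchmark loop. */
    static Op pick(int x) {
        if (x < 6) {
            return Op.WRITE_SEQUENCE;      // createNewFile + rename + delete
        } else if (x < 10) {
            return Op.LIST_STATUS;
        } else if (x < 40) {
            return Op.GET_BLOCK_LOCATIONS;
        }
        return Op.GET_FILE_STATUS;
    }

    /** Runs the loop for a fixed number of iterations and returns per-branch counts. */
    static Map<Op, Integer> simulate(int iterations, long seed) {
        Random rng = new Random(seed);
        Map<Op, Integer> counts = new EnumMap<>(Op.class);
        for (Op op : Op.values()) {
            counts.put(op, 0);
        }
        for (int i = 0; i < iterations; i++) {
            counts.merge(pick(rng.nextInt(100)), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        simulate(1_000_000, 42L).forEach((op, n) -> System.out.println(op + " " + n));
    }
}
```

Under that assumption, 6% of iterations take the write branch (each issuing three write calls), 4% call listStatus, 30% call getFileBlockLocations, and 60% call getFileStatus.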

Virajith Jalaparti added a comment - 21/Dec/17 01:59

Hi shv, thanks for posting the design document. One thing that wasn't clear to me from the design doc itself was what's the function of the Observer Nodes. Are these what the clients actually use to read, instead of the real SBN?
Further, what's the goal of having them? Is it to reduce the load on the SBN further, or to allow graceful degradation during failures of the NN/SBN?

Konstantin Shvachko added a comment - 21/Dec/17 22:26

what's the function of the Observer Nodes

Good question. The design doc says that an Observer Node is an SBN that does not do checkpoints. Checkpointing degrades the performance of the SBN, so we won't be able to read from it when it's busy. So it's more like a term to distinguish the node which is dedicated to reading: the read-only SBN. A regular SBN is also needed, though, if we want checkpointing and HA on the cluster, which I do. In the "Note on HA" we talk about some failover scenarios, noting that reading from the ObserverNode elevates its role on the cluster, so you may need to run multiple of them to sustain the response rate in case of failure.

Virajith Jalaparti added a comment - 02/Jan/18 20:18

Thanks for the clarification shv

Konstantin Shvachko added a comment - 21/Mar/18 02:02

Cut the branch origin/HDFS-12943. When committing please do not forget:

To prepend jira description with [SBN read]. This should help to distinguish the branch from the trunk commits.
Merge trunk to the branch before committing.

Konstantin Shvachko added a comment - 23/Mar/18 21:39

Updated the design doc. Included a section in Implementation details describing startup sequence, configuration for NameNodes, and state transitions. Also added references to fast path for tailing edits.

Xiao Chen added a comment - 02/Jun/18 00:05

Thanks all for the work! (and sorry for the late response here) Just read through the design doc and the comments, looks great!

I have 2 questions:

About 'Optimization 1':

Currently atime is created to be the same as mtime, and only gets updated if "dfs.namenode.accesstime.precision" has passed. Does this mean we require a really small atime precision? (Anecdotally, a snapshot will capture a diff on the inode if atime is different. So if someone takes daily snapshots for a week, an atime precision of a week will only result in 1 object being created, while an atime precision < 1 day will result in 7.)

About Observer nodes:

How is the failover handled? Currently ANN <--> SBN is done by the failover controller racing to write to ZooKeeper. For the observer node <--> SBN transition, how is it done?

Chao Sun added a comment - 07/Jun/18 05:23

xiaochen, on the second question: currently the transition from SBN to Observer is done via a haadmin command, haadmin -transitionToObserver, and vice versa you can transition an Observer to SBN via haadmin -transitionToStandby. There is no automatic transition between the two, and no transition is allowed between Observer and ANN.

In terms of failover, details are yet to be discussed. Ideally we'd like to allow Observer to participate in the failover too but it is yet to be resolved. I did some preliminary work on that which you can find in the comments of ~~HDFS-12975~~. The failover handling is tracked by HDFS-13182.
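
The transition rules described in this comment can be summarized as a small state model. This is a sketch, not the NameNode's code: the HaTransitions class and its methods are hypothetical. The Standby and Observer edges and the missing Observer to Active edge follow the comment above; the Active and Standby edges reflect the existing HA failover path.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical model of which HA state transitions are permitted. */
final class HaTransitions {
    enum State { ACTIVE, STANDBY, OBSERVER }

    /** States directly reachable from the given state in this model. */
    static Set<State> allowedFrom(State from) {
        switch (from) {
            case ACTIVE:
                return EnumSet.of(State.STANDBY);
            case STANDBY:
                // haadmin -transitionToObserver moves a Standby to Observer.
                return EnumSet.of(State.ACTIVE, State.OBSERVER);
            case OBSERVER:
                // haadmin -transitionToStandby is the only way out of Observer.
                return EnumSet.of(State.STANDBY);
            default:
                throw new AssertionError(from);
        }
    }

    static boolean isAllowed(State from, State to) {
        return allowedFrom(from).contains(to);
    }
}
```

In this model an Observer has to pass through Standby before it could ever become Active, which matches the statement that no direct transition between Observer and ANN is allowed.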

Konstantin Shvachko added a comment - 04/Sep/18 23:37

Attached Test Plan document.

xiangheng added a comment - 12/Nov/18 07:58

Thanks csun, I configured hdfs-site.xml according to the plan document and ran the haadmin command haadmin -transitionToObserver, but the transition from SBN to Observer state failed with the message: transitionToObserver: incorrect arguments. Can you describe the configuration related to the observer namenode in detail? Thank you very much.

Chen Liang added a comment - 12/Nov/18 18:31

xiangheng thanks for trying Observer read! What was the full command you ran? It should be something like hdfs haadmin -transitionToObserver <nnID> where nnID is the ID of the name node that you want to transition to Observer. You can run hdfs haadmin -getAllServiceState to list all the valid nnIDs in the cluster.

xiangheng added a comment - 15/Nov/18 02:44

Thanks vagarychen (and sorry for the late response). I have successfully transitioned the NameNode from Standby to Observer, but I had to set ha.automatic-failover=false and stop the ZKFC process. Should the state transition also be supported while ha.automatic-failover is enabled? Thank you very much.

Chao Sun added a comment - 15/Nov/18 03:18

xiangheng: we are still working on the support for state transition between standby/observer in the auto failover environment. You can watch ~~HDFS-14067~~, HDFS-13182 and ~~HDFS-14059~~ for more detailed information.

At the moment, one workaround is to not launch ZK failover controller on the host where the observer is at. Let me know if this works for you.

xiangheng added a comment - 15/Nov/18 08:43

Thanks csun, I tried this workaround but it failed:

one workaround is to not launch ZK failover controller on the host where the observer is at. Let me know if this works for you.

I have three NameNodes (nn1, nn2, nn3). If I run ZK failover controllers for nn1 and nn2 and then try to transition nn3 from Standby to Observer, it fails:

Refusing to manually manage HA state, since it may cause
a split-brain scenario or other incorrect state.
If you are very sure you know what you are doing, please
specify the --forcemanual flag.
journal# hdfs haadmin -transitionToObserver --forcemanual nn3
transitionToObserver: incorrect arguments
I will follow ~~HDFS-14067~~, HDFS-13182 and ~~HDFS-14059~~. Thanks for your suggestions.

Chao Sun added a comment - 15/Nov/18 17:41

xiangheng you are right - one more patch is required to make this work - you can check ~~HDFS-14067~~ for the fix. Thanks.

xiangheng added a comment - 16/Nov/18 07:10 - edited

Hi csun, I am glad to discuss this question with you. I checked ~~HDFS-14067~~ and ran a test, and the problem still seems to be unsolved. If you agree, I will open a new issue and try my best to solve it. Please let me know if you have any suggestions. Thank you very much.

Konstantin Shvachko added a comment - 06/Dec/18 00:36

Submitting a unified patch for ~~HDFS-12943~~ branch for review and for a Jenkins run.

Íñigo Goiri added a comment - 06/Dec/18 18:00

Is there a JIRA tracking the documentation/user guide?
I think we should be able to push that fairly fast.

Konstantin Shvachko added a comment - 06/Dec/18 19:08

Hey goiri, see ~~HDFS-14131~~ - the documentation jira.

Brahma Reddy Battula added a comment - 14/Dec/18 04:32 - edited

Thanks all for great work here.

I think write requests can be degraded, since they also involve some read requests such as getFileInfo() and getServerDefaults() (getHAServiceState() is newly added).

I checked mkdir performance; the results are below.

i) getHAServiceState() took 2+ sec (3 getHAServiceState() + 2 getFileInfo() + 1 mkdirs = 6 calls).
ii) Every second request times out [1] and the RPC call is skipped from the Observer (7 getHAServiceState() + 4 getFileInfo() + 1 mkdirs = 12 calls). Here two getFileInfo() calls were skipped from the Observer and therefore succeeded against the Active.

time hdfs --loglevel debug dfs -Ddfs.client.failover.proxy.provider.hacluster=org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider -mkdir /TestsORF1
real 0m4.314s
user 0m3.668s
sys 0m0.272s
time hdfs --loglevel debug dfs -Ddfs.client.failover.proxy.provider.hacluster=org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider -mkdir /TestsORF2
real 0m22.238s
user 0m3.800s
sys 0m0.248s

Without ObserverReadProxyProvider (2 getFileInfo() + 1 mkdirs() = 3 calls):

time ./hdfs --loglevel debug dfs  -mkdir /TestsCFP
real 0m2.105s
user 0m3.768s
sys 0m0.592s

Please correct me if I am missing anything.

[1] Timeout: on every second write request I get the following. Did I miss something here? These calls are skipped from the Observer.

2018-12-14 11:21:45,312 DEBUG ipc.Client: closing ipc connection to vm1/10.*.*.*:65110: 10000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.*.*.*:58409 remote=vm1/10.*.*.*:65110]
java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.*.*.*:58409 remote=vm1/10.*.*.*:65110]
 at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
 at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
 at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
 at java.io.FilterInputStream.read(FilterInputStream.java:133)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
 at java.io.FilterInputStream.read(FilterInputStream.java:83)
 at java.io.FilterInputStream.read(FilterInputStream.java:83)
 at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:567)
 at java.io.DataInputStream.readInt(DataInputStream.java:387)
 at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1849)
 at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1183)
 at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1079)
2018-12-14 11:21:45,313 DEBUG ipc.Client: IPC Client (1006094903) connection to vm1/10.*.*.*:65110 from brahma: closed

Erik Krogen added a comment - 14/Dec/18 16:49 - edited

Hey brahmareddy, thanks for trying it out and for the detailed feedback!

I think when we discuss a "request", we need to differentiate an RPC request originating from a Java application (MapReduce task, etc.) vs. a CLI request. The former will be the vast majority of operations on a typical cluster, so I would argue that optimizing for the performance and efficiency of that usage is much more important. The ObserverReadProxyProvider does have higher startup overheads as it directly polls for the state rather than just blindly trying its request; however, in an application which performs more than a few RPCs, this cost will be easily amortized away. I don't think it's fair to say that "write" performance is degraded simply because hdfs dfs -mkdirs takes longer; a benchmark running 100+ mkdirs would be a better measure IMO. If CLI performance is important, such clients can continue to use ConfiguredFailoverProxyProvider and communicate with the active directly.
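The amortization argument above can be made concrete with a little arithmetic. The sketch below is only an illustration: the 2000 ms startup cost and 2 ms per-call cost are hypothetical placeholders, not measurements from this thread, and `AmortizedStartup` is not part of Hadoop.

```java
public class AmortizedStartup {
    /**
     * Average latency per RPC when a one-time startup cost
     * (e.g. probing NameNode states) is spread over n calls.
     */
    static double averageMillis(double startupMillis, double perCallMillis, int n) {
        return startupMillis / n + perCallMillis;
    }

    public static void main(String[] args) {
        double startup = 2000.0; // hypothetical one-time proxy initialization cost
        double perCall = 2.0;    // hypothetical steady-state cost of one RPC
        for (int n : new int[] {1, 10, 100, 1000}) {
            System.out.printf("%d calls: %.1f ms per call on average%n",
                n, averageMillis(startup, perCall, n));
        }
    }
}
```

Under these assumed numbers a single CLI invocation pays the full startup cost, while a client issuing hundreds of calls sees it shrink toward the steady-state cost.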

The timeout you have shared is interesting. I suspect that it may be caused by the Observer trying to wait for its state to catch up to the stateID requested by your getFileInfo. I have a few questions:

1. Are you running with ~~HDFS-13873~~? With this patch (only committed yesterday so I doubt you have it) the exception thrown should be more meaningful.
2. Did you remember to enable in-progress edit log tailing?
3. Was this run on an almost completely stagnant cluster (no other writes)? This can make the ANN flush its edits to the JNs less frequently, increasing the lag time between ANN and Observer.

Brahma Reddy Battula added a comment - 16/Dec/18 19:35 - edited

I think when we discuss a "request", we need to differentiate an RPC request originating from a Java application (MapReduce task, etc.) vs. a CLI request. The former will be the vast majority of operations on a typical cluster, so I would argue that optimizing for the performance and efficiency of that usage is much more important.

Agreed, I should have mentioned the CLI. But the getHAServiceState() call from ORP took 2s+, as I mentioned above. By the way, my intent was to ask: when reads and writes are combined in a single application, how much impact will there be, given that it needs to switch?

Just out of curiosity, do we have write benchmarks with and without ORP? I didn't find any in ~~HDFS-14058~~ and ~~HDFS-14059~~.

1.Are you running with ~~HDFS-13873~~? With this patch (only committed yesterday so I doubt you have it) the exception thrown should be more meaningful.

Yes,with latest ~~HDFS-12943~~ branch.

2.Did you remember to enable in-progress edit log tailing?

Yes, enabled on all three NNs.

3.Was this run on an almost completely stagnant cluster (no other writes)? This can make the ANN flush its edits to the JNs less frequently, increasing the lag time between ANN and Observer.

Yes,no other writes.

I tried the following test with and without ObserverReadProxyProvider and found that the performance impact depends on the edit tailing period ("dfs.ha.tail-edits.period"), which defaults to 1 minute (in tests it is 100 ms).

@Test
public void testSimpleRead() throws Exception {
  // Relies on the dfs, testPath and assertSentTo members of the surrounding test class.
  // Accumulated latencies in ms: mkdirs (avg), getFileStatus (avgL), getContentSummary (avgC).
  long avg = 0;
  long avgL = 0;
  long avgC = 0;
  int num = 100;
  for (int i = 0; i < num; i++) {
    Path testPath1 = new Path(testPath, "test1" + i);

    // Write: time mkdirs, then assert the call was sent to NameNode 0.
    long startTime = System.currentTimeMillis();
    assertTrue(dfs.mkdirs(testPath1, FsPermission.getDefault()));
    long l = System.currentTimeMillis() - startTime;
    System.out.println("time TakenL1: " + i + " : " + l);
    avg = avg + l;
    assertSentTo(0);

    // First read after the write: time getContentSummary, assert it was sent to NameNode 2.
    long startTime2 = System.currentTimeMillis();
    dfs.getContentSummary(testPath1);
    long C = System.currentTimeMillis() - startTime2;
    System.out.println("time TakengetContentSummary: " + i + " : " + C);
    avgC = avgC + C;
    assertSentTo(2);

    // Second read: time getFileStatus, assert it was sent to NameNode 2.
    long startTime1 = System.currentTimeMillis();
    dfs.getFileStatus(testPath1);
    long L = System.currentTimeMillis() - startTime1;
    System.out.println("time TakengetFileStatus: " + i + " : " + L);
    avgL = avgL + L;
    assertSentTo(2);
  }
  System.out.println("AVG: mkDir: " + avg / num + " List: " + avgL / num + " Cont: " + avgC / num);
}

IMO, configuring a smaller value (like 100 ms) for reading in-progress edits puts load on the JournalNode until the log roll happens (2 minutes by default), since it opens a stream to read the edits.

Apart from performance, I have the following queries:
i) Did we try with the C/C++ client?
ii) Are we planning separate client-side metrics for observer reads? They might be helpful for applications like MapReduce, for example as job counters.

Chen Liang added a comment - 17/Dec/18 19:00 - edited

Hi brahmareddy,

Thanks for testing! The timeout issue seems interesting. To start with, some performance degradation from the CLI is expected: the CLI instantiates a new DFSClient for each command, and a fresh DFSClient has to get the status of the NameNodes every time. If the same DFSClient is reused, this is not an issue. I have never seen the second-call issue. Here is the output from our cluster (log output omitted). I think you are right about lowering dfs.ha.tail-edits.period; we had similar numbers here:

$time hdfs --loglevel debug dfs -Ddfs.client.failover.proxy.provider.***=org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider -mkdir /TestsORF1
real	0m2.254s
user	0m3.608s
sys	0m0.331s
$time hdfs --loglevel debug dfs -Ddfs.client.failover.proxy.provider.***=org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider -mkdir /TestsORF2
real	0m2.159s
user	0m3.855s
sys	0m0.330s

Curious, how many NNs did you have in the testing? And were there any errors in the NN logs?

Chao Sun added a comment - 17/Dec/18 19:10

I think we should document dfs.ha.tail-edits.period in the user guide - the default value is just too large for observer reads. Filed ~~HDFS-14154~~.

Erik Krogen added a comment - 17/Dec/18 19:37 - edited

Bytheway My intent was when read/write are combined in single application how much will be impact as it needs switch?

There will only be potential performance impact when switching from writes (sent to Active) to reads (sent to Observer) since the client may need to wait some time for the state on the Observer to catch up. Experience when designing ~~HDFS-13150~~ indicated that this delay time could be reduced to a few ms when properly tuned, which would make the delay of switching from Active to Observer negligible. See the design doc, especially Appendix A, for more details.
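The catch-up wait described here follows from the state ID rule in the release note: the Observer answers a read only once its own state has reached the state ID the client last received from the Active. A minimal sketch of that check, using transaction-ID-like numbers; the class and method names are hypothetical and do not mirror the actual NameNode code:

```java
public class StateIdAlignment {
    /** True if the Observer has applied at least everything the client has already seen. */
    static boolean canServe(long observerLastAppliedId, long clientStateId) {
        return observerLastAppliedId >= clientStateId;
    }

    /** Number of state IDs the Observer still has to apply before it may answer. */
    static long lag(long observerLastAppliedId, long clientStateId) {
        return Math.max(0L, clientStateId - observerLastAppliedId);
    }

    public static void main(String[] args) {
        long clientStateId = 105; // hypothetical ID returned by the Active after a write
        System.out.println(canServe(100, clientStateId)); // false: Observer is 5 behind
        System.out.println(lag(100, clientStateId));      // 5
        System.out.println(canServe(105, clientStateId)); // true: read can be served
    }
}
```

A client that only reads never raises its own state ID past what the Observer has applied, which is why the wait shows up only on the write-to-read switch.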

Just for curiosity,,do we've write benchmarks with and without ORP,as I didn't find from ~~HDFS-14058~~ and ~~HDFS-14059~~?

There are some preliminary performance numbers shared in my earlier comment in this thread. I'm not aware of any good benchmark numbers produced after finishing the feature, maybe csun can provide them?

Tried the following test with and with ORF,Came to know it's(perf impact) based on the tailing edits("dfs.ha.tail-edits.period") which is default 1m.(In tests, it's 100MS)..
...
IMO,Configuring less value(like 100ms) for reading ingress edits put load on journalnode till log roll happens(2mins by default),as it's open the stream to read the edits.

I think I now understand the issue that you were facing. To use this feature correctly, in addition to setting dfs.ha.tail-edits.in-progress to true, you should also set dfs.ha.tail-edits.period to a small value; in our case I think we use 0 or 1 ms. Your concern about heavier load in the JournalNode would have previously been valid, but with the completion of ~~HDFS-13150~~ and dfs.ha.tail-edits.in-progress enabled, the Standby/Observer no longer creates a new stream to tail edits, instead polling for edits via RPC (and thus making use of connection keepalive). This greatly reduces the overheads involved with each iteration of edit tailing, enabling it to be done much more frequently. I created ~~HDFS-14155~~ to track updating the documentation with this information.

i) Did we try with C/CPP client..?

We haven't developed any support for these clients, no. They should continue to work on clusters with the Observer enabled but will not be able to take advantage of the new functionality.

ii)are we planning separate metrics for observer reads(Client Side),Application like mapred might helpful for job counters?

There are no metrics like this on the client side at this time; we are relying on server-side metrics, but I agree that this could be a useful addition.

Erik Krogen added a comment - 17/Dec/18 19:39

Whoops, took too long writing my comment. Thanks for also addressing the tail-edits period issue in the documentation, Chao. Will close mine as duplicate.

Brahma Reddy Battula added a comment - 18/Dec/18 03:32

Hi vagarychen

I have never seen the second-call issue. Here is an output from our cluster (log outpu part omitted), and I think you are right about lowering dfs.ha.tail-edits.period, we had similar numbers here:

You can see this issue if "dfs.ha.tail-edits.period" is left at its default value.

Curious, how many NN you had in the testing? and was there any error from NN logs?

1 ANN, 1 SNN, 1 Observer. No error logs from the NNs.

Hi csun

I think we should document dfs.ha.tail-edits.period in the user guide - the default value is just too large for observer reads. Filed ~~HDFS-14154~~.

Yes, thanks for reporting the same.

Hi xkrogen

Your concern about heavier load in the JournalNode would have previously been valid, but with the completion of ~~HDFS-13150~~ and dfs.ha.tail-edits.in-progress enabled, the Standby/Observer no longer creates a new stream to tail edits, instead polling for edits via RPC (and thus making use of connection keepalive). This greatly reduces the overheads involved with each iteration of edit tailing, enabling it to be done much more frequently.

Yes, this was one of my concerns. I have gone through the fast path (~~HDFS-13150~~), thanks; it can improve this.

I'm not aware of any good benchmark numbers produced after finishing the feature, maybe csun can provide them?

csun, can you provide them? I am sure this feature will be a great advantage for the RPC workload on the ANN; I just want to know the write benchmarks as well (since getHAServiceState() and fast edit tailing are introduced). Sorry for pitching in very late.

Chao Sun added a comment - 18/Dec/18 06:36

brahmareddy xkrogen: unfortunately I can't provide enough data points on this. In our production we deployed a slightly different version than upstream - the observer hosts are fixed in config so no getHAServiceState is issued (on the downside, the observer cannot participate in failover). I do intend to run some benchmarks with the latest upstream code though. Perhaps I will update later.

Chen Liang added a comment - 18/Dec/18 20:57

Hi brahmareddy

you can see this issue if "dfs.ha.tail-edits.period" is default value.

Yes, with the default period of 1min, any read can take up to 1min to finish; this is not specific to the "second" call you mentioned, but applies to any read. I agree that we need to lower this value. In our environment we have already set it to 100ms, and with this setting I have never seen the second call always timing out as you described, nor getServiceState taking 2 seconds. I was under the impression that you still had the timeout even after setting it to 100ms?
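
To make the latency argument above concrete, here is a toy model (not HDFS code) of an Observer that applies new edits only on a fixed tailing schedule. The class and method names are made up for illustration, and the model ignores JournalNode and edit-apply latency. It only shows why a read issued right after a write can wait almost a full dfs.ha.tail-edits.period.

```java
/** Toy model, not HDFS code: read wait on an Observer that tails edits periodically. */
final class TailLatencyModel {
    /**
     * Assumption: the Observer picks up new edits only at multiples of periodMs.
     * A write committed on the Active at writeTimeMs becomes visible on the
     * Observer at the next tailing tick, so a read issued immediately after
     * the write waits until that tick.
     */
    static long waitMs(long writeTimeMs, long periodMs) {
        // First tick strictly after the write.
        long nextTick = ((writeTimeMs / periodMs) + 1) * periodMs;
        return nextTick - writeTimeMs;
    }
}
```

In this model, `TailLatencyModel.waitMs(1, 60000)` gives 59999 ms for the 1min default, while `TailLatencyModel.waitMs(1, 100)` gives 99 ms for the 100ms setting.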

Chen Liang added a comment - 19/Dec/18 22:02

Hi brahmareddy,

Some more notes to add:
1. getHAServiceState() only gets called when client proxies are initialized (and, of course, when existing proxies fail and the client reinitializes them). In regular operation this call does not happen, so it should not be a concern in benchmarks.
2. I tried the unit test you shared locally with Observer reads enabled and disabled. I did not see any difference in mkdir time; it stayed at about 2ms regardless. I did see some degradation in getContentSummary, though. This is because the unit test does mkdir -> getContentSummary -> getFileStatus -> repeat, so the client is constantly switching between writes and reads, and thus between proxies (NNs). This is not the IO pattern Observer mainly targets, and it is probably the worst case for Observer reads, because every single getContentSummary call here could trigger an Observer catch-up wait.
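
The catch-up wait in point 2 can be illustrated with a small state-ID model. This is a sketch under assumptions, not the actual client or NameNode code: all names are hypothetical, transaction IDs are plain counters, and the Observer is assumed to catch up fully once per round.

```java
/** Toy model of state-ID alignment between a client, the Active, and an Observer. */
final class ObserverAlignmentModel {
    private long activeTxId = 0;     // last transaction committed on the Active
    private long observerTxId = 0;   // last transaction applied on the Observer
    private long clientStateId = 0;  // last state ID the client received from the Active

    /** A write goes to the Active and advances the state ID the client carries. */
    void write() {
        activeTxId++;
        clientStateId = activeTxId;
    }

    /** One tailing round: the Observer catches up to the Active. */
    void tailEdits() {
        observerTxId = activeTxId;
    }

    /** True if a read sent now would have to wait for the Observer to catch up. */
    boolean readMustWait() {
        return observerTxId < clientStateId;
    }

    /** Counts reads that would wait when each round is [write?] -> read -> tail. */
    static int countWaitingReads(int rounds, boolean writeBeforeEachRead) {
        ObserverAlignmentModel m = new ObserverAlignmentModel();
        int waiting = 0;
        for (int i = 0; i < rounds; i++) {
            if (writeBeforeEachRead) {
                m.write();
            }
            if (m.readMustWait()) {
                waiting++;
            }
            m.tailEdits();
        }
        return waiting;
    }
}
```

With a write before every read, every read in the model has to wait (`countWaitingReads(5, true)` is 5). A read-only client never waits (`countWaitingReads(5, false)` is 0), which is the pattern the feature is aimed at.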

Hadoop QA added a comment - 22/Dec/18 06:14

-1 overall

Vote	Subsystem	Runtime	Comment
0	reexec	0m 22s	Docker mode activated.
			Prechecks
+1	@author	0m 0s	The patch does not contain any @author tags.
+1	test4tests	0m 0s	The patch appears to include 26 new or modified test files.
			trunk Compile Tests
0	mvndep	1m 0s	Maven dependency ordering for branch
+1	mvninstall	18m 55s	trunk passed
+1	compile	14m 45s	trunk passed
+1	checkstyle	3m 19s	trunk passed
+1	mvnsite	4m 38s	trunk passed
+1	shadedclient	19m 3s	branch has no errors when building and testing our client artifacts.
0	findbugs	0m 0s	Skipped patched modules with no Java source: hadoop-hdfs-project/hadoop-hdfs-native-client
+1	findbugs	7m 53s	trunk passed
+1	javadoc	3m 51s	trunk passed
			Patch Compile Tests
0	mvndep	0m 23s	Maven dependency ordering for patch
+1	mvninstall	3m 56s	the patch passed
+1	compile	14m 52s	the patch passed
+1	cc	14m 52s	the patch passed
-1	javac	14m 52s	root generated 196 new + 1294 unchanged - 196 fixed = 1490 total (was 1490)
-0	checkstyle	3m 50s	root: The patch generated 29 new + 2555 unchanged - 10 fixed = 2584 total (was 2565)
+1	mvnsite	4m 53s	the patch passed
-1	whitespace	0m 0s	The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
+1	xml	0m 2s	The patch has no ill-formed XML file.
+1	shadedclient	10m 50s	patch has no errors when building and testing our client artifacts.
0	findbugs	0m 0s	Skipped patched modules with no Java source: hadoop-hdfs-project/hadoop-hdfs-native-client
+1	findbugs	8m 24s	the patch passed
+1	javadoc	3m 48s	the patch passed
			Other Tests
+1	unit	8m 26s	hadoop-common in the patch passed.
+1	unit	1m 51s	hadoop-hdfs-client in the patch passed.
-1	unit	75m 2s	hadoop-hdfs in the patch failed.
+1	unit	6m 14s	hadoop-hdfs-native-client in the patch passed.
+1	unit	17m 35s	hadoop-hdfs-rbf in the patch passed.
-1	unit	87m 42s	hadoop-yarn-server-resourcemanager in the patch failed.
+1	asflicense	0m 42s	The patch does not generate ASF License warnings.
		317m 31s

Reason	Tests
Failed junit tests	hadoop.hdfs.web.TestWebHdfsTimeouts
	hadoop.hdfs.server.datanode.TestDirectoryScanner
	hadoop.hdfs.server.namenode.ha.TestBootstrapAliasmap
	hadoop.hdfs.server.namenode.TestNestedEncryptionZones

Subsystem	Report/Notes
Docker	Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f
JIRA Issue	HDFS-12943
JIRA Patch URL	https://issues.apache.org/jira/secure/attachment/12952748/HDFS-12943-003.patch
Optional Tests	dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle cc xml
uname	Linux 2f96ecadf91b 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	/testptch/patchprocess/precommit/personality/provided.sh
git revision	trunk / f82922d
maven	version: Apache Maven 3.3.9
Default Java	1.8.0_181
findbugs	v3.1.0-RC1
javac	https://builds.apache.org/job/PreCommit-HDFS-Build/25846/artifact/out/diff-compile-javac-root.txt
checkstyle	https://builds.apache.org/job/PreCommit-HDFS-Build/25846/artifact/out/diff-checkstyle-root.txt
whitespace	https://builds.apache.org/job/PreCommit-HDFS-Build/25846/artifact/out/whitespace-eol.txt
unit	https://builds.apache.org/job/PreCommit-HDFS-Build/25846/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
unit	https://builds.apache.org/job/PreCommit-HDFS-Build/25846/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
Test Results	https://builds.apache.org/job/PreCommit-HDFS-Build/25846/testReport/
Max. process+thread count	3626 (vs. ulimit of 10000)
modules	C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs-native-client hadoop-hdfs-project/hadoop-hdfs-rbf hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: .
Console output	https://builds.apache.org/job/PreCommit-HDFS-Build/25846/console
Powered by	Apache Yetus 0.8.0 http://yetus.apache.org

This message was automatically generated.

Konstantin Shvachko added a comment - 24/Dec/18 18:15

I just merged the HDFS-12943 branch to trunk. Thank you everybody for contributing.
Will keep this open for the last few outstanding sub-tasks.

xiangheng added a comment - 23/Jan/19 10:19

There are still some issues that have not been solved, which may affect the consistency of standby reads. Can we test the performance of standby reads in a real cluster environment now? And what should we focus on?

Zhe Zhang added a comment - 22/Feb/19 07:20 - edited

vagarychen has tested the current version of the feature on a real cluster, and can verify the aspects that have already been verified. I think weichiu has also done some tests.

Konstantin Shvachko added a comment - 01/Nov/19 00:27

Closing this as Fixed. The feature has been tested, back-ported down to 2.10, and released. The few remaining subtasks are being addressed as regular issues.
Added release notes. Please review in case I missed anything.

Thank you everybody for contributing to this effort.

Hudson added a comment - 01/Nov/19 03:49

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17592 (See https://builds.apache.org/job/Hadoop-trunk-Commit/17592/)
Add 2.10.0 release notes for HDFS-12943 (jhung: rev ef9d12df24c0db76fd37a95551db7920d27d740c)

(edit) hadoop-common-project/hadoop-common/src/site/markdown/release/2.10.0/RELEASENOTES.2.10.0.md

zhangkai added a comment - 14/Jan/20 10:26

When a client uses getBlockLocation against the observer node, the observer node fails to update the access time of the file.

So we forbid getBlockLocation on the observer for now.

Is there any other solution to deal with it?

Chen Liang added a comment - 14/Jan/20 17:32

lindy_hopper an access time update is a write call, so it cannot be processed by the Observer. Access time should be turned off on the Observer, as mentioned in HDFS-14959.

Konstantin Shvachko added a comment - 17/Jan/20 01:37

Hey lindy_hopper, yes, we currently recommend turning off access time updates on Observers, as vagarychen said.
We plan to bring aTime updates back with HDFS-15118. The Observer will bounce such getBlockLocation() calls to the Active, so that it can actually update the time.
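
The bounce described above can be sketched as a routing decision. This is a hypothetical model, not the HDFS-15118 patch: AccessTimeRouting, route, and the Target enum are names invented here. The staleness rule is also an assumption: an update is needed only when the stored access time is older than a configured precision, and a precision of 0 means updates are turned off.

```java
/** Hypothetical sketch: where a getBlockLocations-style read should be served. */
final class AccessTimeRouting {
    enum Target { OBSERVER, ACTIVE }

    /**
     * @param nowMs        current time in milliseconds
     * @param lastAccessMs access time currently stored for the file
     * @param precisionMs  assumed access-time precision; 0 models updates being turned off
     */
    static Target route(long nowMs, long lastAccessMs, long precisionMs) {
        // Updating the access time is a write, so only the Active can do it.
        boolean needsAtimeUpdate =
            precisionMs > 0 && nowMs > lastAccessMs + precisionMs;
        return needsAtimeUpdate ? Target.ACTIVE : Target.OBSERVER;
    }
}
```

With a one-hour precision, a read shortly after the last recorded access stays on the Observer, and only a read that would refresh a stale access time is sent to the Active. With precision 0, every such read stays on the Observer, which matches the current recommendation to turn access time updates off.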

People

Assignee: Konstantin Shvachko

Reporter: Konstantin Shvachko

Votes: 4

Watchers: 87

Dates

Created: 19/Dec/17 20:16

Updated: 20/Jan/21 01:56

Resolved: 01/Nov/19 00:27

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

21.5h
