Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Provide Hadoop metrics showing operational details of the S3Guard implementation.
The metrics will be implemented in this ticket:
- S3GuardRechecksNthPercentileLatency (MutableQuantiles): percentile time spent in rechecks attempting to achieve consistency, repeated for multiple percentile values of N. This metric is an indicator of the additional latency cost of running S3A with S3Guard.
- S3GuardRechecksNumOps (MutableQuantiles): number of times a consistency recheck was required while attempting to achieve consistency.
- S3GuardStoreNthPercentileLatency (MutableQuantiles): percentile time spent in operations against the consistent store, including both write operations during file system mutations and read operations during file system consistency checks, repeated for multiple percentile values of N. This metric is an indicator of the latency of the consistent store implementation.
- S3GuardConsistencyStoreNumOps (MutableQuantiles): number of operations against the consistent store, including both write operations during file system mutations and read operations during file system consistency checks.
- S3GuardConsistencyStoreFailures (MutableCounterLong): number of failures during operations against the consistent store implementation.
- S3GuardConsistencyStoreTimeouts (MutableCounterLong): number of timeouts during operations against the consistent store implementation.
- S3GuardInconsistencies (MutableCounterLong): count of times S3Guard failed to achieve consistency, even after exhausting all rechecks. A high count may indicate unexpected out-of-band modification of the S3 bucket contents, such as by an external tool that does not make corresponding updates to the consistent store.
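For orientation, here is a minimal, self-contained sketch of how metrics of these two types are typically created with the Hadoop metrics2 library. The registry and metric names below are illustrative placeholders, not the names this ticket will commit:

```java
import org.apache.hadoop.metrics2.lib.MetricsRegistry;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;
import org.apache.hadoop.metrics2.lib.MutableQuantiles;

public class S3GuardMetricsSketch {
  private final MetricsRegistry registry = new MetricsRegistry("S3Guard");

  // Quantiles roll over on an interval (here 60s) and publish
  // p50/p75/p90/p95/p99 values plus a NumOps count.
  private final MutableQuantiles storeLatency = registry.newQuantiles(
      "s3GuardStoreLatency", "time spent in consistent store operations",
      "ops", "latency", 60);

  private final MutableCounterLong storeFailures = registry.newCounter(
      "s3GuardConsistencyStoreFailures",
      "failures during consistent store operations", 0L);

  /** Record one store operation: its latency, and whether it failed. */
  public void recordStoreOperation(long latencyMicros, boolean failed) {
    storeLatency.add(latencyMicros);
    if (failed) {
      storeFailures.incr();
    }
  }
}
```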
Attachments
- HADOOP-13453-HADOOP-13345-005.patch (12 kB, Steve Loughran)
- HADOOP-13453-HADOOP-13345-004.patch (11 kB, Steve Loughran)
- HADOOP-13453-HADOOP-13345-003.patch (18 kB, Ai Deng)
- HADOOP-13453-HADOOP-13345-002.patch (13 kB, Ai Deng)
- HADOOP-13453-HADOOP-13345-001.patch (6 kB, Ai Deng)
Activity
- these should all go into org.apache.hadoop.fs.s3a.Statistic & Instrumentation
- tracking total & ongoing DynamoDB request rates could be useful, as it will help identify when you've over- or under-provisioned your DDB
- include stats on detected inconsistencies
- include them in S3AFileSystem.toString()
Hello stevel@apache.org. Actually, since all the metrics (in this story) are sent by the new S3Guard implementation, maybe it's better to separate this new metrics code from S3AInstrumentation. The current S3AInstrumentation is already 800 lines, and separating the two would make it easier to disable the S3Guard metrics. But I'm not sure how much code in S3AInstrumentation is reusable for the new metrics.
Also, I can't find any tests for S3AInstrumentation; how do we test these metrics in Hadoop?
Sorry for the basic question, I'm really new to working on the Hadoop code base.
They're going to have to go into that file, because those are the metrics published by the S3A filesystem when deployed, returned by S3AStorageStatistics in a call to S3AFileSystem.getStorageStatistics(), and printed in S3AFileSystem.toString(). We could choose whether to add the specific metrics to every S3A FS instance; that's something to consider. Listing the values but returning 0 for all gauges and counters is the most consistent.
Don't worry about the class length: if you look at it in detail, there are two nested classes plus support methods explicitly for output/input streams; you don't need to go there. The rest of the code is fairly simple.
- add new values to org.apache.hadoop.fs.s3a.Statistic, prefixed s3guard_
- in S3AInstrumentation, add counters to the array COUNTERS_TO_CREATE and gauges to GAUGES_TO_CREATE
- pass an instance of the instrumentation down to S3Guard
- have the code call incrementCounter and increment/decrementGauge as appropriate
- I'd like simple counters of s3guard_enabled and s3guard_authoritative, which will be 0 when there's no S3Guard running and 1 when the respective booleans are set. Why? Remote visibility. (A sketch of the enum pattern follows this list.)
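For illustration, a self-contained mini version of that enum pattern; the constructor shape mirrors org.apache.hadoop.fs.s3a.Statistic, but the s3guard_ entries and descriptions below are placeholders, not final names:

```java
/** Sketch of the Statistic pattern; entries here are placeholders. */
enum StatisticSketch {
  S3GUARD_METADATASTORE_INITIALIZATION("s3guard_metadatastore_initialization",
      "S3Guard metadata store initialized"),
  S3GUARD_METADATASTORE_PUT_PATH("s3guard_metadatastore_put_path",
      "S3Guard metadata store put path request");

  private final String symbol;
  private final String description;

  StatisticSketch(String symbol, String description) {
    this.symbol = symbol;
    this.description = description;
  }

  public String getSymbol() {
    return symbol;
  }

  public String getDescription() {
    return description;
  }
}
```

In S3AInstrumentation the new entries would then be appended to COUNTERS_TO_CREATE (or GAUGES_TO_CREATE) so they are registered when the filesystem instance starts up.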
You make a good point: "where are the tests?". The answer is: the metrics can be used to test the internal state of the S3A classes, and therefore become implicitly tested there.
Take a look at ITestS3ADirectoryPerformance for a key example of this: our test cases use the counters of the various HTTP operations as the means to verify that API calls work as expected. (Note that S3Guard, by reducing these, has complicated the tests.)
That is, you verify the counters work by asserting that they change as you run operations against the filesystem; see http://steveloughran.blogspot.co.uk/2016/04/distributed-testing-making-use-of.html for more of my thinking here.
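A hedged sketch of that pattern, modelled on the hadoop-aws integration tests; it assumes the S3ATestUtils.MetricDiff helper and AbstractS3ATestBase fixture behave as described, and the test class name is made up:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.s3a.S3AFileSystem;
import org.apache.hadoop.fs.s3a.S3ATestUtils.MetricDiff;
import org.apache.hadoop.fs.s3a.Statistic;
import org.junit.Test;

public class ITestMetricAssertionSketch extends AbstractS3ATestBase {

  @Test
  public void testListStatusIssuesOneListRequest() throws Exception {
    S3AFileSystem fs = getFileSystem();
    Path dir = path("testListStatusIssuesOneListRequest");
    fs.mkdirs(dir);
    // Snapshot the counter, perform the operation, then assert on the delta.
    MetricDiff listRequests =
        new MetricDiff(fs, Statistic.OBJECT_LIST_REQUESTS);
    fs.listStatus(dir);
    listRequests.assertDiffEquals(1);
  }
}
```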
> Sorry for the basic question, I'm really new to working on the Hadoop code base.
Happy to explain my reasoning. We've all started off staring at a vast amount of code that we don't understand; there are still big bits of Hadoop that I don't go near.
steve_l Thank you very much for the explanation, that's very helpful.
I have two questions for the moment; I'm sure more will come.
- I see two patterns for changing a counter value in S3AInstrumentation: a dedicated method like fileCreated(), or passing a Statistic to the generic incrementCounter(). Is there a reason we keep both? It looks like you suggest the second approach.
- I can't find any usage of S3AFileSystem.getStorageStatistics() in the project. What is the main purpose of these statistics? Are they for use outside of Hadoop? Do I need to pass an instance of storageStatistics to S3Guard? In S3AFileSystem, we always increment both:
protected void incrementStatistic(Statistic statistic, long count) {
  instrumentation.incrementCounter(statistic, count);
  storageStatistics.incrementCounter(statistic, count);
}
Hi, don't worry about asking questions; we'll do our best to get you contributing code. It benefits all of us if you are adding code to Hadoop.
The split between the low-level "increment a named counter" calls and the more elegant "event with internal counters" methods? The event ones are cleaner, as they stop the rest of the code having to know exactly which counters/gauges to use. Consider the elegant ones the best approach, and the direct invocations us being lazy.
The S3AInstrumentation class also has a set of explicitly named counters ("filesDeleted") as well as lots that are only listed in the arrays GAUGES_TO_CREATE and COUNTERS_TO_CREATE. That's evolution over time: I got bored of having to name and register lots of fields, and realised I could do it from the arrays, at the cost of a hash lookup on every increment.
Outside the S3A class itself, I've tried to have inner classes do the counting, with the results merged in at the end (example: the input and output streams), with the inner classes using simple long values rather than atomics. Why? It eliminates any delays during increments, and lets us override the toString() values for input/output streams with dumps of the values (go on, try it!). We can have many input/output streams per FS instance, so the risk of contention on atomic int/long values is potentially quite high.
I think for S3Guard we could add a new inner class passed in to each S3Guard instance; it would export the various methods for events that S3Guard could raise, such as tableCreated() and tableDeleted(). These can directly increment the atomic counters in the instrumentation, as we'd only have a 1:1 mapping of S3A FS instance to S3Guard store instance. (A sketch is below.)
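A minimal sketch of that inner-class idea, assuming plain atomic counters; the class and method names here are illustrative, not a committed API:

```java
import java.util.concurrent.atomic.AtomicLong;

public class S3AInstrumentationSketch {
  private final AtomicLong s3guardTablesCreated = new AtomicLong();
  private final AtomicLong s3guardTablesDeleted = new AtomicLong();

  /** Per-store instrumentation: one instance per S3A FS / S3Guard store pair. */
  public final class S3GuardInstrumentation {
    /** The store created its table. */
    public void tableCreated() {
      s3guardTablesCreated.incrementAndGet();
    }

    /** The store deleted its table. */
    public void tableDeleted() {
      s3guardTablesDeleted.incrementAndGet();
    }
  }

  /** Handed to the metadata store during initialization. */
  public S3GuardInstrumentation newS3GuardInstrumentation() {
    return new S3GuardInstrumentation();
  }
}
```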
Regarding access to the statistics: that's hooked up to FileSystem.getStorageStatistics(), which is intended to provide the storage stats for any FS; S3A and HDFS share common statistic names for the common statistics. The latest versions of Tez do collect the statistics of jobs, and so give you the aggregate statistics across your entire query. Until now, only FileSystem.getStatistics() has been used, which returns a fixed set of values (bytes read/written, etc.). Spark still only collects those; it'd take a migration to Hadoop 2.8+ to pick up the new data. Until then, it's something we can use in tests.
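For illustration, here is how a client can walk those statistics through the public Hadoop 2.8+ API; a hedged, FS-agnostic example rather than anything S3A-specific:

```java
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.StorageStatistics;
import org.apache.hadoop.fs.StorageStatistics.LongStatistic;

public class DumpStorageStatistics {
  public static void main(String[] args) throws Exception {
    // args[0]: any filesystem URI, e.g. s3a://bucket/ or hdfs://namenode/
    FileSystem fs = FileSystem.get(new Path(args[0]).toUri(),
        new Configuration());
    StorageStatistics stats = fs.getStorageStatistics();
    // Iterate over every named long statistic the filesystem publishes.
    for (Iterator<LongStatistic> it = stats.getLongStatistics();
        it.hasNext();) {
      LongStatistic s = it.next();
      System.out.println(s.getName() + " = " + s.getValue());
    }
  }
}
```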
Steve, thank you for sharing this knowledge and your thoughts. It's a good idea to have an inner class for the S3Guard metrics.
I have made a small start with all your help, but I will be on holiday for the next two weeks (back to China for the new year). I really hope I can resolve this ticket (I can work more quickly on it after the holiday), but if the timing doesn't match the plan for HADOOP-13345, please assign this ticket to someone else so it can be finished in time.
I will try to catch up with you from China.
Don't worry about being on holiday for the next few weeks; a fair few people have gone off to enjoy themselves. Take a break from your emails and enjoy yourself!
I will comment as I see new places in the code that could use metrics:
- HADOOP-13904 (if it gets committed): in DynamoDBMetadataStore#retryBackoff()
fabbri Thanks.
stevel@apache.org I just made a simple change; could you please check it (WIP patch)? Just to make sure I'm going about things the right way. Thanks.
Thanks. I've hit the submit-patch button, but Jenkins will fail, as the patch isn't going to apply.
The way Hadoop builds work, you need to include the branch name if it's not trunk; here:
HADOOP-13453-HADOOP-13345-001.patch
Try that with the existing code and you should have the machines review the patch (which we rely on to do the basic checks).
For object store tests we also require the submitter to declare which S3 infrastructure they tested against, because Jenkins doesn't run those tests. Here are the details: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md
Patch-wise, the base design looks good: you've added a new statistic and you're passing the instrumentation down to update it.
We just need to think of all the statistics to collect.
Thanks for doing this.
-1 overall
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 0s | Docker mode activated. |
-1 | patch | 0m 7s | |
Subsystem | Report/Notes |
---|---|
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12853617/HADOOP-13453.wip-01.patch |
Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/11689/console |
Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
-1 overall
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 17s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
-1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
+1 | mvninstall | 13m 5s | |
+1 | compile | 0m 24s | |
+1 | checkstyle | 0m 16s | |
+1 | mvnsite | 0m 27s | |
+1 | mvneclipse | 0m 22s | |
+1 | findbugs | 0m 33s | |
+1 | javadoc | 0m 16s | |
+1 | mvninstall | 0m 23s | the patch passed |
+1 | compile | 0m 19s | the patch passed |
+1 | javac | 0m 19s | the patch passed |
+1 | checkstyle | 0m 12s | the patch passed |
+1 | mvnsite | 0m 22s | the patch passed |
+1 | mvneclipse | 0m 11s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | findbugs | 0m 36s | the patch passed |
+1 | javadoc | 0m 12s | the patch passed |
+1 | unit | 0m 36s | hadoop-aws in the patch passed. |
+1 | asflicense | 0m 16s | The patch does not generate ASF License warnings. |
Total: 20m 9s
Subsystem | Report/Notes |
---|---|
Docker | Image:yetus/hadoop:a9ad5d6 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12854675/HADOOP-13453-HADOOP-13345-001.patch |
Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
uname | Linux 929c7a818845 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
git revision | |
Default Java | 1.8.0_121 |
findbugs | v3.0.0 |
Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/11717/testReport/ |
modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/11717/console |
Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
-1 overall
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 17s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
-1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
+1 | mvninstall | 14m 1s | |
+1 | compile | 0m 19s | |
+1 | checkstyle | 0m 15s | |
+1 | mvnsite | 0m 21s | |
+1 | mvneclipse | 0m 22s | |
+1 | findbugs | 0m 28s | |
+1 | javadoc | 0m 14s | |
+1 | mvninstall | 0m 17s | the patch passed |
+1 | compile | 0m 17s | the patch passed |
+1 | javac | 0m 17s | the patch passed |
+1 | checkstyle | 0m 11s | the patch passed |
+1 | mvnsite | 0m 18s | the patch passed |
+1 | mvneclipse | 0m 11s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | findbugs | 0m 32s | the patch passed |
+1 | javadoc | 0m 11s | the patch passed |
+1 | unit | 0m 35s | hadoop-aws in the patch passed. |
+1 | asflicense | 0m 17s | The patch does not generate ASF License warnings. |
Total: 20m 25s
Subsystem | Report/Notes |
---|---|
Docker | Image:yetus/hadoop:a9ad5d6 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12854676/HADOOP-13453-HADOOP-13345-002.patch |
Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
uname | Linux 72388b8af098 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
git revision | |
Default Java | 1.8.0_121 |
findbugs | v3.0.0 |
Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/11718/testReport/ |
modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/11718/console |
Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
Hi stevel@apache.org, thank you for the information. I will keep adding more metrics (still at the S3Guard class level).
Regarding the metrics mentioned in the document (I have copied it into the JIRA ticket): as we consider the metadata in the store to be "fresher" and use it first, we don't do any rechecks (for inconsistencies between S3 and the metadata store), right? So the metrics S3GuardRechecksNthPercentileLatency, S3GuardRechecksNumOps and S3GuardInconsistencies are no longer needed.
For the Jenkins build, I can't find the "Submit" button in JIRA; is that because of my user permissions, or am I missing something?
Looks like Jenkins runs automatically for the patch. I will modify the existing S3Guard test scenarios to cover the added metrics.
I'm afraid HADOOP-13914 has just broken the patch, which means, sadly, you get to do the merge. Let's get this in before anything else traumatic comes in, so other patches get to suffer next time.
I like what you've done measuring latency as well as counts; I think we could actually do this more broadly. The timing code should be in a finally clause though, so timings for failures get included too. (Side issue: count successes and failures separately, with different timings?)
I would like to think about how we could avoid having to pass the instrumentation around all the time. Ideally, we could just pass it in as a constructor parameter to the metadata store. Alternatively, that store could collect metrics and we could wire it up, but I don't see an easy way to do that in Hadoop metrics (compared to Coda Hale's). The easiest would be just to pass the S3AInstrumentation (or an inner class) down, but currently the metastore interface is not specific to S3A.
If we add an interface for metadata store instrumentation, then S3AInstrumentation can implement it in an inner class, and the S3A FS can pass it down during initialization. This would let the metastore do all it wants, with well-defined strings, of course. (A sketch follows below.)
What do people think?
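A minimal sketch of both suggestions (an S3A-independent interface, plus timing in a finally block so failures are counted); every name below is hypothetical:

```java
import java.io.IOException;

/** Hypothetical instrumentation hooks, independent of S3A internals. */
interface MetadataStoreInstrumentation {
  void putStarted();
  void putFinished(long elapsedNanos);
}

/** Sketch of a metadata store timing an operation in a finally block. */
abstract class TimedMetadataStore {
  private final MetadataStoreInstrumentation instrumentation;

  TimedMetadataStore(MetadataStoreInstrumentation instrumentation) {
    // Passed in at construction, as suggested above.
    this.instrumentation = instrumentation;
  }

  final void put(Object pathMetadata) throws IOException {
    instrumentation.putStarted();
    long start = System.nanoTime();
    try {
      doPut(pathMetadata);
    } finally {
      // Runs on success *and* failure, so error latencies are recorded too.
      instrumentation.putFinished(System.nanoTime() - start);
    }
  }

  protected abstract void doPut(Object pathMetadata) throws IOException;
}
```

S3AInstrumentation could then implement the interface in an inner class and hand it to the store during initialization.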
Hi stevel@apache.org, I have added a new patch following your suggestion. If it is OK, can we discuss the metrics we want to add?
I came up with this list of operation and latency metrics for this ticket; can you check if I missed anything? Thank you.
Status:
S3GUARD_METADATASTORE_ENABLED
S3GUARD_METADATASTORE_IS_AUTHORITATIVE
Operations:
S3GUARD_METADATASTORE_INITIALIZATION
S3GUARD_METADATASTORE_DELETE_PATH
S3GUARD_METADATASTORE_DELETE_PATH_LATENCY
S3GUARD_METADATASTORE_DELETE_SUBTREE_PATH
S3GUARD_METADATASTORE_GET_PATH
S3GUARD_METADATASTORE_GET_PATH_LATENCY
S3GUARD_METADATASTORE_GET_CHILDREN_PATH
S3GUARD_METADATASTORE_GET_CHILDREN_PATH_LATENCY
S3GUARD_METADATASTORE_MOVE_PATH
S3GUARD_METADATASTORE_PUT_PATH
S3GUARD_METADATASTORE_PUT_PATH_LATENCY
S3GUARD_METADATASTORE_CLOSE
S3GUARD_METADATASTORE_DESTROY
From S3Guard:
S3GUARD_METADATASTORE_MERGE_DIRECTORY
For the failures:
S3GUARD_METADATASTORE_DELETE_FAILURE
S3GUARD_METADATASTORE_GET_FAILURE
S3GUARD_METADATASTORE_PUT_FAILURE
Etc:
S3GUARD_METADATASTORE_PUT_RETRY_TIMES
I think measuring the number of paths operated on (put, get, ...) in the MetadataStore could also be interesting: the end user could see how much of their S3 file system is being managed by S3Guard.
-1 overall
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 21s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
-1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
+1 | mvninstall | 12m 34s | |
+1 | compile | 0m 20s | |
+1 | checkstyle | 0m 15s | |
+1 | mvnsite | 0m 22s | |
+1 | mvneclipse | 0m 14s | |
+1 | findbugs | 0m 27s | |
+1 | javadoc | 0m 15s | |
+1 | mvninstall | 0m 18s | the patch passed |
+1 | compile | 0m 18s | the patch passed |
+1 | javac | 0m 18s | the patch passed |
-0 | checkstyle | 0m 12s | hadoop-tools/hadoop-aws: The patch generated 19 new + 25 unchanged - 0 fixed = 44 total (was 25) |
+1 | mvnsite | 0m 20s | the patch passed |
+1 | mvneclipse | 0m 11s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | findbugs | 0m 32s | the patch passed |
+1 | javadoc | 0m 11s | the patch passed |
+1 | unit | 0m 36s | hadoop-aws in the patch passed. |
+1 | asflicense | 0m 16s | The patch does not generate ASF License warnings. |
Total: 18m 59s
Subsystem | Report/Notes |
---|---|
Docker | Image:yetus/hadoop:a9ad5d6 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12859449/HADOOP-13453-HADOOP-13345-003.patch |
Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
uname | Linux bd705d5c15d7 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
git revision | |
Default Java | 1.8.0_121 |
findbugs | v3.0.0 |
checkstyle | https://builds.apache.org/job/PreCommit-HADOOP-Build/11851/artifact/patchprocess/diff-checkstyle-hadoop-tools_hadoop-aws.txt |
Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/11851/testReport/ |
modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/11851/console |
Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
I see your point about quantile config: I don't know what to do there.
- Be careful with those imports: S3AInstrumentation had all its imports expanded and moved around. Import sections are one of the main merge-conflict areas, so it's critical to keep changes there to a minimum. We normally turn off any IDE automatic features to avoid this.
- I've moved off the split of a separate interface and implementation for the instrumentation; everything is closely coupled enough that we don't need to abstract things away.
I'm doing a patch with my changes; if everything is happy and I can run a full integration test suite (HADOOP-14216 has broken this), then I'll +1 it. Once it is in, we can expand the metrics.
Patch 003 with some tuning.
Something is wrong with my IDE (IntelliJ IDEA 2017.1): it is reorganising imports without warning. I think I've fixed it here.
Tested: none. Someone has gone and broken XInclude
Patch 005; accidentally lost an import while cleaning up IDE import games.
Tested: S3A Ireland, with the options -Dparallel-tests -DtestsThreadCount=8 -Ddynamo. All well; best test run for ages. Hopefully that means DDB is fixing that intermittent root contract test failure.
-1 overall
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 18s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
-1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
+1 | mvninstall | 16m 0s | |
+1 | compile | 0m 24s | |
+1 | checkstyle | 0m 19s | |
+1 | mvnsite | 0m 27s | |
+1 | mvneclipse | 0m 30s | |
+1 | findbugs | 0m 36s | |
+1 | javadoc | 0m 16s | |
+1 | mvninstall | 0m 24s | the patch passed |
+1 | compile | 0m 22s | the patch passed |
+1 | javac | 0m 22s | the patch passed |
-0 | checkstyle | 0m 13s | hadoop-tools/hadoop-aws: The patch generated 1 new + 38 unchanged - 0 fixed = 39 total (was 38) |
+1 | mvnsite | 0m 23s | the patch passed |
+1 | mvneclipse | 0m 14s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | findbugs | 0m 40s | the patch passed |
+1 | javadoc | 0m 13s | the patch passed |
+1 | unit | 0m 40s | hadoop-aws in the patch passed. |
+1 | asflicense | 0m 30s | The patch does not generate ASF License warnings. |
Total: 23m 57s
Subsystem | Report/Notes |
---|---|
Docker | Image:yetus/hadoop:612578f |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12863312/HADOOP-13453-HADOOP-13345-005.patch |
Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
uname | Linux 40fb9ba1fa78 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
git revision | |
Default Java | 1.8.0_121 |
findbugs | v3.0.0 |
checkstyle | https://builds.apache.org/job/PreCommit-HADOOP-Build/12097/artifact/patchprocess/diff-checkstyle-hadoop-tools_hadoop-aws.txt |
Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/12097/testReport/ |
modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/12097/console |
Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
Hi steve_l, I'm happy with patch 5; should we push it to the branch and start to add more metrics?
OK, it's in; thanks for this.
Now we have to think about what extra things to measure...
Cool, I listed my suggestions for the metrics in a previous comment; what are your thoughts? Let's decide the list first.
Why not take that list, create a new JIRA off HADOOP-13345, "add more s3guard metrics", and suggest those as the start?
One interesting thing to see if we could detect would be mismatches between S3Guard and the underlying object store: if we can observe inconsistencies (how?), then that should be measured. The S3mper blog posts look at how Netflix detected consistency issues in S3 that way.