SQOOP-2906: Optimization of AvroUtil.toAvroIdentifier

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.7
    • Component/s: None
    • Flags:
      Patch

      Description

      Hi all

      Our distributed profiler indicated some inefficiencies in the AvroUtil.toAvroIdentifier method; specifically, in its use of regex patterns. This can be observed directly in the FlameGraph generated by the profiler (https://jhermans.web.cern.ch/jhermans/sqoop_avro_flamegraph.svg). We implemented an optimization and compared it against the original method. On our testing machine, the optimized method by itself is about 500% (on average) more efficient than the original implementation. We have yet to test how this optimization will influence the performance of user jobs.

      Any suggestions or remarks are welcome.

      Kind regards,

      Joeri

      https://github.com/apache/sqoop/pull/18

      Writeup:

      https://db-blog.web.cern.ch/blog/joeri-hermans/2016-04-hadoop-performance-troubleshooting-stack-tracing-introduction
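For reference, the two implementations being compared can be sketched as follows. The method bodies follow the diff attached to the pull request; the class name and the fallback branches for identifiers that do not start with a letter or underscore are illustrative assumptions, since the attached diff is truncated at that point.

```java
// Sketch of the regex-based original vs. the char-array rewrite discussed
// in this issue. Method bodies follow the attached diff; the class name and
// the trailing fallback branches are illustrative assumptions (the diff in
// this thread is truncated before the else branch).
public class AvroIdentifierSketch {

  // Original: a regex replace plus a regex match on every call.
  static String toAvroIdentifierRegex(String candidate) {
    String formattedCandidate = candidate.replaceAll("\\W+", "_");
    if (formattedCandidate.substring(0, 1).matches("[a-zA-Z_]")) {
      return formattedCandidate;
    }
    return "_" + formattedCandidate; // assumed fallback, not from the diff
  }

  // Optimized: a single pass over a char array, one final copy.
  static String toAvroIdentifierFast(String candidate) {
    char[] data = candidate.toCharArray();
    int stringIndex = 0;
    for (char c : data) {
      if (Character.isLetterOrDigit(c) || c == '_') {
        data[stringIndex++] = c;
      }
    }
    char initial = data[0];
    if (Character.isLetter(initial) || initial == '_') {
      return new String(data, 0, stringIndex);
    }
    return "_" + new String(data, 0, stringIndex); // assumed fallback
  }

  public static void main(String[] args) {
    System.out.println(toAvroIdentifierRegex("my-column name"));
    System.out.println(toAvroIdentifierFast("my-column name"));
  }
}
```

Note that the two variants are not byte-for-byte equivalent: the regex version replaces runs of non-word characters with `_`, while the char-array version drops them.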

      1. diff.txt
        1 kB
        Joeri Hermans

        Issue Links

          Activity

          hudson Hudson added a comment -

          FAILURE: Integrated in Sqoop-hadoop20 #1051 (See https://builds.apache.org/job/Sqoop-hadoop20/1051/)
          SQOOP-2906: Optimization of AvroUtil.toAvroIdentifier (jarcec: https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=5779aec031d4bb0bccd79329923e5dbad3786280)

          • src/java/org/apache/sqoop/avro/AvroUtil.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Sqoop-hadoop200 #1055 (See https://builds.apache.org/job/Sqoop-hadoop200/1055/)
          SQOOP-2906: Optimization of AvroUtil.toAvroIdentifier (jarcec: https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=5779aec031d4bb0bccd79329923e5dbad3786280)

          • src/java/org/apache/sqoop/avro/AvroUtil.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Sqoop-hadoop100 #1015 (See https://builds.apache.org/job/Sqoop-hadoop100/1015/)
          SQOOP-2906: Optimization of AvroUtil.toAvroIdentifier (jarcec: https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=5779aec031d4bb0bccd79329923e5dbad3786280)

          • src/java/org/apache/sqoop/avro/AvroUtil.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Sqoop-hadoop23 #1253 (See https://builds.apache.org/job/Sqoop-hadoop23/1253/)
          SQOOP-2906: Optimization of AvroUtil.toAvroIdentifier (jarcec: https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=5779aec031d4bb0bccd79329923e5dbad3786280)

          • src/java/org/apache/sqoop/avro/AvroUtil.java
          jarcec Jarek Jarcec Cecho added a comment -

          Thank you for your contribution Joeri Hermans and for your review Attila Szabo!

          jira-bot ASF subversion and git services added a comment -

          Commit 5779aec031d4bb0bccd79329923e5dbad3786280 in sqoop's branch refs/heads/trunk from Jarek Jarcec Cecho
          [ https://git-wip-us.apache.org/repos/asf?p=sqoop.git;h=5779aec ]

          SQOOP-2906: Optimization of AvroUtil.toAvroIdentifier

          (Joeri Hermans via Jarek Jarcec Cecho)

          jira-bot ASF subversion and git services added a comment -

          Commit ac217a032c0755edac713ff53b3f55b7f2f46706 in sqoop's branch refs/heads/trunk from Jarek Jarcec Cecho
          [ https://git-wip-us.apache.org/repos/asf?p=sqoop.git;h=ac217a0 ]

          Revert "SQOOP-2920: sqoop performance deteriorates significantly on wide datasets; sqoop 100% on cpu"

          I've mistakenly committed SQOOP-2920 and SQOOP-2906 inside this commit,
          so I'll revert it and commit them separately.

          joeri.hermans Joeri Hermans added a comment -

          No problem! Any contributions and suggestions are more than welcome. If you still have issues running it, feel free to send me an e-mail.

          maugli Attila Szabo added a comment -

          First of all,

          Joeri, thank you so much for your help. It's much clearer now! If you would be interested in having a +1 contributor, I would be more than glad to provide pull requests for this project in my free time (e.g. a direct CM investigation solution).

          The other thing:
          After running a few tests, I'm quite sure that the current version Joeri has provided works quite well and would provide the required performance boost. It is possible that later we will be able to identify some synergies between his changes and the changes related to the class writer, but I do think we should release it right now.

          Jarek Jarcec Cecho, Abraham Fine,
          Could you please also review the changeset and provide your feedback, and if it also looks alright on your side, put it into upstream?

          Thanks in advance!

          joeri.hermans Joeri Hermans added a comment -

          Hi Attila,

          It is probably partly my fault as well; I'm still working on it. I've added a build script to prepare the dependencies.

          These are actually very good questions! Is it OK if I include these in the readme?

          > How did you ensure that the extra parameters are passed to the mappers and the Sqoop1 cmd line tool?
          sqoop import -Dyarn.app.mapreduce.am.env="JAVA_HOME=/usr/lib/jvm/jdk1.8.0_60/jre" -Dmapreduce.map.java.opts="-XX:+PreserveFramePointer -XX:InlineSmallCode=200" -Dmapreduce.map.env="JAVA_HOME=/usr/lib/jvm/jdk1.8.0_60/jre" --connect [server] --username [username] --target-dir [hdfs dir] --table [table] -P -m [number of mappers]

          > In case of -c which server/service IP is needed there and in what kind of format?
          In the case of YARN, you will have to specify the REST address in order to fetch the nodes: http://[namenode]:8088/ws/v1/cluster/nodes

          > Is it enough to use only with -h?
          Yes, but the sampling frequency will default to 99 Hz, with a sampling duration of 5 seconds. These default values can be changed by specifying the corresponding parameters, or by editing hprofiler.sh.

          > If I don't wanna use SSH keys is it a valid solution if I type my root password all the time when it executes SSH?
          I think this might work, though I'm not sure, because the processes are initiated in parallel, so I don't know how stdin will handle this. However, you can create a file with your password, then cat the file and pipe the password to the stdin of the process. You can do this by editing src/host_executor.sh.

          > Should it work just out of box ( e.g. defining -h -j -f -t ) or is any postprocess needed?
          No post-processing is required. However, interpreting the results might be troublesome.

          > Did you use a hadoop distribution like CDH or just pure hdfs and sqoop1?
          Yes, we use Sqoop 1.4.6 on CDH 5.5.1.

          I hope this helps; if you still have any issues, feel free to contact me! I'm definitely willing to improve this tool.

          Kind regards,

          Joeri

          maugli Attila Szabo added a comment -

          Dear Joeri,

          I'd like to run a set of measurements on my own (to check whether your solution and mine on this issue are synergistic or not, and to check for further improvements).

          I found this hprofiler quite exciting; however, I had difficulties using it (my bad).
          Could you help with the following questions?

          • How did you ensure that the extra parameters are passed to the mappers and the Sqoop1 cmd line tool?
          • In case of -c, which server/service IP is needed there, and in what kind of format?
          • Is it enough to use only with -h?
          • If I don't want to use SSH keys, is it a valid solution if I type my root password every time it executes SSH?
          • Should it work just out of the box (e.g. defining -h -j -f -t), or is any post-processing needed?
          • Did you use a Hadoop distribution like CDH, or just pure HDFS and Sqoop1?

          Many thanks in advance!
          Maugli

          joeri.hermans Joeri Hermans added a comment -

          Hi Attila

          I'm definitely not an expert on Sqoop internals, but architectural changes are definitely better! I'm just publishing the results I obtained through profiling, and providing a fix. Of course, if we implement your suggestion, it would definitely reduce CPU consumption in general. However, for now, we could patch this first (to get the performance increase already), and then look at what needs to be done to correctly implement your idea(s).

          Kind regards,

          Joeri

          maugli Attila Szabo added a comment -

          Hi Joeri,
          I've joined the Sqoop community only a few weeks ago, so maybe I don't see all of the pitfalls, but let me raise a few suggestions/concerns:
          Your fix seems to be okay, but I would suggest a few more changes processing-wise:
          First of all, I would not do the conversion for all of the column names, but rather create a Map<String, String> containing the "original" vs. "converted" names; in most cases we would then just have to look up the name in O(1) time rather than doing the conversion every time (even if it is now much faster and cheaper).
          I would also not convert those entry.getKey() values that get a hit in the schema (schema.getField returns non-null), as in that case they're valid values; but maybe this optimization is negligible if you implement the first proposal.
          I was also considering doing the mapping in advance, before the import (after we've got the DB metadata and the Avro schema), but for non-RDBMS systems it might cause problems (e.g. different sets of columns for each row), so I'm not sure that would help; still, from an algorithmic/clean-code point of view it would be the cleanest solution, if possible.
          Would you tell me what you think about these suggestions?
          My 2 cents,
          Attila (Maugli)
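The first proposal above could be sketched roughly as follows. This is an illustrative helper, not part of the actual patch; the class name and the use of ConcurrentHashMap are assumptions, and it reuses the single-pass conversion from the pull request (leading-character handling omitted for brevity).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Rough sketch of the proposed memoization: cache "original" -> "converted"
// names so repeated column names are resolved with an O(1) map lookup
// instead of re-running the conversion for every record.
public class AvroIdentifierCache {
  private final Map<String, String> cache = new ConcurrentHashMap<>();

  // Single-pass conversion, as in the pull request's char-array version
  // (leading-character handling omitted for brevity).
  private static String convert(String candidate) {
    char[] data = candidate.toCharArray();
    int stringIndex = 0;
    for (char c : data) {
      if (Character.isLetterOrDigit(c) || c == '_') {
        data[stringIndex++] = c;
      }
    }
    return new String(data, 0, stringIndex);
  }

  public String toAvroIdentifier(String candidate) {
    // computeIfAbsent runs the conversion only on the first occurrence;
    // subsequent rows with the same column name hit the cache.
    return cache.computeIfAbsent(candidate, AvroIdentifierCache::convert);
  }
}
```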

          joeri.hermans Joeri Hermans added a comment -

          Hey Jarek Jarcec Cecho, I published the patch some time ago: https://reviews.apache.org/r/46557/

          jarcec Jarek Jarcec Cecho added a comment -

          Could you upload the patch to the review board, Joeri Hermans? I have some comments that will be better shared directly in the code.

          joeri.hermans Joeri Hermans added a comment -

          git diff HEAD

          jarcec Jarek Jarcec Cecho added a comment -

          Switching to "Patch available" to denote that this is ready for review.

          jarcec Jarek Jarcec Cecho added a comment -

          Hi Joeri Hermans, thank you very much for picking up this improvement. Sadly, the Sqoop project currently does not accept GitHub pull requests. Could you please create a text patch (a diff between the current HEAD and your latest changes) and attach it to the JIRA? Please also upload the patch to the review board.

          joeri.hermans Joeri Hermans added a comment -

          Blog post with an analysis as proof for the patch: https://db-blog.web.cern.ch/blog/joeri-hermans/2016-04-hadoop-performance-troubleshooting-stack-tracing-introduction

          Kind regards,

          Joeri
          joeri.hermans Joeri Hermans added a comment -

          Hi all

          We tested a custom JAR with this method on our cluster, and we observed a 200-300% increase in throughput (writing to Parquet), depending on the load of the cluster.

          Kind regards,

          Joeri

          githubbot ASF GitHub Bot added a comment -

          Github user JoeriHermans commented on the pull request:

          https://github.com/apache/sqoop/pull/18#issuecomment-208859666

          @stanleyxu2005 The main difference in my implementation is that I only have to do a single copy. StringBuilders are inefficient here; char arrays are a lot faster since everything is pre-allocated already. Furthermore, the overhead of constantly allocating a new object, and the fact that internally the StringBuilder will do some copying as well, make this an inefficient approach.

          For example, I implemented your approach: instead of a 500% improvement with our method, I get a 230% increase in performance with the StringBuilder approach. So the char array is definitely a lot faster, and that is what it is all about, since this is a very prominent function on the stack.

          Kind regards,

          Joeri
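The comparison described above could be reproduced roughly with a harness like the one below. This is an illustrative sketch, not the methodology actually used in this thread: it is a naive loop-based timing (no JMH, no warm-up control), and the class and method names are assumptions, so treat the numbers as indicative at best.

```java
// Naive micro-benchmark contrasting the char-array filter with the
// StringBuilder variant from the review. Treat the timings as rough
// indications only; a rigorous comparison should use JMH.
public class IdentifierBench {
  static String charArrayVariant(String candidate) {
    char[] data = candidate.toCharArray();
    int n = 0;
    for (char c : data) {
      if (Character.isLetterOrDigit(c) || c == '_') {
        data[n++] = c;
      }
    }
    return new String(data, 0, n); // single final copy
  }

  static String builderVariant(String candidate) {
    StringBuilder sb = new StringBuilder(candidate.length());
    for (char c : candidate.toCharArray()) {
      if (Character.isLetterOrDigit(c) || c == '_') {
        sb.append(c);
      }
    }
    return sb.toString(); // toString() makes a final copy of the buffer
  }

  public static void main(String[] args) {
    String input = "some-column name#1";
    long t0 = System.nanoTime();
    for (int i = 0; i < 1_000_000; i++) charArrayVariant(input);
    long t1 = System.nanoTime();
    for (int i = 0; i < 1_000_000; i++) builderVariant(input);
    long t2 = System.nanoTime();
    System.out.printf("char[]: %d ms, StringBuilder: %d ms%n",
        (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
  }
}
```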

          githubbot ASF GitHub Bot added a comment -

          Github user stanleyxu2005 commented on a diff in the pull request:

          https://github.com/apache/sqoop/pull/18#discussion_r59360059

          — Diff: src/java/org/apache/sqoop/avro/AvroUtil.java —

          ```
          @@ -114,11 +114,20 @@ public static String toAvroColumn(String column) {
              * Format candidate to avro specifics
              */
             public static String toAvroIdentifier(String candidate) {
          -    String formattedCandidate = candidate.replaceAll("\\W+", "_");
          -    if (formattedCandidate.substring(0,1).matches("[a-zA-Z_]")) {
          -      return formattedCandidate;
          +    char[] data = candidate.toCharArray();
          +    int stringIndex = 0;
          +
          +    for (char c : data) {
          +      if (Character.isLetterOrDigit(c) || c == '_') {
          +        data[stringIndex++] = c;
          +      }
          +    }
          +
          +    char initial = data[0];
          +    if (Character.isLetter(initial) || initial == '_') {
          +      return new String(data, 0, stringIndex);
          ```

          End diff –

          Your code first creates a char array and then updates chars in the array. As a result, you will create another copy as a new String. Have you thought about using a `StringBuilder` directly?
          ```
          final StringBuilder sb = new StringBuilder();
          for (char c : candidate.toCharArray()) {
            if (Character.isLetterOrDigit(c) || c == '_') {
              sb.append(c);
            }
          }
          ...
          return sb.toString();
          ```


            People

            • Assignee: joeri.hermans (Joeri Hermans)
            • Reporter: joeri.hermans (Joeri Hermans)
            • Votes: 2
            • Watchers: 9
