Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.6.0
    • Component/s: Deployment
    • Labels:
      None

      Description

      There are a few things that can be improved regarding the HDFS helper script introduced by BIGTOP-547. Some of these issues have been discussed in BIGTOP-637.

      1. The script seems to create user directories for the users "root" and "jenkins" but not for the current user running it. I think it would be a good idea to add commands to the script that create the /user/$USER directory in HDFS. Of course, we should be careful in case the user running the command is root or jenkins; in that case a plain mkdir would throw an error, given that those directories already exist (a sketch of such a guard follows below).
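
      A minimal sketch of such a guard, assuming the script keeps its current sudo-to-hdfs approach (hadoop fs -test, -mkdir and -chown are standard FsShell options):

          # Create the invoking user's HDFS home directory only if it is missing,
          # so re-running the script as root or jenkins does not fail on an
          # already-existing directory.
          if ! sudo -u hdfs hadoop fs -test -d "/user/$USER"; then
            sudo -u hdfs hadoop fs -mkdir "/user/$USER"
            sudo -u hdfs hadoop fs -chown "$USER" "/user/$USER"
          fi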

      2. The script uses sudo, which requires a login shell. However, login shells are not available in certain use cases, like init scripts. Consequently, we should consider using su instead (see the sketch below).
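
      For illustration, a hedged sketch of the substitution (the shell and command are only examples; su -s specifies the shell explicitly, which matters for system accounts like hdfs whose login shell may be set to nologin):

          # Instead of:
          #   sudo -u hdfs hadoop fs -mkdir /user/$USER
          # run the command through su with an explicit shell:
          su -s /bin/bash hdfs -c "hadoop fs -mkdir /user/$USER"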

      3. Look into how the helper script can be made faster.

      Folks, please feel free to add/edit if there is something I missed.

      1. 0001-BIGTOP-852.-Improve-HDFS-helper-script.patch
        1 kB
        Bruno Mahé
      2. untarHdfs.groovy
        2 kB
        Roman Shaposhnik

          Activity

          Konstantin Boudnik added a comment -

          Here are a few points to start with:

          • Do not create the user 'jenkins' in the package: we shouldn't shape the file system this way for a user without knowing the use case.
          • Lose the shell-outs: use a Groovy script to call the file system API directly for the metadata and check operations. You will easily speed it up by about 300% (see the sketch after this list).
          • The script in its current form doesn't address the lack of YARN-required directories, so you won't be able to use HDFS shaped this way to run MR jobs. In fact, the nodemanager won't even start. The original patch https://issues.apache.org/jira/secure/attachment/12569655/BIGTOP-637.patch has all the details.
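
          A rough illustration of where the shell-out cost comes from (the directory names are only examples): every hadoop fs invocation pays a full JVM startup, so even batching paths into one call helps, and a single long-lived process using the FileSystem API avoids the repeated startups entirely.

              # Each invocation starts a fresh JVM, so this pays startup three times:
              sudo -u hdfs hadoop fs -mkdir /tmp
              sudo -u hdfs hadoop fs -mkdir /user
              sudo -u hdfs hadoop fs -mkdir /var/log
              # FsShell mkdir accepts multiple paths, so one invocation suffices:
              sudo -u hdfs hadoop fs -mkdir /tmp /user /var/log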
          Mark Grover added a comment - edited

          Thanks, cos.

          I found another problem: the /usr/lib/hive/lib directory contains a symlink:
          hbase.jar -> ../../hbase/hbase.jar

          This symlink exists because of the Hive-HBase integration; however, when Hive is installed without HBase, the link is broken.

          Now, the helper script tries to copy the jars from /usr/lib/hive/lib over to Oozie's HDFS directory by doing

          if ls /usr/lib/hive/lib/*.jar &> /dev/null; then
            sudo -u hdfs hadoop fs -put /usr/lib/hive/lib/*.jar /user/oozie/share/lib/hive
          fi
          

          However, this fails on the broken symlink, aborting the helper script prematurely, since it has set -ex at the top (a possible workaround is sketched below).
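
          A minimal sketch of a workaround, assuming we keep the bash approach (paths as in the snippet above): copy only the jars whose symlink targets actually resolve.

              # Copy only jars that resolve to a real file, skipping dangling
              # symlinks such as hbase.jar when HBase is not installed.
              for jar in /usr/lib/hive/lib/*.jar; do
                if [ -e "$jar" ]; then   # -e follows symlinks; false for broken ones
                  sudo -u hdfs hadoop fs -put "$jar" /user/oozie/share/lib/hive
                fi
              done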

          Konstantin Boudnik added a comment - edited

          Wow, this is a mess. I would recommend moving this to a separate helper script - this kind of functionality clearly goes far beyond preparing directories in HDFS. E.g. it exposes knowledge about Oozie's dependencies to a tiny script that only has to do a bunch of mkdirs. In fact, I would even insist on removing this part from the script in question.

          That's a real issue, unlike the dups of 3 mkdir commands we've been arguing over elsewhere.

          Roman Shaposhnik added a comment -

          Cos has a good point; we've got two use cases that need to be addressed here:

          1. initialization of the skeleton directory structure (this is akin to what something like base-files http://packages.ubuntu.com/hardy/base-files does for a real Linux root fs)
          2. a sort of rsync between the local FS and HDFS
          Bruno Mahé added a comment -

          Right now, as a first pass, I would rather just ensure YARN is set up correctly so we can unblock BIGTOP-637. We can always open another ticket or discussion regarding a better long-term solution.

          To pile on the others' suggestions regarding a longer-term solution:

          • I don't want to have a jenkins user or some fancy magic to detect the current user. I just want to pass a list of users to that helper script. Maybe it should be a completely different script if that's easier.
          • I don't want that script to set up all the services. I just want to pass a list of services to be initialized.
          • As Cos suggested, we may want to use some other language (Groovy wouldn't be my first choice) instead of individual calls to the hadoop command.
          Konstantin Boudnik added a comment - edited

          we may want to use some other language (Groovy wouldn't be my first choice)

          I think the question here is performance: shell is as good a language as Python, but neither of them can work with Hadoop's Java APIs. Hence, the choice.

          Bruno Mahé added a comment - edited

          I think the question here is performance: shell is as good a language as Python, but neither of them can work with Hadoop's Java APIs. Hence, the choice.

          I am also suggesting we use something other than bash, for performance reasons. No disagreement there.
          But Groovy would still not be my choice, and there are other JVM-based languages. The language does not matter, and I will let whoever resolves this ticket pick whatever language (s)he wants, as long as overall performance improves (and it does not pull in crazy/heavy dependencies).

          Mark Grover added a comment -

          Let me take a jab at this. I will most likely end up using Groovy. If someone wants to steal this before I get to it, feel free to do so.
          Thanks again for all the feedback.

          Bruno Mahé added a comment -

          Mark> Have you made any progress on this?
          I would like to get the VM into a working shape soon, and am wondering whether I should fix the missing directories for YARN or just wait for v2 of your script. What would be your time frame?

          Mark Grover added a comment -

          Bruno, sorry, I haven't gotten around to this. I doubt I will be able to get to it before the end of this week. I'd say please go ahead with whatever you need to do. If there are any changes that I decide to make, I will make sure they are on top of whatever you submit.
          Thanks!

          Bruno Mahé added a comment -

          Thanks!
          I will get something going so we can have a working VM soon. We will replace the script with your new version whenever you are ready.

          Roman Shaposhnik added a comment -

          Here's what I have in mind (attaching a rough prototype). This script lets you untar a tar archive directly into HDFS and seems to be pretty efficient. Just like the regular tar command, it preserves perms/ownership, which seems like exactly what we need.

          I'm thinking that I can either turn it into a Java program and drop it into the hdfs project during the Hadoop build, or we can start growing a collection of Groovy scripts in Bigtop.

          Now, if we go the Groovy route, we'd also have to import the Groovy jar and decide where to keep it and all the Bigtop scripts. I propose /usr/lib/bigtop/lib for jars (including Groovy) and /usr/lib/bigtop/bin for scripts.

          Anyway, let me know what you guys think.

          Konstantin Boudnik added a comment -

          Love it! +1

          My understanding is that the tar file will contain an empty directory structure of everything we'll need in HDFS, right?
          I only have some doubts about the following line, for two reasons:

              fs.setOwner(file, entry.getUserName(), entry.getGroupName());

          • ownership will totally depend on the way the tar was created (see the sketch after this list)
          • permissions of the files/directories aren't clearly set. Is preservation of the original permissions guaranteed by untar?
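
          To illustrate the first concern, a hedged sketch of how the recorded ownership is fixed at archive-creation time (GNU tar flags; the archive name and paths are only examples):

              # Whatever owner/group tar records is what setOwner() will replay
              # into HDFS, so the skeleton tar must be built with the intended
              # owners explicitly:
              tar --owner=hdfs --group=hadoop -cf hdfs-skeleton.tar ./user ./tmp ./var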

          Very minor: the usage check calls println, whereas the while loop calls System.out.printf instead of its shortcut printf.

          Mark Grover added a comment -

          Took a quick look, will take a better look later. Can you please update the usage message to correctly reflect the name untarHdfs instead of hdfsuntar?

          Bruno Mahé added a comment -

          Seems like a nice start.
          I know it's a rough prototype, but some comments just in case:

          • It's missing the license header
          • I would rather put code inside objects/functions to avoid ending up with code all over the place
          • Need proper arg parsing
          • Shouldn't we check that the destination directory does not exist before doing anything?
          • I am no groovy expert, but doesn't the script need a shebang?

          Now, if we go the Groovy route, we'd also have to import the Groovy jar and decide where to keep it and all the Bigtop scripts. I propose /usr/lib/bigtop/lib for jars (including Groovy) and /usr/lib/bigtop/bin for scripts.

          Having some util scripts as part of Apache Bigtop is very tempting, but I am not really enjoying the idea of having to provide Groovy. First of all, the layout you propose is very Groovy-centric (unless you mean something like /usr/lib/bigtop/lib/groovy as the root of all Groovy things). If I contribute another script in Scala or Jython (for the sake of the example), we may end up with conflicting jars. We would also end up having to maintain new language environments, and distributing entire language environments seems quite scary. I would rather distribute simple class files for now.

          Bruno Mahé added a comment -

          Here is the patch to fix the script so pi jobs can run.

          Roman Shaposhnik added a comment -

          +1 to the update for the current script

          Peter Linnell added a comment -

          +1 on Bruno's patch


            People

            • Assignee: Bruno Mahé
            • Reporter: Mark Grover
            • Votes: 0
            • Watchers: 5
