Hadoop Common
  1. Hadoop Common
  2. HADOOP-2366

Space in the value for dfs.data.dir can cause great problems

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: conf
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The following configuration causes problems:

      <property>
      <name>dfs.data.dir</name>
      <value>/mnt/hstore2/hdfs, /home/foo/dfs</value>
      <description>
      Determines where on the local filesystem an DFS data node should store its bl
      ocks. If this is a comma-delimited list of directories, then data will be stor
      ed in all named directories, typically on different devices. Directories that
      do not exist are ignored.
      </description>
      </property>

      The problem is that the space after the comma causes the second directory for storage to be " /home/foo/dfs" which is in a directory named <SPACE> which contains a sub-dir named "home" in the hadoop datanodes default directory. This will typically cause the user's home partition to fill, but will be very hard for the user to understand since a directory with a whitespace name is hard to understand.

      My proposed solution would be to trimLeft all path names from this and similar property after splitting on comma. This still allows spaces in file and directory names but avoids this problem.

      1. HADOOP-2366.patch
        6 kB
        Michele Catasta

        Issue Links

          Activity

          Hide
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk #13 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/13/)
          . Support trimmed strings in Configuration. Contributed by Michele Catasta

          Show
          Hudson added a comment - Integrated in Hadoop-Common-trunk #13 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/13/ ) . Support trimmed strings in Configuration. Contributed by Michele Catasta
          Hide
          Tsz Wo Nicholas Sze added a comment -

          I have committed this. Thanks, Michele!

          Show
          Tsz Wo Nicholas Sze added a comment - I have committed this. Thanks, Michele!
          Hide
          Michele Catasta added a comment -

          Tsz Wo: build successful for "ant test"

           
          -1 overall.  
          +1 @author.  The patch does not contain any @author tags.
          +1 tests included.  The patch appears to include 3 new or modified tests.
          +1 javadoc.  The javadoc tool did not generate any warning messages.
           -1 javac.  The applied patch generated 64 javac compiler warnings (more than the trunk's current 124 warnings).
          +1 findbugs.  The patch does not introduce any new Findbugs warnings.
          +1 release audit.  The applied patch does not increase the total number of release audit warnings.
          

          for "ant test-patch". No idea about the warnings, probably I messed up something in forrest/findbugs?

          Show
          Michele Catasta added a comment - Tsz Wo: build successful for "ant test" -1 overall. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 64 javac compiler warnings (more than the trunk's current 124 warnings). +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. for "ant test-patch". No idea about the warnings, probably I messed up something in forrest/findbugs?
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Michele, could you run "ant test-patch" and "ant test" over your patch, and then post the results?

          Show
          Tsz Wo Nicholas Sze added a comment - Michele, could you run "ant test-patch" and "ant test" over your patch, and then post the results?
          Hide
          Michele Catasta added a comment -

          Same logic as the previous patch, without the HDFS part.
          I'll create a new issue in the HDFS project linked to this one.

          Show
          Michele Catasta added a comment - Same logic as the previous patch, without the HDFS part. I'll create a new issue in the HDFS project linked to this one.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          After the project split, this problem needs two separated patches, one for Common and the other one for HDFS. I think we should change Configuration and StringUtils in this issue and create a new HDFS issue for the rest.

          Show
          Tsz Wo Nicholas Sze added a comment - After the project split, this problem needs two separated patches, one for Common and the other one for HDFS. I think we should change Configuration and StringUtils in this issue and create a new HDFS issue for the rest.
          Hide
          Michele Catasta added a comment -

          Is there something I should do to trigger again the Hudson build with the new patch?

          Show
          Michele Catasta added a comment - Is there something I should do to trigger again the Hudson build with the new patch?
          Hide
          Todd Lipcon added a comment -

          Can someone with the correct privileges make Michele a contributor and reassign this issue?

          Show
          Todd Lipcon added a comment - Can someone with the correct privileges make Michele a contributor and reassign this issue?
          Hide
          Michele Catasta added a comment -

          As suggested by Raghu, patch now modifies also NameNode behavior.
          NN was using directly getStringCollection(), so I just added a new getTrimmedStringCollection().

          Show
          Michele Catasta added a comment - As suggested by Raghu, patch now modifies also NameNode behavior. NN was using directly getStringCollection(), so I just added a new getTrimmedStringCollection().
          Hide
          Raghu Angadi added a comment -

          New interface sounds better. NameNode also reads a list of directories. Could you check if this fix is required there as well?

          Show
          Raghu Angadi added a comment - New interface sounds better. NameNode also reads a list of directories. Could you check if this fix is required there as well?
          Hide
          Michele Catasta added a comment - - edited

          I attached a separate patch which exposes a getTrimmedStrings() method and modifies only the behavior of the Datanode.
          When the property value is empty, the behavior of getTrimmedStrings is by purpose different from getStrings(). I'd rather pass an empty array which is handled correctly by Datanode.instance(), than a null which causes NPE.
          Just let me know if you want me to fix this behavior and make it consistent with the other methods, mine was just a proposal.

          Todd: I run a whole ant clean test cycle with the other patch, and it fails anyway. I gave a shallow look with greps on getStrings() when I was attaching the first version, and I had your same impression. Actually, after that fix you suggested, I was expecting the tests to run smoothly, but probably there's something missing still. I'll leave the patch attached, in case it could be useful.

          Tsz Wo: thanks for the hints. run-test-core was successful, hope will be the same with the Hudson build.

          Show
          Michele Catasta added a comment - - edited I attached a separate patch which exposes a getTrimmedStrings() method and modifies only the behavior of the Datanode. When the property value is empty, the behavior of getTrimmedStrings is by purpose different from getStrings(). I'd rather pass an empty array which is handled correctly by Datanode.instance(), than a null which causes NPE. Just let me know if you want me to fix this behavior and make it consistent with the other methods, mine was just a proposal. Todd: I run a whole ant clean test cycle with the other patch, and it fails anyway. I gave a shallow look with greps on getStrings() when I was attaching the first version, and I had your same impression. Actually, after that fix you suggested, I was expecting the tests to run smoothly, but probably there's something missing still. I'll leave the patch attached, in case it could be useful. Tsz Wo: thanks for the hints. run-test-core was successful, hope will be the same with the Hudson build.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > If tests pass after fixing the behavior on an empty conf, do you have an issue with changing the semantics of these utility functions so long as the new behavior is clearly documented in the javadoc?

          Configuration is a public class and is not a part of fs or hdfs. Trimming the string values may make sense in fs/hdfs paths but it may not for the other usages. Personally, I wish the trimming was done in the beginning. Unfortunately, it was not. If we change it now, then it breaks existing semantics. I think that users rarely use leading or tailing spaces in configuration values but we cannot break them.

          When I worked on HADOOP-2461, I think that the property names should be trimmed but not the values. Otherwise, it forbids the potential use of leading and trailing spaces. If there is a need, the codes using the conf values should do the trimming. In this issue, only the values for dfs.data.dir should be trimmed.

          If a trimmed version of getStrings(..) is needed, I think it is better to provide new methods, say getTrimmedStrings(..) in Configuration and StringUtils but not changing the existing ones.

          Show
          Tsz Wo Nicholas Sze added a comment - > If tests pass after fixing the behavior on an empty conf, do you have an issue with changing the semantics of these utility functions so long as the new behavior is clearly documented in the javadoc? Configuration is a public class and is not a part of fs or hdfs. Trimming the string values may make sense in fs/hdfs paths but it may not for the other usages. Personally, I wish the trimming was done in the beginning. Unfortunately, it was not. If we change it now, then it breaks existing semantics. I think that users rarely use leading or tailing spaces in configuration values but we cannot break them. When I worked on HADOOP-2461 , I think that the property names should be trimmed but not the values. Otherwise, it forbids the potential use of leading and trailing spaces. If there is a need, the codes using the conf values should do the trimming. In this issue, only the values for dfs.data.dir should be trimmed. If a trimmed version of getStrings(..) is needed, I think it is better to provide new methods, say getTrimmedStrings(..) in Configuration and StringUtils but not changing the existing ones.
          Hide
          Todd Lipcon added a comment -

          Tsz Wo: In reviewing this patch, I grepped for other usages of StringUtils.getStrings and getStringCollection. It appears to me that the only user other than Configuration is StreamJob for handling some command line flags (where the whitespace trimming also makes good sense). Inside Configuration, it's used for getClasses as well as various file-path related things. Looking at uses of conf.getStrings I also can't find any that look like they would cause an issue.

          If tests pass after fixing the behavior on an empty conf, do you have an issue with changing the semantics of these utility functions so long as the new behavior is clearly documented in the javadoc?

          Show
          Todd Lipcon added a comment - Tsz Wo: In reviewing this patch, I grepped for other usages of StringUtils.getStrings and getStringCollection. It appears to me that the only user other than Configuration is StreamJob for handling some command line flags (where the whitespace trimming also makes good sense). Inside Configuration, it's used for getClasses as well as various file-path related things. Looking at uses of conf.getStrings I also can't find any that look like they would cause an issue. If tests pass after fixing the behavior on an empty conf, do you have an issue with changing the semantics of these utility functions so long as the new behavior is clearly documented in the javadoc?
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Hi Michele, StringUtils contain common utility codes, which are not only used in HDFS but also other parts of Hadoop. Therefore, changing the behavior of methods like getStringCollection(..) is very likely to break other existing codes. That's why there were so many tests failed.

          In this issue, I think we should only trim the values of dfs.data.dir, but not all conf values. See also HADOOP-4212.

          Thank you for working on this.

          Show
          Tsz Wo Nicholas Sze added a comment - Hi Michele, StringUtils contain common utility codes, which are not only used in HDFS but also other parts of Hadoop. Therefore, changing the behavior of methods like getStringCollection(..) is very likely to break other existing codes. That's why there were so many tests failed. In this issue, I think we should only trim the values of dfs.data.dir, but not all conf values. See also HADOOP-4212 . Thank you for working on this.
          Hide
          Michele Catasta added a comment -

          Patch modified to make getStrings() behave as before with whitespace-only strings.

          Todd: thanks a lot for the feedback, I'm running an 'ant test' in a local server to see if tests are failing again. Feel free to move back the issue to "In Progress" status - I'll let you know as soon as the build ends.
          Sorry for the naiveness, I didn't know that Hudson was triggered by the "Patch Available" status... skipped an important section in the wiki page, shame on me!

          Show
          Michele Catasta added a comment - Patch modified to make getStrings() behave as before with whitespace-only strings. Todd: thanks a lot for the feedback, I'm running an 'ant test' in a local server to see if tests are failing again. Feel free to move back the issue to "In Progress" status - I'll let you know as soon as the build ends. Sorry for the naiveness, I didn't know that Hudson was triggered by the "Patch Available" status... skipped an important section in the wiki page, shame on me!
          Hide
          Todd Lipcon added a comment -

          Lots of test failures on this one. Michele: can you look into these failures please? Thanks

          Show
          Todd Lipcon added a comment - Lots of test failures on this one. Michele: can you look into these failures please? Thanks
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12410578/HADOOP-2366.patch
          against trunk revision 784663.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/499/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/499/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/499/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/499/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12410578/HADOOP-2366.patch against trunk revision 784663. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/499/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/499/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/499/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/499/console This message is automatically generated.
          Hide
          Todd Lipcon added a comment -

          I think this patch changes the behavior of getStrings when the configured value is the empty string "":

          bsh % System.out.println("".split("\\s*,\\s*").length);
          1
          bsh % System.out.println("".split("\\s*,\\s*")[0]);    
          
          bsh % import java.util.StringTokenizer;
          bsh % StringTokenizer tk = new StringTokenizer("", ",");
          bsh % System.out.println(tk.hasMoreTokens());
          false
          

          maybe the condition just needs the additional !str.trim().equals("")

          Show
          Todd Lipcon added a comment - I think this patch changes the behavior of getStrings when the configured value is the empty string "": bsh % System.out.println("".split("\\s*,\\s*").length); 1 bsh % System.out.println("".split("\\s*,\\s*")[0]); bsh % import java.util.StringTokenizer; bsh % StringTokenizer tk = new StringTokenizer("", ","); bsh % System.out.println(tk.hasMoreTokens()); false maybe the condition just needs the additional !str.trim().equals("")
          Hide
          Michele Catasta added a comment - - edited

          Patch updated, now it's using

          split("\\s*,\\s*")

          @tlipcon: Thanks for the comment! Gotta be honest, I wasn't using the regex because I thought mine was the only way to let getStrings() behave as it was doing before regarding trailing empty tokens.
          Actually, I took a look at the code which is using getStrings(), and throwing away the trailing empty token should not break anything (while helps the users who leave a final comma without any following path). Anyway, to make it behave as it was before, just add a -1 as the second argument of split(). Hope it's OK now

          Show
          Michele Catasta added a comment - - edited Patch updated, now it's using split( "\\s*,\\s*" ) @tlipcon: Thanks for the comment! Gotta be honest, I wasn't using the regex because I thought mine was the only way to let getStrings() behave as it was doing before regarding trailing empty tokens. Actually, I took a look at the code which is using getStrings(), and throwing away the trailing empty token should not break anything (while helps the users who leave a final comma without any following path). Anyway, to make it behave as it was before, just add a -1 as the second argument of split(). Hope it's OK now
          Hide
          Todd Lipcon added a comment -

          Patch looks OK, but why not use str.split("\\s*,
          s*") as suggested by Craig above? Then you can simply use Arrays.asList and not have to iterate over the split array to create an ArrayList.

          Show
          Todd Lipcon added a comment - Patch looks OK, but why not use str.split("\\s*, s*") as suggested by Craig above? Then you can simply use Arrays.asList and not have to iterate over the split array to create an ArrayList.
          Hide
          Michele Catasta added a comment -

          Patch updated, it was adding a warning on StringUtils.

          Show
          Michele Catasta added a comment - Patch updated, it was adding a warning on StringUtils.
          Hide
          Michele Catasta added a comment -

          getStrings() trims leading and trailing whitespace by default now.
          Patch includes also a simple unit test.

          Show
          Michele Catasta added a comment - getStrings() trims leading and trailing whitespace by default now. Patch includes also a simple unit test.
          Hide
          steve_l added a comment -

          And by using XML1.0, you have an explicit restriction on not having any characters < 32 in the name other than, possibly, CR, LF and TAB.

          One thing to consider in testing this is to try some high-UTF8 characters and see that everything works with non-ascii filenames, as that checks assumptions quite well.

          Show
          steve_l added a comment - And by using XML1.0, you have an explicit restriction on not having any characters < 32 in the name other than, possibly, CR, LF and TAB. One thing to consider in testing this is to try some high-UTF8 characters and see that everything works with non-ascii filenames, as that checks assumptions quite well.
          Hide
          Raghu Angadi added a comment -

          +1 for trimming on either side of the dir path.

          It is ok not to have the flexibility to specify directories with spaces at the end. If a user really wants to use such directories, as symlink could be used. Note that we already have restriction that directory names can not have ',' in their name.

          Show
          Raghu Angadi added a comment - +1 for trimming on either side of the dir path. It is ok not to have the flexibility to specify directories with spaces at the end. If a user really wants to use such directories, as symlink could be used. Note that we already have restriction that directory names can not have ',' in their name.
          Hide
          Craig Macdonald added a comment - - edited

          Why stop on trim left?
          IMHO you should assume that the whitespace is part of the delimitation, not of the pathname.

          I'm not sure where the split happens to look at the code, but I suggest:

          String parts[] = value.split("\\s*,\\s*");
          

          And/Or pathnames with space could be placed in quotes which are trimmed off, however, double quote is also an acceptable file/diretory name on GNU/Linux.

          Show
          Craig Macdonald added a comment - - edited Why stop on trim left? IMHO you should assume that the whitespace is part of the delimitation, not of the pathname. I'm not sure where the split happens to look at the code, but I suggest: String parts[] = value.split( "\\s*,\\s*" ); And/Or pathnames with space could be placed in quotes which are trimmed off, however, double quote is also an acceptable file/diretory name on GNU/Linux.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > a directory with a whitespace name is hard to understand.

          You may be right but systems like GNU/Linux support this. For example,

          [latest]$ mkdir " "
          [latest]$ mkdir " "/foo
          [latest]$ cd \ /foo/
          [foo]$ pwd
          /home/tsz/hadoop/latest/ /foo
          
          Show
          Tsz Wo Nicholas Sze added a comment - > a directory with a whitespace name is hard to understand. You may be right but systems like GNU/Linux support this. For example, [latest]$ mkdir " " [latest]$ mkdir " "/foo [latest]$ cd \ /foo/ [foo]$ pwd /home/tsz/hadoop/latest/ /foo

            People

            • Assignee:
              Michele Catasta
              Reporter:
              Ted Dunning
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development