Hadoop Common
  1. Hadoop Common
  2. HADOOP-4212

New lines and leading spaces are not trimmed of a value when configuration is read

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Duplicate
    • Affects Version/s: 0.18.1
    • Fix Version/s: None
    • Component/s: conf
    • Labels:
      None
    • Environment:

      Generic

      Description

      While configuration value is read the leading and trailing spaces and new line characters are taken into account.

      1. HADOOP-4212-TESTCASE.patch
        0.9 kB
        Sreekanth Ramakrishnan
      2. HADOOP-4212-1.patch
        2 kB
        Sreekanth Ramakrishnan

        Issue Links

          Activity

          Hide
          Sreekanth Ramakrishnan added a comment -

          Create following property in configuration file :

            <property>
              <name>test.property.name1</name>
              <value>
                 100
              </value>
            </property>
          

          Try to programatically get the value of property test.property.name1 it would not be equal to "100" but would be equal to "\n100\n"

          Show
          Sreekanth Ramakrishnan added a comment - Create following property in configuration file : <property> <name>test.property.name1</name> <value> 100 </value> </property> Try to programatically get the value of property test.property.name1 it would not be equal to "100" but would be equal to "\n100\n"
          Hide
          Sreekanth Ramakrishnan added a comment -

          Attaching test case used for testing this bug

          Show
          Sreekanth Ramakrishnan added a comment - Attaching test case used for testing this bug
          Hide
          Sreekanth Ramakrishnan added a comment -

          Attaching patch with updated test case.

          • Introducing an attribute for the value called preserve-whitespace . if the value of the attribute is set to true. Then value is read as is with leading and trailing whitespaces. Example :
               <property>
                <name>test.key.withoutwhitespace</name>
                <value>
                  Test Value
                </value>
              </property>
              <property>
              <name>test.key.withwhitespace</name>
                <value preserve-whitespace="true">
                  Test Value
                </value>
              </property>
            

          The value of the test.key.withoutwhitespace would be "Test Value" and test.key.withwhitespace is "\nTest Value\n".

          Show
          Sreekanth Ramakrishnan added a comment - Attaching patch with updated test case. Introducing an attribute for the value called preserve-whitespace . if the value of the attribute is set to true. Then value is read as is with leading and trailing whitespaces. Example : <property> <name>test.key.withoutwhitespace</name> <value> Test Value </value> </property> <property> <name>test.key.withwhitespace</name> <value preserve-whitespace="true"> Test Value </value> </property> The value of the test.key.withoutwhitespace would be "Test Value" and test.key.withwhitespace is "\nTest Value\n".
          Hide
          Tom White added a comment -

          How about using XML's xml:space="preserve" instead (http://www.w3.org/TR/REC-xml/#sec-white-space)?

          Show
          Tom White added a comment - How about using XML's xml:space="preserve" instead ( http://www.w3.org/TR/REC-xml/#sec-white-space)?
          Hide
          Steve Loughran added a comment -

          Are there any places where preserving leading/trailing whitespace is depended on?

          Show
          Steve Loughran added a comment - Are there any places where preserving leading/trailing whitespace is depended on?
          Hide
          Sreekanth Ramakrishnan added a comment -

          Only place it would and should be dependent on is Filesystem path's but does trailing space make sense with respect to them? Won't it be problematic when looking at windows file system?

          With respect to xml:space attribute it would be use only if the parser which we use is a XML validating parser. I dont see that we are using XML validating parser when parsing for the configuration. So should we now parse all configuration using validation turned on?

          Moreover, there is a similar issue reported here HADOOP-2366

          Show
          Sreekanth Ramakrishnan added a comment - Only place it would and should be dependent on is Filesystem path's but does trailing space make sense with respect to them? Won't it be problematic when looking at windows file system? With respect to xml:space attribute it would be use only if the parser which we use is a XML validating parser. I dont see that we are using XML validating parser when parsing for the configuration. So should we now parse all configuration using validation turned on? Moreover, there is a similar issue reported here HADOOP-2366
          Hide
          Steve Loughran added a comment -

          -You can look for the xml:space attribute on any element and act on it; when working with XSD-schema'd docs I think xerces behaves differently when it hits it, but I forget these things.

          -yes, it would cause windows to behave differently and not allow filenames with trailing spaces, or other strings. But I dont see that filenames with trailing spaces and carriage returns do actually make sense, even on windows. Spaces mid-path, maybe, but leading or trailing? Danger.

          FWIW, I'm not using the XML format for our configurations; we use our own configuration format

          http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/components/hadoop/src/org/smartfrog/services/hadoop/components/hadoopconfiguration.sf?view=markup

          Looking at the current declarations, there's nowhere where white space is useful, and there are places (in comma separated lists), where it may already be harmful and need filtering. There may be some inconsistency between filenames (HADOOP-2366) and user group information, where spaces between words are allowed in hadoop.job.ugi. I would propose

          -consistent filtering of spaces wherever lists are taken (strip leading, trailing),
          -trim leading, tailing whitespace

          What may make sense is to allow quoted whitespace, so you could have a list of directories, those in quotes would be passed down as is:

          <name>dfs.data.dir</name>
          <value>/mnt/hstore2/hdfs , "/home/user2/temp hadoop dir"</value>

          This would resolve to a list with two entries ["/mnt/hstore2/hdfs","/home/user2/temp hadoop dir"]

          Show
          Steve Loughran added a comment - -You can look for the xml:space attribute on any element and act on it; when working with XSD-schema'd docs I think xerces behaves differently when it hits it, but I forget these things. -yes, it would cause windows to behave differently and not allow filenames with trailing spaces, or other strings. But I dont see that filenames with trailing spaces and carriage returns do actually make sense, even on windows. Spaces mid-path, maybe, but leading or trailing? Danger. FWIW, I'm not using the XML format for our configurations; we use our own configuration format http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/components/hadoop/src/org/smartfrog/services/hadoop/components/hadoopconfiguration.sf?view=markup Looking at the current declarations, there's nowhere where white space is useful, and there are places (in comma separated lists), where it may already be harmful and need filtering. There may be some inconsistency between filenames ( HADOOP-2366 ) and user group information, where spaces between words are allowed in hadoop.job.ugi. I would propose -consistent filtering of spaces wherever lists are taken (strip leading, trailing), -trim leading, tailing whitespace What may make sense is to allow quoted whitespace, so you could have a list of directories, those in quotes would be passed down as is: <name>dfs.data.dir</name> <value>/mnt/hstore2/hdfs , "/home/user2/temp hadoop dir"</value> This would resolve to a list with two entries ["/mnt/hstore2/hdfs","/home/user2/temp hadoop dir"]
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > Won't it be problematic when looking at windows file system?

          Configuration is not a part of file system API. It should support generic usage.

          When I worked on HADOOP-2461, I think that the property names should be trimmed but not the values. Otherwise, it forbids the potential use of leading and trailing spaces. If there is a need, the codes using the conf values should do the trimming.

          Show
          Tsz Wo Nicholas Sze added a comment - > Won't it be problematic when looking at windows file system? Configuration is not a part of file system API. It should support generic usage. When I worked on HADOOP-2461 , I think that the property names should be trimmed but not the values. Otherwise, it forbids the potential use of leading and trailing spaces. If there is a need, the codes using the conf values should do the trimming.
          Hide
          Sreekanth Ramakrishnan added a comment -

          The most of the use cases, except the filename's don't require a trailing whitespace for instance HADOOP-4416, So shouldn't we just treat file paths as special case and allow users to mention path which have trailing and leading spaces in special way? That way we can make sure that the user intends to use them instead of adding a space by accident?

          Show
          Sreekanth Ramakrishnan added a comment - The most of the use cases, except the filename's don't require a trailing whitespace for instance HADOOP-4416 , So shouldn't we just treat file paths as special case and allow users to mention path which have trailing and leading spaces in special way? That way we can make sure that the user intends to use them instead of adding a space by accident?
          Hide
          Chris Douglas added a comment -

          The most of the use cases, except the filename's don't require a trailing whitespace for instance HADOOP-4416, So shouldn't we just treat file paths as special case and allow users to mention path which have trailing and leading spaces in special way? That way we can make sure that the user intends to use them instead of adding a space by accident?

          I disagree. Configuration should assume the semantics of the value Strings are known to the user defining the property. Where the user instructs the Configuration to interpret the value, the semantics are explicit and sanitizing input is reasonable. In addition to resolving class names, it would be reasonable to trim values interpreted in getInt, getLong, etc., though it would still be an incompatible change (IIRC, the default value is returned for invalid input). It's certainly an incompatible change in the general case. I think we should close this issue as "Won't fix" and consider broadening the scope of HADOOP-4416.

          Show
          Chris Douglas added a comment - The most of the use cases, except the filename's don't require a trailing whitespace for instance HADOOP-4416 , So shouldn't we just treat file paths as special case and allow users to mention path which have trailing and leading spaces in special way? That way we can make sure that the user intends to use them instead of adding a space by accident? I disagree. Configuration should assume the semantics of the value Strings are known to the user defining the property. Where the user instructs the Configuration to interpret the value, the semantics are explicit and sanitizing input is reasonable. In addition to resolving class names, it would be reasonable to trim values interpreted in getInt, getLong, etc., though it would still be an incompatible change (IIRC, the default value is returned for invalid input). It's certainly an incompatible change in the general case. I think we should close this issue as "Won't fix" and consider broadening the scope of HADOOP-4416 .
          Hide
          Sreekanth Ramakrishnan added a comment -

          I think we should have a generic way of reading values than to expect each implementer of new getXXX() method in Configuration or a class which subclasses configuration to remember he has to deal with a leading or a trailing space.

          In addition to resolving class names, it would be reasonable to trim values interpreted in getInt, getLong, etc., though it would still be an incompatible change

          As per the suggestion all that remains untouched in the Configuration getXXX methods are getLocalPath() and getFile(). I still feel that getString() method should be trimmed and be passed to the user. For I have one use case which I am mentioning here:

          An HADOOP administrator configures a job queue: with name A and accidentially adds a space at the end.

          User looks at the ./hadoop queue list and finds out there is a job queue A, and does not notice the extra space which is hidden in output in console and in web. And mentions in his jobconf to submit to job queue A, the system checks for queue information in job sees that there is no queue called A(without space i.e.) and submits the job to the default queue. Which is wrong.

          You might argue that as an implementer, I should do checking with trimming the space, but then again this can cause bugs to due accidental mis-configuration. I would lean in towards trimming the space unless explicitly mentioned not to for getString method.

          Show
          Sreekanth Ramakrishnan added a comment - I think we should have a generic way of reading values than to expect each implementer of new getXXX() method in Configuration or a class which subclasses configuration to remember he has to deal with a leading or a trailing space. In addition to resolving class names, it would be reasonable to trim values interpreted in getInt, getLong, etc., though it would still be an incompatible change As per the suggestion all that remains untouched in the Configuration getXXX methods are getLocalPath() and getFile(). I still feel that getString() method should be trimmed and be passed to the user. For I have one use case which I am mentioning here: An HADOOP administrator configures a job queue: with name A and accidentially adds a space at the end. User looks at the ./hadoop queue list and finds out there is a job queue A, and does not notice the extra space which is hidden in output in console and in web. And mentions in his jobconf to submit to job queue A, the system checks for queue information in job sees that there is no queue called A(without space i.e.) and submits the job to the default queue. Which is wrong. You might argue that as an implementer, I should do checking with trimming the space, but then again this can cause bugs to due accidental mis-configuration. I would lean in towards trimming the space unless explicitly mentioned not to for getString method.
          Hide
          Chris Douglas added a comment -

          User looks at the ./hadoop queue list and finds out there is a job queue A, and does not notice the extra space which is hidden in output in console and in web. And mentions in his jobconf to submit to job queue A, the system checks for queue information in job sees that there is no queue called A(without space i.e.) and submits the job to the default queue. Which is wrong.

          You might argue that as an implementer, I should do checking with trimming the space, but then again this can cause bugs to due accidental mis-configuration.

          I don't follow. Configuration should call trim before returning the queue name, but if the callee(s) were to call trim on the same String, it "can cause bugs [due to] accidental misconfiguration?"

          Show
          Chris Douglas added a comment - User looks at the ./hadoop queue list and finds out there is a job queue A, and does not notice the extra space which is hidden in output in console and in web. And mentions in his jobconf to submit to job queue A, the system checks for queue information in job sees that there is no queue called A(without space i.e.) and submits the job to the default queue. Which is wrong. You might argue that as an implementer, I should do checking with trimming the space, but then again this can cause bugs to due accidental mis-configuration. I don't follow. Configuration should call trim before returning the queue name, but if the callee(s) were to call trim on the same String, it "can cause bugs [due to] accidental misconfiguration?"
          Hide
          Tsz Wo Nicholas Sze added a comment -

          This duplicate HADOOP-2366.

          Show
          Tsz Wo Nicholas Sze added a comment - This duplicate HADOOP-2366 .

            People

            • Assignee:
              Sreekanth Ramakrishnan
              Reporter:
              Sreekanth Ramakrishnan
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development