SPARK-11745: Enable more JSON parsing options for parsing non-standard JSON files

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: SQL

      Description

      As a user, I want to be able to read non-standard JSON files. Jackson itself includes a few options that we should allow users to specify:

      • ALLOW_COMMENTS
      • ALLOW_UNQUOTED_FIELD_NAMES
      • ALLOW_SINGLE_QUOTES
      • ALLOW_NUMERIC_LEADING_ZEROS
      • ALLOW_NON_NUMERIC_NUMBERS
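
One way the five options above could be surfaced to users is as boolean string options that get translated into parser features. The sketch below is hypothetical and deliberately avoids depending on Jackson: the `Feature` enum only mirrors the five feature names listed above, and the camelCase option keys, the `JsonOptionsSketch` class, and the `enabledFeatures` helper are assumptions made for illustration, not a confirmed description of the Spark change.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class JsonOptionsSketch {

    /** Local stand-in for the five Jackson features listed above (not Jackson's own enum). */
    public enum Feature {
        ALLOW_COMMENTS("allowComments"),
        ALLOW_UNQUOTED_FIELD_NAMES("allowUnquotedFieldNames"),
        ALLOW_SINGLE_QUOTES("allowSingleQuotes"),
        ALLOW_NUMERIC_LEADING_ZEROS("allowNumericLeadingZeros"),
        ALLOW_NON_NUMERIC_NUMBERS("allowNonNumericNumbers");

        /** Hypothetical user-facing option key for this feature. */
        final String optionKey;

        Feature(String optionKey) {
            this.optionKey = optionKey;
        }
    }

    /**
     * Returns the features whose option is set to "true" (case-insensitive).
     * A missing key leaves the feature disabled, mirroring the (false) defaults
     * in the Jackson definitions.
     */
    public static Set<Feature> enabledFeatures(Map<String, String> options) {
        Set<Feature> enabled = EnumSet.noneOf(Feature.class);
        for (Feature feature : Feature.values()) {
            String value = options.getOrDefault(feature.optionKey, "false");
            if (Boolean.parseBoolean(value)) {
                enabled.add(feature);
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put("allowComments", "true");
        options.put("allowSingleQuotes", "true");
        // Prints the subset of features the user switched on.
        System.out.println(enabledFeatures(options));
    }
}
```

With Jackson on the classpath, each enabled entry would correspond to a call such as `JsonFactory.configure(JsonParser.Feature.ALLOW_COMMENTS, true)` on the factory used to create parsers.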

      After this change, the following options are still unsupported:

      • ALLOW_YAML_COMMENTS
      • ALLOW_UNQUOTED_CONTROL_CHARS
      • ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER

      See the Jackson source code pasted below for the definition of these config options:

      
              /**
               * Feature that determines whether parser will allow use
               * of Java/C++ style comments (both '/'+'*' and
               * '//' varieties) within parsed content or not.
               *<p>
               * Since JSON specification does not mention comments as legal
               * construct,
               * this is a non-standard feature; however, in the wild
               * this is extensively used. As such, feature is
               * <b>disabled by default</b> for parsers and must be
               * explicitly enabled.
               */
              ALLOW_COMMENTS(false),
      
              /**
               * Feature that determines whether parser will allow use
               * of YAML comments, ones starting with '#' and continuing
               * until the end of the line. This commenting style is common
               * with scripting languages as well.
               *<p>
               * Since JSON specification does not mention comments as legal
               * construct,
               * this is a non-standard feature. As such, feature is
               * <b>disabled by default</b> for parsers and must be
               * explicitly enabled.
               */
              ALLOW_YAML_COMMENTS(false),
              
              /**
               * Feature that determines whether parser will allow use
               * of unquoted field names (which is allowed by Javascript,
               * but not by JSON specification).
               *<p>
               * Since JSON specification requires use of double quotes for
               * field names,
               * this is a non-standard feature, and as such disabled by default.
               */
              ALLOW_UNQUOTED_FIELD_NAMES(false),
      
              /**
               * Feature that determines whether parser will allow use
               * of single quotes (apostrophe, character '\'') for
               * quoting Strings (names and String values). If so,
               * this is in addition to other acceptable markers.
               *<p>
               * Since JSON specification requires use of double quotes for
               * field names,
               * this is a non-standard feature, and as such disabled by default.
               */
              ALLOW_SINGLE_QUOTES(false),
      
              /**
               * Feature that determines whether parser will allow
               * JSON Strings to contain unquoted control characters
               * (ASCII characters with value less than 32, including
               * tab and line feed characters) or not.
               * If feature is set false, an exception is thrown if such a
               * character is encountered.
               *<p>
               * Since JSON specification requires quoting for all control characters,
               * this is a non-standard feature, and as such disabled by default.
               */
              ALLOW_UNQUOTED_CONTROL_CHARS(false),
      
              /**
               * Feature that can be enabled to accept quoting of all characters
               * using backslash quoting mechanism: if not enabled, only characters
               * that are explicitly listed by JSON specification can be thus
               * escaped (see JSON spec for small list of these characters)
               *<p>
               * Since JSON specification requires quoting for all control characters,
               * this is a non-standard feature, and as such disabled by default.
               */
              ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER(false),
      
              /**
               * Feature that determines whether parser will allow
               * JSON integral numbers to start with additional (ignorable)
               * zeroes (like: 000001). If enabled, no exception is thrown, and extra
               * zeroes are silently ignored (and not included in textual representation
               * exposed via {@link JsonParser#getText}).
               *<p>
               * Since JSON specification does not allow leading zeroes,
               * this is a non-standard feature, and as such disabled by default.
               */
              ALLOW_NUMERIC_LEADING_ZEROS(false),
              
              /**
               * Feature that allows parser to recognize set of
               * "Not-a-Number" (NaN) tokens as legal floating number
               * values (similar to how many other data formats and
               * programming language source code allows it).
               * Specific subset contains values that
               * <a href="http://www.w3.org/TR/xmlschema-2/">XML Schema</a>
               * (see section 3.2.4.1, Lexical Representation)
               * allows (tokens are quoted contents, not including quotes):
               *<ul>
               *  <li>"INF" (for positive infinity), as well as alias of "Infinity"
               *  <li>"-INF" (for negative infinity), alias "-Infinity"
               *  <li>"NaN" (for other not-a-numbers, like result of division by zero)
               *</ul>
               *<p>
               * Since JSON specification does not allow use of such values,
               * this is a non-standard feature, and as such disabled by default.
               */
              ALLOW_NON_NUMERIC_NUMBERS(false),
      

          Activity

          apachespark Apache Spark added a comment -

          User 'rxin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/9724

          Cazen Cazen Lee added a comment -

          Good day, Reynold Xin. This is Cazen.

          Sorry to bother you with a question, but could you let me know why the ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER option was left unsupported?

          I recently created JIRA issue SPARK-12537 to add support for it, and I wonder whether there is a reason the three options you mentioned were left out.

          Thank you in advance!

          rxin Reynold Xin added a comment -

          No particular reason – I just didn't run into those problems myself.

          The YAML case might not make sense since our JSON can only span one row.


            People

            • Assignee:
              rxin Reynold Xin
              Reporter:
              rxin Reynold Xin
            • Votes:
              0
              Watchers:
              4
