SPARK-11745

Enable more JSON parsing options for parsing non-standard JSON files

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: SQL

    Description

      As a user, I want to be able to read non-standard JSON files. Jackson itself includes a few options that we should allow users to specify:

      • ALLOW_COMMENTS
      • ALLOW_UNQUOTED_FIELD_NAMES
      • ALLOW_SINGLE_QUOTES
      • ALLOW_NUMERIC_LEADING_ZEROS
      • ALLOW_NON_NUMERIC_NUMBERS

      After this change, the following options are still unsupported:

      • ALLOW_YAML_COMMENTS
      • ALLOW_UNQUOTED_CONTROL_CHARS
      • ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER
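
One way to picture the requested behavior is a small translation step from user-supplied string options to the set of parser features to enable. The sketch below is only an illustration under stated assumptions: the nested `Feature` enum is a local stand-in rather than Jackson's own type, and the camel-case option keys (`allowComments` and so on) are hypothetical names chosen for the example, since the ticket text does not spell out the keys.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class JsonOptionsSketch {
    // Local stand-in for the five features this ticket asks to expose.
    // It is NOT Jackson's JsonParser.Feature; the option keys are hypothetical.
    enum Feature {
        ALLOW_COMMENTS("allowComments"),
        ALLOW_UNQUOTED_FIELD_NAMES("allowUnquotedFieldNames"),
        ALLOW_SINGLE_QUOTES("allowSingleQuotes"),
        ALLOW_NUMERIC_LEADING_ZEROS("allowNumericLeadingZeros"),
        ALLOW_NON_NUMERIC_NUMBERS("allowNonNumericNumbers");

        final String optionKey;

        Feature(String optionKey) {
            this.optionKey = optionKey;
        }
    }

    // Returns the features whose option key is present and equals "true"
    // (case-insensitive). A missing key leaves the feature off, which matches
    // the disabled-by-default state of every feature listed in the ticket.
    static Set<Feature> enabledFeatures(Map<String, String> options) {
        Set<Feature> enabled = EnumSet.noneOf(Feature.class);
        for (Feature f : Feature.values()) {
            if (Boolean.parseBoolean(options.get(f.optionKey))) {
                enabled.add(f);
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("allowComments", "true");
        options.put("allowSingleQuotes", "TRUE");
        options.put("allowNumericLeadingZeros", "false");
        System.out.println(enabledFeatures(options));
    }
}
```

Keeping the mapping in one place means that supporting one of the remaining features later would only require adding a constant, assuming the reader passes the resulting set on to the parser configuration.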

      See the Jackson source code pasted below for the definition of these config options:

              /**
               * Feature that determines whether parser will allow use
               * of Java/C++ style comments (both '/'+'*' and
               * '//' varieties) within parsed content or not.
               *<p>
               * Since JSON specification does not mention comments as legal
               * construct,
               * this is a non-standard feature; however, in the wild
               * this is extensively used. As such, feature is
               * <b>disabled by default</b> for parsers and must be
               * explicitly enabled.
               */
              ALLOW_COMMENTS(false),
      
              /**
               * Feature that determines whether parser will allow use
               * of YAML comments, ones starting with '#' and continuing
               * until the end of the line. This commenting style is common
               * with scripting languages as well.
               *<p>
               * Since JSON specification does not mention comments as legal
               * construct,
               * this is a non-standard feature. As such, feature is
               * <b>disabled by default</b> for parsers and must be
               * explicitly enabled.
               */
              ALLOW_YAML_COMMENTS(false),
              
              /**
               * Feature that determines whether parser will allow use
               * of unquoted field names (which is allowed by Javascript,
               * but not by JSON specification).
               *<p>
               * Since JSON specification requires use of double quotes for
               * field names,
               * this is a non-standard feature, and as such disabled by default.
               */
              ALLOW_UNQUOTED_FIELD_NAMES(false),
      
              /**
               * Feature that determines whether parser will allow use
               * of single quotes (apostrophe, character '\'') for
               * quoting Strings (names and String values). If so,
               * this is in addition to other acceptable markers.
               *<p>
               * Since JSON specification requires use of double quotes for
               * field names,
               * this is a non-standard feature, and as such disabled by default.
               */
              ALLOW_SINGLE_QUOTES(false),
      
              /**
               * Feature that determines whether parser will allow
               * JSON Strings to contain unquoted control characters
               * (ASCII characters with value less than 32, including
               * tab and line feed characters) or not.
               * If feature is set false, an exception is thrown if such a
               * character is encountered.
               *<p>
               * Since JSON specification requires quoting for all control characters,
               * this is a non-standard feature, and as such disabled by default.
               */
              ALLOW_UNQUOTED_CONTROL_CHARS(false),
      
              /**
               * Feature that can be enabled to accept quoting of all characters
               * using backslash quoting mechanism: if not enabled, only characters
               * that are explicitly listed by JSON specification can be thus
               * escaped (see JSON spec for small list of these characters)
               *<p>
               * Since JSON specification requires quoting for all control characters,
               * this is a non-standard feature, and as such disabled by default.
               */
              ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER(false),
      
              /**
               * Feature that determines whether parser will allow
               * JSON integral numbers to start with additional (ignorable) 
               * zeroes (like: 000001). If enabled, no exception is thrown, and extra
               * zeroes are silently ignored (and not included in textual representation
               * exposed via {@link JsonParser#getText}).
               *<p>
               * Since JSON specification does not allow leading zeroes,
               * this is a non-standard feature, and as such disabled by default.
               */
              ALLOW_NUMERIC_LEADING_ZEROS(false),
              
              /**
               * Feature that allows parser to recognize set of
               * "Not-a-Number" (NaN) tokens as legal floating number
               * values (similar to how many other data formats and
               * programming language source code allows it).
               * Specific subset contains values that
               * <a href="http://www.w3.org/TR/xmlschema-2/">XML Schema</a>
               * (see section 3.2.4.1, Lexical Representation)
               * allows (tokens are quoted contents, not including quotes):
               *<ul>
               *  <li>"INF" (for positive infinity), as well as alias of "Infinity"
               *  <li>"-INF" (for negative infinity), alias "-Infinity"
               *  <li>"NaN" (for other not-a-numbers, like result of division by zero)
               *</ul>
               *<p>
               * Since JSON specification does not allow use of such values,
               * this is a non-standard feature, and as such disabled by default.
               */
               ALLOW_NON_NUMERIC_NUMBERS(false),
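
Every constant in the excerpt is declared with `(false)`, meaning the feature starts disabled and has to be switched on explicitly. The following is a minimal sketch of that pattern, assuming each constant owns one bit of an integer flag word. The `Feature` enum here is a trimmed, local stand-in with three constants, and the helper methods `collectDefaults`, `enable`, and `disable` on `FeatureMaskSketch` are written for this illustration rather than copied from Jackson.

```java
final class FeatureMaskSketch {
    // Simplified stand-in for the quoted enum: each constant carries its
    // default state and a single-bit mask derived from its ordinal.
    enum Feature {
        ALLOW_COMMENTS(false),
        ALLOW_SINGLE_QUOTES(false),
        ALLOW_NON_NUMERIC_NUMBERS(false);

        private final boolean defaultState;
        private final int mask;

        Feature(boolean defaultState) {
            this.defaultState = defaultState;
            this.mask = 1 << ordinal();
        }

        boolean enabledByDefault() {
            return defaultState;
        }

        int getMask() {
            return mask;
        }

        // True when this feature's bit is set in the given flag word.
        boolean enabledIn(int flags) {
            return (flags & mask) != 0;
        }
    }

    // ORs together the masks of all features that are on by default.
    // With every constant declared (false), the result is 0.
    static int collectDefaults() {
        int flags = 0;
        for (Feature f : Feature.values()) {
            if (f.enabledByDefault()) {
                flags |= f.getMask();
            }
        }
        return flags;
    }

    static int enable(int flags, Feature f) {
        return flags | f.getMask();
    }

    static int disable(int flags, Feature f) {
        return flags & ~f.getMask();
    }

    public static void main(String[] args) {
        int flags = collectDefaults();
        flags = enable(flags, Feature.ALLOW_SINGLE_QUOTES);
        for (Feature f : Feature.values()) {
            System.out.println(f + " enabled: " + f.enabledIn(flags));
        }
    }
}
```

Under this scheme, exposing a feature to users amounts to setting or clearing one bit before the parser is created, which is why the ticket can treat each option as an independent on/off switch.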
      

          People

            Assignee: Reynold Xin (rxin)
            Reporter: Reynold Xin (rxin)