SPARK-42690

Implement CSV/JSON parsing functions

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.0
    • Component/s: Connect
    • Labels: None

    Description

      Implement the following two methods in DataFrameReader:

      /**
       * Loads a `Dataset[String]` storing JSON objects (<a href="http://jsonlines.org/">JSON Lines
       * text format or newline-delimited JSON</a>) and returns the result as a `DataFrame`.
       *
       * Unless the schema is specified using the `schema` function, this function goes through the
       * input once to determine the input schema.
       *
       * @param jsonDataset input Dataset with one JSON object per record
       * @since 3.4.0
       */
      def json(jsonDataset: Dataset[String]): DataFrame

      /**
       * Loads a `Dataset[String]` storing CSV rows and returns the result as a `DataFrame`.
       *
       * If the schema is not specified using the `schema` function and the `inferSchema` option is
       * enabled, this function goes through the input once to determine the input schema.
       *
       * If the schema is not specified using the `schema` function and the `inferSchema` option is
       * disabled, it treats all columns as string types and reads only the first line to determine
       * the names and the number of fields.
       *
       * If `enforceSchema` is set to `false`, only the CSV header in the first line is checked
       * for conformance to the specified or inferred schema.
       *
       * @note if the `header` option is set to `true` when calling this API, all lines identical to
       * the header are removed, if any exist.
       *
       * @param csvDataset input Dataset with one CSV row per record
       * @since 3.4.0
       */
      def csv(csvDataset: Dataset[String]): DataFrame
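
      As a usage sketch of the two methods above, the snippet below wraps them in small helpers that parse in-memory lines. It assumes a `SparkSession` is already available (for example a classic local session built with `master("local[1]")`); how the session is created with the Connect client is outside this sketch. The object name `ParseDatasetExample` and its helper methods are hypothetical and exist only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}

// Hypothetical helper object; only `spark.read.json` / `spark.read.csv`
// on a Dataset[String] are the APIs described in this ticket.
object ParseDatasetExample {

  // Turns a local sequence of strings into a Dataset[String].
  private def toDataset(spark: SparkSession, lines: Seq[String]): Dataset[String] =
    spark.createDataset(lines)(Encoders.STRING)

  // One JSON object per record; the schema is inferred in a pass over the input.
  def parseJson(spark: SparkSession, lines: Seq[String]): DataFrame =
    spark.read.json(toDataset(spark, lines))

  // One CSV row per record; the first line is treated as the header,
  // and column types are inferred because `inferSchema` is enabled.
  def parseCsv(spark: SparkSession, lines: Seq[String]): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(toDataset(spark, lines))
}
```

      For example, `ParseDatasetExample.parseCsv(spark, Seq("name,age", "a,1", "b,2"))` should yield a two-row `DataFrame` with the columns `name` and `age`. Without the `inferSchema` option, both columns would be read as strings.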
      For this we need a new message in the Spark Connect protocol. We cannot reuse the existing Project relation, because the schema of the parsed output is not known up front.

      message Parse {
        // (Required) Input relation to Parse. The input is expected to have a single text column.
        Relation input = 1;
        // (Required) The expected format of the text.
        ParseFormat format = 2;
        enum ParseFormat {
          PARSE_FORMAT_UNSPECIFIED = 0;
          PARSE_FORMAT_CSV = 1;
          PARSE_FORMAT_JSON = 2;
        }
      }
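
      To illustrate how a consumer of this message might dispatch on `format`, here is a minimal, self-contained sketch. It does not use the generated protobuf classes: `ParseFormat`, `TextRelation`, `Parse`, and `ParsePlanner` below are hypothetical stand-ins that mirror the message shape, and the actual server-side handling may differ.

```scala
// Hypothetical mirror of the proposed enum; not the generated proto class.
sealed trait ParseFormat
object ParseFormat {
  case object Unspecified extends ParseFormat
  case object Csv extends ParseFormat
  case object Json extends ParseFormat
}

// Stand-in for an input relation with a single text column.
final case class TextRelation(lines: Seq[String])

// Stand-in for the Parse message: an input relation plus the expected format.
final case class Parse(input: TextRelation, format: ParseFormat)

object ParsePlanner {
  // Picks the reader to use for a Parse message, or reports an error
  // when the required `format` field was left unspecified.
  def readerFor(parse: Parse): Either[String, String] = parse.format match {
    case ParseFormat.Csv         => Right("csv")
    case ParseFormat.Json        => Right("json")
    case ParseFormat.Unspecified => Left("Parse.format must be CSV or JSON")
  }
}
```

      Keeping an explicit unspecified value at 0 follows the usual proto3 enum convention, so a message that omits the required format can be rejected instead of being silently treated as CSV.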
          People

            Assignee: LuciferYang (Yang Jie)
            Reporter: hvanhovell (Herman van Hövell)