SPARK-30334

Add metadata around semi-structured columns to Spark

    Details

    • Type: New Feature
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.4
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None
    • Target Version/s:

      Description

      Semi-structured data is widely used in the data industry to report events in a variety of formats. Click events in product analytics can be stored as JSON, some application logs take the form of delimited key=value text, and other data may be in XML.

      The goal of this project is to let users signal to Spark that a column holds semi-structured data, so that Spark can "auto-parse" the column on the fly. The proposal is to store this information in the column metadata, in two fields:

       - format: The format of the semi-structured column, e.g. json, xml, avro

       - options: Options for parsing these columns
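
      As a rough illustration of the two fields above, the metadata could be modelled as a small dictionary. This is only a sketch in plain Python, not Spark code: the field names "format" and "options" come from the proposal, while the helper semi_structured_metadata, the format name "kv", and the option keys pairDelimiter and keyValueDelimiter are hypothetical names chosen for the example.

```python
import json


def semi_structured_metadata(fmt, options=None):
    """Build the metadata dict for a semi-structured column.

    Only the keys "format" and "options" come from the proposal;
    everything else here is an assumption made for illustration.
    """
    return {"format": fmt, "options": dict(options or {})}


# A JSON column that needs no extra parsing options.
json_meta = semi_structured_metadata("json")

# A delimited key=value column; the format name and option keys are hypothetical.
kv_meta = semi_structured_metadata(
    "kv", {"pairDelimiter": "|", "keyValueDelimiter": "="}
)

# Metadata built from strings and nested dicts survives a JSON round trip,
# so it can be persisted next to the column definition.
restored = json.loads(json.dumps(kv_meta))
```

      Keeping the parsing options in a nested dictionary means a new format can bring its own options without changing the shape of the metadata.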

      For example, given the following data:

      +------------+-------+--------------------+
      |     ts     | event |        raw         |
      +------------+-------+--------------------+
      | 2019-10-12 | click | {"field":"value"}  |
      +------------+-------+--------------------+ 

      SELECT raw.field FROM data

      will return "value".

      Similarly, given the following data:

      +------------+-------+----------------------+
      |     ts     | event |         raw          |
      +------------+-------+----------------------+
      | 2019-10-12 | click | field1=v1|field2=v2  |
      +------------+-------+----------------------+ 

      SELECT raw.field1 FROM data

      will return v1.
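
      The two lookups above can be mimicked outside Spark to show the intended behavior. The sketch below is plain Python under stated assumptions: the format name "kv", the option keys, and the functions parse_json, parse_kv, and resolve_field are all hypothetical and are not part of Spark or of this proposal.

```python
import json


def parse_json(raw, options):
    """Parse a JSON string into a dict; options are unused in this sketch."""
    return json.loads(raw)


def parse_kv(raw, options):
    """Parse delimited key=value text, e.g. 'field1=v1|field2=v2'."""
    pair_delim = options.get("pairDelimiter", "|")
    kv_delim = options.get("keyValueDelimiter", "=")
    # Pairs that lack the key/value delimiter are skipped rather than raising.
    return dict(
        pair.split(kv_delim, 1)
        for pair in raw.split(pair_delim)
        if kv_delim in pair
    )


# Dispatch table from the metadata "format" field to a parser.
PARSERS = {"json": parse_json, "kv": parse_kv}


def resolve_field(raw, metadata, field):
    """Parse `raw` according to its column metadata and look up `field`."""
    parser = PARSERS[metadata["format"]]
    parsed = parser(raw, metadata.get("options", {}))
    return parsed.get(field)


json_meta = {"format": "json", "options": {}}
kv_meta = {
    "format": "kv",
    "options": {"pairDelimiter": "|", "keyValueDelimiter": "="},
}

# Mirrors SELECT raw.field FROM data on the first table.
json_result = resolve_field('{"field":"value"}', json_meta, "field")

# Mirrors SELECT raw.field1 FROM data on the second table.
kv_result = resolve_field("field1=v1|field2=v2", kv_meta, "field1")
```

      In this sketch a field that is missing from the parsed value resolves to None rather than raising an error. The proposal does not say how Spark should handle that case.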

       

      As a first step, we will introduce the function "as_json", which accomplishes this for JSON columns.

              People

              • Assignee:
                Unassigned
                Reporter:
                brkyvz Burak Yavuz
              • Votes:
                0
                Watchers:
                1
