  1. Spark
  2. SPARK-20493

De-duplicate parsing logic for DDL-like type strings in R

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.3.0
    • Component/s: SparkR
    • Labels:
      None
    • Target Version/s:

      Description

      It seems we are using SQLUtils.getSQLDataType[1] to parse the type string in structField.

      It looks like we can replace this with CatalystSqlParser.parseDataType[2].

      Both appear to accept similar DDL-like type definitions, as shown below:

      scala> Seq(Tuple1(Tuple1("a"))).toDF.show()
      +---+
      | _1|
      +---+
      |[a]|
      +---+
      
      scala> Seq(Tuple1(Tuple1("a"))).toDF.select($"_1".cast("struct<_1:string>")).show()
      +---+
      | _1|
      +---+
      |[a]|
      +---+
      

      Such type strings look identical to the ones used in R, as shown below:

      > write.df(sql("SELECT named_struct('_1', 'a') as struct"), "/tmp/aa", "parquet")
      > collect(read.df("/tmp/aa", "parquet", structType(structField("struct", "struct<_1:string>"))))
        struct
      1      a
      

      It seems the R side is stricter because we check the types via regular expressions[3] in R.

      The actual logic there looks a bit different, but since we check the type ahead of time on the R side, it looks like replacing it would not introduce any behaviour changes.

      To make sure of this, dedicated tests were added in SPARK-20105.
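
      The proposed replacement can be sketched as follows. This is only a minimal illustration, assuming a spark-shell session with the Catalyst classes on the classpath; it is not the actual patch. It parses the same type string used in the examples above with CatalystSqlParser.parseDataType and compares the result with a hand-built StructType (the names parsed and expected are illustrative only):

      ```scala
      import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
      import org.apache.spark.sql.types._

      // Parse the DDL-like type string used in the examples above.
      val parsed: DataType = CatalystSqlParser.parseDataType("struct<_1:string>")

      // The same type built by hand, for comparison (fields are nullable by default).
      val expected: DataType = StructType(Seq(StructField("_1", StringType)))

      // The parsed type should match the hand-built one.
      assert(parsed == expected)
      ```

      If the parser returns the same DataType that the R-specific code path would produce for every type string that passes the R-side check, the swap should be transparent to SparkR users.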

      [1] https://github.com/apache/spark/blob/d1f6c64c4b763c05d6d79ae5497f298dc3835f3e/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L93-L131
      [2] https://github.com/apache/spark/blob/1472cac4bb31c1886f82830778d34c4dd9030d7a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala#L36-L40
      [3] https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/R/pkg/R/schema.R#L129-L187

            People

            • Assignee:
              hyukjin.kwon Hyukjin Kwon
              Reporter:
              hyukjin.kwon Hyukjin Kwon