Spark / SPARK-33401

Cannot create a Vector type column using Spark SQL

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

       

      Create a table with a Vector type column:

      import org.apache.spark.mllib.linalg.Vector
      import org.apache.spark.mllib.linalg.VectorUDT
      import org.apache.spark.mllib.linalg.Vectors
      case class Test(features: Vector) 
      Seq(Test(Vectors.dense(Array(1d, 2d, 3d)))).toDF()
       .write
       .mode("overwrite")
       .saveAsTable("pborshchenko.test_vector_spark_0911_1")
      

       

      Show the CREATE TABLE statement for the table just created:

      spark.sql("SHOW CREATE TABLE pborshchenko.test_vector_spark_0911_1")

      Got:

      CREATE TABLE `pborshchenko`.`test_vector_spark_0911_1` (
       `features` STRUCT<`type`: TINYINT, `size`: INT, `indices`: ARRAY<INT>, `values`: ARRAY<DOUBLE>>)
      USING parquet

      Create a second table with the same definition, using the suffix 2 in its name:

      spark.sql("CREATE TABLE `pborshchenko`.`test_vector_spark_0911_2` (\n`features` STRUCT<`type`: TINYINT, `size`: INT, `indices`: ARRAY<INT>, `values`: ARRAY<DOUBLE>>)\nUSING parquet")

      Try to insert new values into the table created from SQL:

       

      import org.apache.spark.mllib.linalg.Vector
      import org.apache.spark.mllib.linalg.VectorUDT
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.sql.SaveMode
      case class Test(features: Vector)
      Seq(Test(Vectors.dense(Array(1d, 2d, 3d)))).toDF()
       .write
       .mode(SaveMode.Append)
       .insertInto("pborshchenko.test_vector_spark_0911_2")
      

       

      Got:
       

      AnalysisException: Cannot write incompatible data to table '`pborshchenko`.`test_vector_spark_0911_2`':
      - Cannot write 'features': struct<type:tinyint,size:int,indices:array<int>,values:array<double>> is incompatible with struct<type:tinyint,size:int,indices:array<int>,values:array<double>>;
      - Cannot write 'features': struct<type:tinyint,size:int,indices:array<int>,values:array<double>> is incompatible with struct<type:tinyint,size:int,indices:array<int>,values:array<double>>;
        at org.apache.spark.sql.catalyst.analysis.TableOutputResolver$.resolveOutputColumns(TableOutputResolver.scala:72)
        at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion.org$apache$spark$sql$execution$datasources$PreprocessTableInsertion$$preprocess(rules.scala:467)
        at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion$$anonfun$apply$3.applyOrElse(rules.scala:494)
        at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion$$anonfun$apply$3.applyOrElse(rules.scala:486)
        at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:112)

       
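      The reason is that the column in the table created from Spark SQL has the type STRUCT rather than the Vector type, even though this struct is the correct underlying representation of the Vector type. Both types print identically, which is why the two sides of the error message look the same.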
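
      The mismatch can be observed without any table. The following is a minimal sketch, assuming a local SparkSession (in spark-shell, `spark` already exists); the object name `VectorSchemaCheck` is made up for illustration. It compares the type the DataFrame carries for the column with the plain struct returned by `VectorUDT.sqlType`.

```scala
import org.apache.spark.mllib.linalg.{Vector, VectorUDT, Vectors}
import org.apache.spark.sql.SparkSession

// Declared at the top level so Spark can derive an encoder for it.
case class Test(features: Vector)

// Hypothetical object name, used only for this illustration.
object VectorSchemaCheck {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not part of the report).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("vector-schema-check")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(Test(Vectors.dense(Array(1d, 2d, 3d)))).toDF()

    // Type that the DataFrame carries for the column: the Vector UDT.
    val dataFrameType = df.schema("features").dataType
    // Storage-level type of the UDT: a plain struct.
    val storageType = new VectorUDT().sqlType

    println(s"DataFrame column is a VectorUDT: ${dataFrameType.isInstanceOf[VectorUDT]}")
    println(s"DataFrame column type prints as: ${dataFrameType.catalogString}")
    println(s"Plain struct type prints as:     ${storageType.catalogString}")
    // The UDT and its storage struct are different DataType objects.
    println(s"Types are equal: ${dataFrameType == storageType}")

    spark.stop()
  }
}
```

      If the two printed type strings match while the equality check reports false, that is consistent with the confusing error message above.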

      AC: It should be possible to create a table with a Vector type column using Spark SQL and then write to it without any errors.
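
      Until that is possible, one possible workaround is to create the empty table through the DataFrame API instead of SQL. This is an untested sketch, assuming that `saveAsTable` records the Vector UDT in the table schema (as the first step of this report suggests). The table name `test_vector_workaround`, the object name, and the local session are assumptions made for illustration.

```scala
import org.apache.spark.mllib.linalg.{Vector, VectorUDT, Vectors}
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.{StructField, StructType}

// Declared at the top level so Spark can derive an encoder for it.
case class Test(features: Vector)

// Hypothetical object name, used only for this illustration.
object VectorTableWorkaround {
  // Hypothetical table name, not taken from the report.
  val tableName = "test_vector_workaround"

  // Schema whose column is declared as the Vector UDT rather than a plain struct.
  val schema: StructType =
    StructType(Seq(StructField("features", new VectorUDT(), nullable = true)))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not part of the report).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("vector-table-workaround")
      .getOrCreate()
    import spark.implicits._

    // Create an empty table from a DataFrame that carries the UDT schema,
    // instead of issuing CREATE TABLE with a STRUCT column.
    spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
      .write
      .mode(SaveMode.Overwrite)
      .format("parquet")
      .saveAsTable(tableName)

    // Append a row the same way the report does for the SQL-created table.
    Seq(Test(Vectors.dense(Array(1d, 2d, 3d)))).toDF()
      .write
      .mode(SaveMode.Append)
      .insertInto(tableName)

    spark.stop()
  }
}
```

      This does not address the acceptance criterion itself, since the table is still not created from SQL.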

          People

            Assignee: Unassigned
            Reporter: Pavlo Borshchenko (clubnikon)