[SPARK-33401] Vector type column is not possible to create using spark SQL - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.0.1
Fix Version/s: None
Component/s: Spark Core
Labels: None

Description

Create a table with a vector-type column:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.VectorUDT
import org.apache.spark.mllib.linalg.Vectors
case class Test(features: Vector) 
Seq(Test(Vectors.dense(Array(1d, 2d, 3d)))).toDF()
 .write
 .mode("overwrite")
 .saveAsTable("pborshchenko.test_vector_spark_0911_1")

Show the CREATE TABLE statement for the table just created:

spark.sql("SHOW CREATE TABLE pborshchenko.test_vector_spark_0911_1")

Got:

CREATE TABLE `pborshchenko`.`test_vector_spark_0911_1` (
 `features` STRUCT<`type`: TINYINT, `size`: INT, `indices`: ARRAY<INT>, `values`: ARRAY<DOUBLE>>)
USING parquet

Create a table with the same definition from SQL, using the suffix 2 instead of 1 in the table name:

spark.sql("CREATE TABLE `pborshchenko`.`test_vector_spark_0911_2` (\n`features` STRUCT<`type`: TINYINT, `size`: INT, `indices`: ARRAY<INT>, `values`: ARRAY<DOUBLE>>)\nUSING parquet")

Try to insert new values into the table created from SQL:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.VectorUDT
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SaveMode
case class Test(features: Vector)
Seq(Test(Vectors.dense(Array(1d, 2d, 3d)))).toDF()
 .write
 .mode(SaveMode.Append)
 .insertInto("pborshchenko.test_vector_spark_0911_2")

Got:

AnalysisException: Cannot write incompatible data to table '`pborshchenko`.`test_vector_spark_0911_2`':
- Cannot write 'features': struct<type:tinyint,size:int,indices:array<int>,values:array<double>> is incompatible with struct<type:tinyint,size:int,indices:array<int>,values:array<double>>;
- Cannot write 'features': struct<type:tinyint,size:int,indices:array<int>,values:array<double>> is incompatible with struct<type:tinyint,size:int,indices:array<int>,values:array<double>>;
 at org.apache.spark.sql.catalyst.analysis.TableOutputResolver$.resolveOutputColumns(TableOutputResolver.scala:72)
 at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion.org$apache$spark$sql$execution$datasources$PreprocessTableInsertion$$preprocess(rules.scala:467)
 at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion$$anonfun$apply$3.applyOrElse(rules.scala:494)
 at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion$$anonfun$apply$3.applyOrElse(rules.scala:486)
 at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:112)

The cause is that the column of the table created from Spark SQL has the plain STRUCT type rather than the vector type (VectorUDT), even though this struct is the correct underlying representation of the vector type.
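
The mismatch described above can be checked without touching any table. The following is a minimal sketch for spark-shell, assuming spark-mllib is on the classpath; the value names vectorType and storageType are illustrative only. It compares the user-defined vector type with the struct it is stored as. The two print statements are included because a user-defined type appears to render its catalog string from its underlying SQL type, which would explain why both sides of the error message look identical.

```scala
import org.apache.spark.mllib.linalg.VectorUDT
import org.apache.spark.sql.types.{DataType, StructType}

// Type of a DataFrame column that holds mllib Vectors (the write side).
val vectorType: DataType = new VectorUDT()

// Underlying storage struct of the UDT, i.e. the type that
// SHOW CREATE TABLE prints and that the SQL-created table ends up with.
val storageType: StructType = new VectorUDT().sqlType

// Compare how each type is rendered as text.
println(vectorType.catalogString)
println(storageType.catalogString)

// The types themselves are not equal: a UDT is never equal to a StructType.
println(vectorType == storageType) // false
```

Since the SQL CREATE TABLE statement only carries the struct, the table column loses the information that it is a vector, and the write-side check compares a VectorUDT against a plain struct.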

AC: It should be possible to create a table with a vector-type column using Spark SQL and then write to it without any errors.

People

Assignee: Unassigned

Reporter: Pavlo Borshchenko

Votes: 0

Watchers: 2

Dates

Created: 09/Nov/20 17:34

Updated: 10/Nov/20 16:10