Details
- Type: Improvement
- Status: Closed
- Priority: Minor
- Resolution: Duplicate
- Affects Version/s: 0.10.0
- Fix Version/s: None
- Component/s: None
Description
The Tuple classes (both Java and Scala) only go up to arity 25, so a CSV file with more than 25 columns cannot be loaded directly as a DataSet[TupleX[...]].
An alternative to Tuples is the Table API's Row class, which supports arbitrary-length, arbitrary-type, runtime-supplied schemata (via RowTypeInfo) and index-based field access.
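For illustration, index-based access on such a Row looks roughly like this (a minimal sketch; it assumes the flink-table Row of that era, which is constructed with its arity and exposes setField plus the Product-style productElement):

import org.apache.flink.api.table.Row

// Build a two-field row at runtime and read the fields back by position.
val row = new Row(2) // arity supplied at runtime, not via the type
row.setField(0, "one")
row.setField(1, 1)
println(row.productElement(0)) // "one"
println(row.productElement(1)) // 1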
However, trying to load a CSV file as a DataSet[Row] yields an exception:
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.api.scala._
import org.apache.flink.api.table.Row
import org.apache.flink.api.table.typeinfo.RowTypeInfo
import scala.reflect.ClassTag

val env = ExecutionEnvironment.createLocalEnvironment()
val filePath = "../someCsv.csv"
val typeInfo = new RowTypeInfo(
  Seq(BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.INT_TYPE_INFO),
  Seq("word", "number"))
val source = env.readCsvFile(filePath)(ClassTag(classOf[Row]), typeInfo)
println(source.collect())
with someCsv.csv containing:
one,1
two,2
yields
Exception in thread "main" java.lang.ClassCastException: org.apache.flink.api.table.typeinfo.RowSerializer cannot be cast to org.apache.flink.api.java.typeutils.runtime.TupleSerializerBase
	at org.apache.flink.api.scala.operators.ScalaCsvInputFormat.<init>(ScalaCsvInputFormat.java:46)
	at org.apache.flink.api.scala.ExecutionEnvironment.readCsvFile(ExecutionEnvironment.scala:282)
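In the meantime, a possible workaround is to read the file as plain text and populate Rows manually. A rough sketch (it reuses env, filePath and typeInfo from the snippet above, assumes the Row API noted earlier, and does no real CSV parsing or error handling):

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.scala._
import org.apache.flink.api.table.Row

// Make the runtime schema available as the implicit
// TypeInformation[Row] required by the Scala API's map.
implicit val rowInfo: TypeInformation[Row] = typeInfo

val rows: DataSet[Row] = env.readTextFile(filePath).map { line =>
  val parts = line.split(",") // naive split: no quoting or escaping
  val row = new Row(2)
  row.setField(0, parts(0))
  row.setField(1, parts(1).toInt)
  row
}
println(rows.collect())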
As a user, I would like to be able to load a CSV file into a DataSet[Row], preferably with a convenience method for specifying the schema (RowTypeInfo), without having to resort to the "explicit implicit parameters" syntax or to specify the ClassTag.
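Concretely, a call site along these lines would be ideal (purely hypothetical; no such overload of readCsvFile exists at the time of writing):

// Hypothetical convenience overload: the schema is passed once and
// no ClassTag or explicit implicit-parameter list is needed.
val source: DataSet[Row] = env.readCsvFile(filePath, typeInfo)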
Issue Links
- duplicates FLINK-3901: Create a RowCsvInputFormat to use as default CSV IF in Table API (Closed)