Details
Type: Improvement
Status: Patch Available
Priority: Major
Resolution: Unresolved
Description
I ran parquet-cli's convert-csv on an input file whose name starts with a digit, without the --schema option, and got the following error:
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main convert-csv 0sample.csv -o sample.parquet
Unknown error
shaded.parquet.org.apache.avro.SchemaParseException: Illegal initial character: 0sample
at shaded.parquet.org.apache.avro.Schema.validateName(Schema.java:1498)
at shaded.parquet.org.apache.avro.Schema.access$200(Schema.java:86)
at shaded.parquet.org.apache.avro.Schema$Name.<init>(Schema.java:645)
at shaded.parquet.org.apache.avro.Schema.createRecord(Schema.java:182)
at shaded.parquet.org.apache.avro.SchemaBuilder$RecordBuilder.fields(SchemaBuilder.java:1805)
at org.apache.parquet.cli.csv.AvroCSV.inferSchemaInternal(AvroCSV.java:158)
at org.apache.parquet.cli.csv.AvroCSV.inferNullableSchema(AvroCSV.java:78)
at org.apache.parquet.cli.commands.ConvertCSVCommand.run(ConvertCSVCommand.java:160)
at org.apache.parquet.cli.Main.run(Main.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.parquet.cli.Main.main(Main.java:177)
This happens because convert-csv uses the input file name (without its extension) as the name of the inferred schema, while Avro requires a schema name to match the regex pattern [A-Za-z_][A-Za-z0-9_]*.
Users therefore have to rename the input file or pass the --schema option explicitly, but the error message does not make that obvious.
It would be nice if the message were improved, or if invalid characters in the schema name were replaced automatically to avoid this problem.
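As an illustration of the second suggestion, here is a minimal sketch of how a file name could be turned into a name that matches the pattern above. It is not the attached patch and not existing parquet-cli code: the class SchemaNameSanitizer and its methods are hypothetical, and the choice of "_" as the replacement and prefix character is an assumption.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of parquet-cli or Avro.
final class SchemaNameSanitizer {
    // The name pattern quoted in the description above.
    private static final Pattern VALID_NAME =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    private SchemaNameSanitizer() {}

    // Returns true if the name already matches the pattern.
    static boolean isValid(String name) {
        return VALID_NAME.matcher(name).matches();
    }

    // Derives a schema name from a file name such as "0sample.csv".
    static String sanitize(String fileName) {
        // Drop the extension, if any (a leading dot is not treated as one).
        int dot = fileName.lastIndexOf('.');
        String base = dot > 0 ? fileName.substring(0, dot) : fileName;

        // Replace every character outside [A-Za-z0-9_] with an underscore.
        String name = base.replaceAll("[^A-Za-z0-9_]", "_");

        // A name must not be empty or start with a digit, so prefix an
        // underscore in those cases (assumed convention).
        if (name.isEmpty() || (name.charAt(0) >= '0' && name.charAt(0) <= '9')) {
            name = "_" + name;
        }
        return name;
    }
}
```

With this sketch, SchemaNameSanitizer.sanitize("0sample.csv") returns "_0sample", which passes isValid. A name that is already valid, such as "sample", is returned unchanged, so existing behavior would be preserved for files that work today.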