It would be great to have an option in Spark's schema inference that does not convert a column with leading zeros to the int/long datatype. Think of zip codes, for example.
df = (sqlc.read.format('csv')
      .option('inferSchema', True)
      .option('header', True)
      .option('delimiter', '|')
      .option('leadingZeros', 'KEEP')  # this is the new proposed option
      .option('mode', 'FAILFAST')
      .load('csvfile_withzipcodes_to_ingest.csv')
      )

Data with leading zeros is typically used for identifiers. Converting such values to int/long defeats the purpose of inferSchema. The conversion should be controlled by a flag that specifies whether such data is converted to int/long or not.
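To make the requested behavior concrete, here is a minimal sketch in plain Python of how such a rule might decide a column's type. It is not Spark's inference code: the function `infer_column_type` and the `keep_leading_zeros` flag are hypothetical names used only for illustration, and returning 'string' for an empty column is an assumption of this sketch.

```python
import re

# Hypothetical helper, not part of Spark: decides a type for one CSV column.
_INT_RE = re.compile(r"[+-]?\d+")            # any integer literal
_LEADING_ZERO_RE = re.compile(r"[+-]?0\d+")  # integer literal with a leading zero


def infer_column_type(values, keep_leading_zeros=True):
    """Return 'long' if every non-empty value is an integer literal, else 'string'.

    With keep_leading_zeros=True, a value such as '02134' forces 'string',
    so identifiers like zip codes keep their leading zeros.
    """
    non_empty = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not non_empty:
        # Assumption for this sketch: an empty column is treated as string.
        return "string"
    for v in non_empty:
        if not _INT_RE.fullmatch(v):
            return "string"
        if keep_leading_zeros and _LEADING_ZERO_RE.fullmatch(v):
            return "string"
    return "long"
```

With the flag on, a column containing '02134' is kept as a string; with the flag off, the same column is treated as numeric, which is the behavior this ticket describes. A lone '0' is not considered to have a leading zero.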
- is a clone of
SPARK-21978 schemaInference option not to convert strings with leading zeros to int/long