Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.10.1
Fix Version/s: None
Component/s: None
Description
Hi,
I am doing a POC to compare different data formats and their performance in terms of serialization/deserialization speed, storage size, compatibility between different languages, etc.
When I serialize a simple Java object list to a Parquet file, it takes 6-7 seconds, whereas serializing the same objects to JSON takes about 100 milliseconds.
Could you help me resolve this issue?
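For reference, the JSON side of the comparison amounts to a plain object-mapper write along these lines (a minimal sketch, assuming Jackson for the JSON side; the output path and harness shape are placeholders, not the exact benchmark code):

// JSON baseline sketch; assumes com.fasterxml.jackson.databind.ObjectMapper
// is on the classpath. The output path is a placeholder.
private void timeJsonWrite(List<Employee> input) throws IOException {
    ObjectMapper mapper = new ObjectMapper();
    long start = System.nanoTime();
    mapper.writeValue(new File("/tmp/data.json"), input);
    System.out.println("JSON write took " + (System.nanoTime() - start) / 1_000_000 + " ms");
}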
My Configuration and code snippet:
Gradle dependencies:
dependencies { ... }
Code snippet:
public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {
    Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet"); // S3 target
    Path path1 = new Path("/Downloads/data_" + compressionCodecName + ".parquet");      // local target
    Class<?> clazz = inputDataToSerialize.get(0).getClass();

    try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
            .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
            .withDataModel(ReflectData.get())
            .withConf(parquetConfiguration)
            .withCompressionCodec(compressionCodecName)
            .withWriteMode(OVERWRITE)
            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
            .build()) {
        for (D input : inputDataToSerialize) {
            writer.write(input);
        }
    }
}
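A minimal sketch of how the Parquet side can be timed against that baseline, assuming the helper sits in the same class as serialize() (SNAPPY is just an example codec, not necessarily the one used):

// Parquet timing sketch; SNAPPY is an arbitrary example codec.
private void timeParquetWrite(List<D> input) throws IOException {
    long start = System.nanoTime();
    serialize(input, CompressionCodecName.SNAPPY);
    System.out.println("Parquet write took " + (System.nanoTime() - start) / 1_000_000 + " ms");
}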
Model Used:
@Data
public class Employee
@Data
public class Address
@Data
public class Zip
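The field layout implied by the setter calls below would look roughly like this (field types are inferred from usage, so treat this as an approximation of the actual model):

// Approximate reconstruction of the model; field types are inferred from the
// setters used in getInputDataToSerialize() and may not match the original.
@Data
public class Employee {
    private UUID id;        // only referenced in the commented-out setId call below
    private int age;
    private String name;
    private Address address;
}

@Data
public class Address {
    private String streetName;
    private String city;
    private Zip zip;
}

@Data
public class Zip {
    private int zip;
    private int ext;
}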
private List<Employee> getInputDataToSerialize() {
    Address address = new Address();
    address.setStreetName("Murry Ridge Dr");
    address.setCity("Murrysville");

    Zip zip = new Zip();
    zip.setZip(15668);
    zip.setExt(1234);
    address.setZip(zip);

    List<Employee> employees = new ArrayList<>();
    IntStream.range(0, 100000).forEach(i -> {
        Employee employee = new Employee();
        // employee.setId(UUID.randomUUID());
        employee.setAge(20);
        employee.setName("Test" + i);
        employee.setAddress(address);
        employees.add(employee);
    });
    return employees;
}
Note:
I have tried writing the data to the local file system as well as to AWS S3, but both give the same result - the write is very slow.