Parquet / PARQUET-1680

Parquet Java Serialization is very slow

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.10.1
    • Fix Version/s: None
    • Component/s: parquet-avro, parquet-mr
    • Labels: None

    Description

      Hi,
      I am doing a POC to compare different data formats and their performance in terms of serialization/deserialization speed, storage size, compatibility across languages, etc.
      When I serialize a list of simple Java objects to a Parquet file, it takes 6-7 seconds, whereas serializing the same objects to JSON takes about 100 milliseconds.

      Could you help me resolve this issue?
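
A single timed call on a cold JVM also pays for class loading and JIT compilation, so one way to make the Parquet and JSON numbers comparable is to run each serializer a few times before timing it. The following is a minimal sketch using only the JDK. `TimingHarness`, `Task`, `medianMillis`, and the run counts are hypothetical names and values, not part of Parquet or Avro, and the placeholder workload stands in for a call such as `serialize(...)`.

```java
import java.util.Arrays;

/** Hypothetical helper: times a task after discarding warm-up runs. */
final class TimingHarness {

    /** A task that may throw, so I/O code such as a file write can be passed directly. */
    interface Task {
        void run() throws Exception;
    }

    /**
     * Runs the task warmupRuns times without timing it, then measuredRuns times
     * with timing, and returns the median elapsed time in milliseconds.
     */
    static double medianMillis(Task task, int warmupRuns, int measuredRuns) throws Exception {
        if (measuredRuns < 1) {
            throw new IllegalArgumentException("measuredRuns must be at least 1");
        }
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // class loading and JIT compilation happen outside the measurement
        }
        long[] nanos = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        Arrays.sort(nanos);
        int mid = measuredRuns / 2;
        // For an even number of runs, average the two middle values.
        long median = (measuredRuns % 2 == 1) ? nanos[mid] : (nanos[mid - 1] + nanos[mid]) / 2;
        return median / 1_000_000.0;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder workload; replace the lambda body with the serializer call under test.
        StringBuilder sink = new StringBuilder();
        double ms = medianMillis(() -> {
            sink.setLength(0);
            for (int i = 0; i < 100_000; i++) {
                sink.append("Test").append(i);
            }
        }, 3, 5);
        System.out.println("median ms: " + ms);
    }
}
```

If the warmed-up median is much lower than the first-call time, a large share of the reported gap is likely one-time startup cost rather than per-record work.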

      My configuration and code snippet:

      Gradle dependencies:

      dependencies {
          compile group: 'org.springframework.boot', name: 'spring-boot-starter'
          compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
          compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
          compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
          compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
          compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
          compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
          compile group: 'joda-time', name: 'joda-time'
          compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
          compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
      }

      Code snippet:

      public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {

          // S3 target (the writer below uses the local path1)
          Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
          // Local file system target
          Path path1 = new Path("/Downloads/data_" + compressionCodecName + ".parquet");
          Class<?> clazz = inputDataToSerialize.get(0).getClass();

          try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
                  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
                  .withDataModel(ReflectData.get())
                  .withConf(parquetConfiguration)
                  .withCompressionCodec(compressionCodecName)
                  .withWriteMode(OVERWRITE)
                  .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
                  .build()) {

              for (D input : inputDataToSerialize) {
                  writer.write(input);
              }
          }
      }
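
To see where the 6-7 seconds go, it may help to time the stages of `serialize` separately: schema generation via reflection, building the writer, the write loop, and closing the writer. A minimal sketch using only the JDK follows. `PhaseTimer` and the phase names are hypothetical, and the stand-in lambdas only mark where the corresponding calls from the snippet above would go.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

/** Hypothetical helper: accumulates elapsed time per named phase. */
final class PhaseTimer {
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();

    /** Runs the body, adds its elapsed time to the named phase, and returns its result. */
    <T> T time(String phase, Callable<T> body) throws Exception {
        long start = System.nanoTime();
        try {
            return body.call();
        } finally {
            // Recorded even if the body throws, so failed phases still show up.
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    /** Elapsed milliseconds recorded for a phase, or 0 if the phase never ran. */
    double millis(String phase) {
        return nanosByPhase.getOrDefault(phase, 0L) / 1_000_000.0;
    }

    /** Phase names in the order they were first recorded. */
    Iterable<String> phases() {
        return nanosByPhase.keySet();
    }

    public static void main(String[] args) throws Exception {
        PhaseTimer timer = new PhaseTimer();

        // Stand-in for schema generation, e.g. the getSchema(clazz) call.
        String schema = timer.time("schema", () -> "schema-placeholder");

        // Stand-in for building the writer, e.g. the builder(...).build() chain.
        StringBuilder out = timer.time("open", StringBuilder::new);

        // Stand-in for the write loop; the whole loop is timed as one phase.
        timer.time("write", () -> {
            for (int i = 0; i < 100_000; i++) {
                out.append(schema).append(i);
            }
            return null;
        });

        // Stand-in for closing the writer.
        timer.time("close", () -> out.length());

        for (String phase : timer.phases()) {
            System.out.printf("%-6s %.3f ms%n", phase, timer.millis(phase));
        }
    }
}
```

Timing the whole loop as one phase, rather than each `write` call, keeps the overhead of `System.nanoTime()` out of the per-record cost.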

      Model used:

      @Data
      public class Employee {
          // private UUID id;
          private String name;
          private int age;
          private Address address;
      }

      @Data
      public class Address {
          private String streetName;
          private String city;
          private Zip zip;
      }

      @Data
      public class Zip {
          private int zip;
          private int ext;
      }

       

      private List<Employee> getInputDataToSerialize() {
          Address address = new Address();
          address.setStreetName("Murry Ridge Dr");
          address.setCity("Murrysville");
          Zip zip = new Zip();
          zip.setZip(15668);
          zip.setExt(1234);

          address.setZip(zip);

          List<Employee> employees = new ArrayList<>();

          IntStream.range(0, 100000).forEach(i -> {
              Employee employee = new Employee();
              // employee.setId(UUID.randomUUID());
              employee.setAge(20);
              employee.setName("Test" + i);
              employee.setAddress(address);
              employees.add(employee);
          });
          return employees;
      }

      Note:
      I have tried writing the data to the local file system as well as to AWS S3; both give the same result: very slow.

          People

            Assignee: Unassigned
            Reporter: Felix Kizhakkel Jose (FelixKJose)
            Votes: 0
            Watchers: 2