  1. Parquet
  2. PARQUET-1679

InvalidSchemaException for UUID while using AvroParquetWriter

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.10.1
    • Fix Version/s: None
    • Component/s: parquet-avro
    • Labels:
      None

      Description

      Hi,

      I am getting org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: optional group id {} when I include a UUID field in my POJO. Without the UUID field everything works fine. I have seen that Parquet supports UUID as part of PR-71 in the 2.4 release,
      but I am still getting InvalidSchemaException on UUID. Is there anything I am missing, or is this a known issue?

      My setup details:

      Gradle dependencies:

      dependencies {
          compile group: 'org.springframework.boot', name: 'spring-boot-starter'
          compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
          compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
          compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.1'
          compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
          compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
          compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
          compile group: 'joda-time', name: 'joda-time'
          compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
          compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
      }

      Model used:

      @Data
      public class Employee {
          private UUID id;
          private String name;
          private int age;
          private Address address;
      }

      @Data
      public class Address {
          private String streetName;
          private String city;
          private Zip zip;
      }

      @Data
      public class Zip {
          private int zip;
          private int ext;
      }

       

      My Serializer Code:

      public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {
          Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
          Class clazz = inputDataToSerialize.get(0).getClass();

          try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path)
                  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
                  .withDataModel(ReflectData.get())
                  .withConf(parquetConfiguration)
                  .withCompressionCodec(compressionCodecName)
                  .withWriteMode(OVERWRITE)
                  .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
                  .build()) {
              for (D input : inputDataToSerialize) {
                  writer.write(input);
              }
          }
      }

      private List<Employee> getInputDataToSerialize() {
          Address address = new Address();
          address.setStreetName("Murry Ridge Dr");
          address.setCity("Murrysville");

          Zip zip = new Zip();
          zip.setZip(15668);
          zip.setExt(1234);
          address.setZip(zip);

          List<Employee> employees = new ArrayList<>();
          IntStream.range(0, 100000).forEach(i -> {
              Employee employee = new Employee();
              // employee.setId(UUID.randomUUID());
              employee.setAge(20);
              employee.setName("Test" + i);
              employee.setAddress(address);
              employees.add(employee);
          });
          return employees;
      }

      Note: the generic type D is Employee.
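
One workaround that may avoid the error (an assumption, not something confirmed in this report): the exception message shows the id field being mapped to an empty group, so the POJO could carry the identifier as a String or as a 16-byte array instead of a java.util.UUID. The sketch below uses only the JDK; the class UuidCodec and its method names are hypothetical helpers, not part of Parquet or Avro.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

/** Hypothetical helper: converts a UUID to plain types that a POJO field can hold. */
final class UuidCodec {
    private UuidCodec() {}

    /** Canonical 36-character text form, suitable for a String field. */
    static String toText(UUID id) {
        return id.toString();
    }

    /** Parses the canonical text form back into a UUID. */
    static UUID fromText(String text) {
        return UUID.fromString(text);
    }

    /** 16 bytes in big-endian order: most significant 64 bits first. */
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    /** Rebuilds a UUID from the 16-byte form produced by toBytes. */
    static UUID fromBytes(byte[] bytes) {
        if (bytes.length != 16) {
            throw new IllegalArgumentException("expected 16 bytes, got " + bytes.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long mostSig = buf.getLong();
        long leastSig = buf.getLong();
        return new UUID(mostSig, leastSig);
    }
}
```

With this approach, Employee would declare its id as a String and be populated with something like employee.setId(UuidCodec.toText(UUID.randomUUID())). Whether this is acceptable depends on how downstream readers expect the column to be typed.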

            People

            • Assignee: Unassigned
            • Reporter: Felix Kizhakkel Jose (FelixKJose)