  1. Pig
  2. PIG-3911

Define unique fields with @OutputSchema


Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.11, 0.12.0, 0.11.1, 0.12.1, 0.13.0
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels: None

    Description

      Based on PIG-2361, I took the liberty of extending @OutputSchema so that more flexible output schemas can be defined through annotations. As a result, the repetitive EvalFunc#outputSchema() overrides can be eliminated from most UDFs.
      Examples:

      @OutputSchema("bytearray")
      

      => equivalent to:

      @Override
      public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(null, DataType.BYTEARRAY));
      }
      
      @OutputSchema("chararray")
      @Unique
      

      => equivalent to:

      @Override
      public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), DataType.CHARARRAY));
      }
      
      @OutputSchema(value = "dimensions:bag", useInputSchema = true)
      

      => equivalent to:

      @Override
      public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema("dimensions", input, DataType.BAG));
      }
      
      @OutputSchema(value = "${0}:bag", useInputSchema = true)
      @Unique("${0}")
      

      => equivalent to:

      @Override
      public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), input, DataType.BAG));
      }
      

      If the useInputSchema attribute is set, the input schema is applied to the output schema, provided that:

      • the output schema is "simple", i.e. [name][:type] or '()', '{}', '[]', and
      • it has a complex field type (tuple, bag, map)
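
The two conditions above can be read as a check on the schema string. The following is a minimal sketch of one possible interpretation, not the code from the patch: the class `InputSchemaRule`, the method `appliesInputSchema`, and the exact pattern (a single optional name or `${i}` placeholder, followed by one of the complex type keywords) are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig or of the PIG-3911 patch.
final class InputSchemaRule {
    // Assumed reading of "[name][:type]" restricted to complex types:
    // an optional "name:" prefix (plain identifier or ${i} placeholder)
    // followed by exactly one of tuple, bag, map.
    private static final Pattern SIMPLE_COMPLEX = Pattern.compile(
            "(?:(?:\\$\\{\\d+\\}|[A-Za-z_]\\w*)\\s*:\\s*)?(?:tuple|bag|map)",
            Pattern.CASE_INSENSITIVE);

    /** Returns true if useInputSchema would take effect under the assumed rule. */
    static boolean appliesInputSchema(String outputSchema) {
        String s = outputSchema.trim();
        // The bare bracket forms are listed explicitly in the description.
        if (s.equals("()") || s.equals("{}") || s.equals("[]")) {
            return true;
        }
        // Anything with several fields or a non-complex type fails the match.
        return SIMPLE_COMPLEX.matcher(s).matches();
    }
}
```

Under this reading, "dimensions:bag" and "${0}:bag" qualify, while "chararray" or a multi-field schema such as "a:int, b:bag" do not.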

      @Unique: this annotation defines which fields should be unique in the schema

      • if no parameters are provided, all fields will be unique
      • otherwise it takes a string array of field names
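
To illustrate the two cases, here is a self-contained sketch that declares a stand-in `Unique` annotation and reads it via reflection. The annotation declared here is an assumption modeled on the description (a `String[] value()` that defaults to empty); the actual declaration in the patch may differ. `UniqueReader`, `AllFieldsUnique`, `SomeFieldsUnique`, and `NotAnnotated` are hypothetical names.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.Arrays;

// Stand-in for the proposed annotation (assumed shape, not copied from the patch).
@Retention(RetentionPolicy.RUNTIME)
@interface Unique {
    String[] value() default {};
}

@Unique                        // no parameters: every field is unique
class AllFieldsUnique {}

@Unique({"${0}", "total"})     // only the listed fields are unique
class SomeFieldsUnique {}

class NotAnnotated {}

// Hypothetical helper that applies the two rules from the description.
final class UniqueReader {
    static boolean isUnique(Class<?> udfClass, String fieldName) {
        Unique unique = udfClass.getAnnotation(Unique.class);
        if (unique == null) {
            return false;              // annotation absent: nothing is made unique
        }
        if (unique.value().length == 0) {
            return true;               // empty array: all fields are unique
        }
        return Arrays.asList(unique.value()).contains(fieldName);
    }
}
```

A single-element form such as `@Unique("${0}")` also compiles against a `String[]` attribute, which matches the usage shown in the examples.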

      Unique field generation:
      A unique field name is generated in the same manner as in EvalFunc#getSchemaName.

      • if the field has an alias:
        • and the alias is a placeholder (${i}, i=0..n): fieldName -> com_myfunc_[input_alias]_[nextSchemaId]
        • otherwise: fieldName -> fieldName_[nextSchemaId]
      • otherwise: com_myfunc_[input_alias]_[nextSchemaId]
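
The naming rules above can be sketched as a single function. This is an illustration under stated assumptions rather than the patch's implementation: `UniqueNames.uniqueName` is a hypothetical helper, the schema id is passed in as a parameter instead of being drawn from a counter, dots in the class name are replaced with underscores to match the `com_myfunc` form shown above, and the input alias segment is simply omitted when no input alias is available.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig or of the PIG-3911 patch.
final class UniqueNames {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{\\d+\\}");

    static String uniqueName(String fieldAlias, String udfClassName,
                             String inputAlias, int nextSchemaId) {
        boolean hasAlias = fieldAlias != null && !fieldAlias.isEmpty();
        boolean isPlaceholder = hasAlias && PLACEHOLDER.matcher(fieldAlias).matches();

        // Named, non-placeholder field: fieldName_[nextSchemaId]
        if (hasAlias && !isPlaceholder) {
            return fieldAlias + "_" + nextSchemaId;
        }

        // Placeholder or missing alias: com_myfunc_[input_alias]_[nextSchemaId]
        String prefix = udfClassName.toLowerCase(Locale.ROOT).replace('.', '_');
        String middle = (inputAlias == null || inputAlias.isEmpty()) ? "" : inputAlias + "_";
        return prefix + "_" + middle + nextSchemaId;
    }
}
```

For example, with a UDF class named `com.MyFunc`, an input alias `a`, and schema id 7, the field `dimensions` would become `dimensions_7`, while `${0}` or an unnamed field would become `com_myfunc_a_7`.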

      Supported scripting UDFs: Python, Jython, Groovy, JRuby

      Attachments

        1. PIG-3911.patch
          387 kB
          Lorand Bendig

            People

              Assignee: lbendig Lorand Bendig
              Reporter: lbendig Lorand Bendig
              Votes: 0
              Watchers: 3
