Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-3911

Define unique fields with @OutputSchema

    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.11, 0.12.0, 0.11.1, 0.12.1, 0.13.0
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels:
      None

      Description

      Based on PIG-2361, I took the liberty of extending @Outputschema so that more flexible output schema can be defined through annotations. As a result, the repeating patterns of EvalFunc#outputSchema() can be eliminated from most of the UDFs.
      Examples:

      @OutputSchema("bytearray")
      

      => equivalent to:

      @Override
      public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(null, DataType.BYTEARRAY));
      }
      
      @OutputSchema("chararray")
      @Unique
      

      => equivalent to:

      @Override
      public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), DataType.CHARARRAY));
      }
      
      @OutputSchema(value = "dimensions:bag", useInputSchema = true)
      

      => equivalent to:

      @Override
      public Schema outputSchema(Schema input) {
        return new Schema(new FieldSchema("dimensions", input, DataType.BAG));
      }
      
      @OutputSchema(value = "${0}:bag", useInputSchema = true)
      @Unique("${0}")
      

      => equivalent to:

      @Override
      public Schema outputSchema(Schema input) {
          return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), input, DataType.BAG));
      }
      

      If useInputSchema attribute is set then input schema will be applied to the output schema, provided that:

      • outputschema is "simple", i.e: [name][:type] or '()', '{}', '[]' and
      • it has complex field type (tuple, bag, map)

      @Unique : this annotation defines which fields should be unique in the schema

      • if no parameters are provided, all fields will be unique
      • otherwise it takes a string array of fields name

      Unique field generation:
      A unique field is generated in the same manner that EvalFunc#getSchemaName does.

      • if field has an alias:
        • it's a placeholder (${i}, i=0..n) : fieldName -> com_myfunc_[input_alias]_[nextSchemaId]
        • otherwise: fieldName -> fieldName_[nextSchemaId]
      • otherwise: com_myfunc_[input_alias]_[nextSchemaId]

      Supported scripting UDFs: Python, Jython, Groovy, JRuby

        Attachments

        1. PIG-3911.patch
          387 kB
          Lorand Bendig

          Issue Links

            Activity

              People

              • Assignee:
                lbendig Lorand Bendig
                Reporter:
                lbendig Lorand Bendig
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated: