Pig
  1. Pig
  2. PIG-2632

Create a SchemaTuple which generates efficient Tuples via code gen

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Component/s: None
    • Labels:
      None

      Description

      This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge.

      Need to clean up the code and add tests.

      Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge.

      1. PIG-2632-10.patch
        319 kB
        Jonathan Coveney
      2. PIG-2632-10.patch
        318 kB
        Jonathan Coveney
      3. PIG-2632-9.patch
        284 kB
        Jonathan Coveney
      4. PIG-2632-9.patch
        67 kB
        Jonathan Coveney
      5. schematuple benchmarking.pdf
        70 kB
        Jonathan Coveney
      6. PIG-2632-8.patch
        242 kB
        Jonathan Coveney
      7. PIG-2632-7.patch
        236 kB
        Jonathan Coveney
      8. PIG-2632-6.patch
        232 kB
        Jonathan Coveney
      9. PIG-2632-5.patch
        222 kB
        Jonathan Coveney
      10. PIG-2632-4.patch
        219 kB
        Jonathan Coveney
      11. schematuple benchmarking.pptx
        104 kB
        Jonathan Coveney
      12. PIG-2632-3.patch
        100 kB
        Jonathan Coveney
      13. PIG-2632-1.patch
        69 kB
        Jonathan Coveney
      14. PIG-2632-0.patch
        55 kB
        Jonathan Coveney

        Issue Links

          Activity

          Hide
          Jonathan Coveney added a comment -

          Oh, I should add that it currently works in both local and M/R mode It generates the class code, compiles it, adds it to the job jar, and then deserializes accordingly.

          Show
          Jonathan Coveney added a comment - Oh, I should add that it currently works in both local and M/R mode It generates the class code, compiles it, adds it to the job jar, and then deserializes accordingly.
          Hide
          Julien Le Dem added a comment -

          Great work Jonathan! Could you post it to reviewboard? Thanks

          Show
          Julien Le Dem added a comment - Great work Jonathan! Could you post it to reviewboard? Thanks
          Hide
          Jonathan Coveney added a comment -

          Here is an update of the patch...the big change being boolean support (it's easy to forget that that is an official data type now!).

          I also made it so that in an EvalFunc, if it is generatable, the Tuple you are given is a SchemaTuple. The potential benefit for replicated joins and whatnot is huge! Will push to reviewboard in a moment. It needs tests and comments seriously, but I've been waiting until it settles on a general form...

          Show
          Jonathan Coveney added a comment - Here is an update of the patch...the big change being boolean support (it's easy to forget that that is an official data type now!). I also made it so that in an EvalFunc, if it is generatable, the Tuple you are given is a SchemaTuple . The potential benefit for replicated joins and whatnot is huge! Will push to reviewboard in a moment. It needs tests and comments seriously, but I've been waiting until it settles on a general form...
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4651/
          -----------------------------------------------------------

          Review request for Julien Le Dem.

          Summary
          -------

          This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge.

          Need to clean up the code and add tests.

          Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge.

          Needs tests and comments, but I want the code to settle a bit.

          This addresses bug PIG-2632.
          https://issues.apache.org/jira/browse/PIG-2632

          Diffs


          trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628
          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1309628
          trunk/src/org/apache/pig/data/BinInterSedes.java 1309628
          trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628
          trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION
          trunk/src/org/apache/pig/data/TupleFactory.java 1309628
          trunk/src/org/apache/pig/impl/PigContext.java 1309628
          trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1309628

          Diff: https://reviews.apache.org/r/4651/diff

          Testing
          -------

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/ ----------------------------------------------------------- Review request for Julien Le Dem. Summary ------- This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge. Need to clean up the code and add tests. Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge. Needs tests and comments, but I want the code to settle a bit. This addresses bug PIG-2632 . https://issues.apache.org/jira/browse/PIG-2632 Diffs trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1309628 trunk/src/org/apache/pig/data/BinInterSedes.java 1309628 trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628 trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION trunk/src/org/apache/pig/data/TupleFactory.java 1309628 trunk/src/org/apache/pig/impl/PigContext.java 1309628 trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1309628 Diff: https://reviews.apache.org/r/4651/diff Testing ------- Thanks, Jonathan
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4651/#review6722
          -----------------------------------------------------------

          This is an incredible piece of work!
          See my comments bellow

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
          <https://reviews.apache.org/r/4651/#comment14556>

          it would be nice if Pig called outputSchema() once during the init and that was saved somewhere.

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
          <https://reviews.apache.org/r/4651/#comment14557>

          So the side effect is that the Tuple class get generated here ? This should be encapsulated in a more obvious method here.
          like: generateTupleForSchema(tmpS) and not necessarily create an instance of a tuple.
          also that would remove duplication here.

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
          <https://reviews.apache.org/r/4651/#comment14558>

          those are the inputSchema and the outputSchema. The fields should probably called just that and have boolean flag to know if we use generated Tuples?

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
          <https://reviews.apache.org/r/4651/#comment14559>

          I would make it:
          TupleFactory.geInstanceForSchema().newTuple()

          trunk/src/org/apache/pig/data/BinInterSedes.java
          <https://reviews.apache.org/r/4651/#comment14560>

          should SchemaTuple extend TypeAwareTuple ?

          trunk/src/org/apache/pig/data/PrimitiveTuple.java
          <https://reviews.apache.org/r/4651/#comment14564>

          BytesKey ?

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14573>

          statics on top

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14574>

          largestSetValue should already be correct. Why does largestSetValue stays at the size of the schema?

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14575>

          isAppendNull()
          Should the append...() methods be public?

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14576>

          the result would be the same, right?

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14580>

          setAppend()

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14578>

          http://www.ibm.com/developerworks/java/library/j-codetoheap/index.html

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14631>

          if it is always overridden, it should be abstract

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14579>

          getAppendType()

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14584>

          set? stream?

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14583>

          getSizeNoAppend() ?

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14585>

          redundant with the 2 others?
          idClassMap.get(schemaIdMap.get(schema))

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14586>

          could be computed statically by the generated class if you make a static getSchema(schemaString).

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14587>

          setAll() ?

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14588>

          setAll() ?

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14589>

          do we need all of set(Tuple), set(List), setEqualTo(List), setEqualTo(Tuple) ?

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14591>

          you could add a properties file to the jar with this info. Or a generated factory that knows all the classes generated. You mainly need the mapping schema -> id

          you could also use true. Although that would not be THE answer to THE question.

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14592>

          As we have schema aware Tuples, in the future we could add get(String fieldName). We may not want to strip the aliases to reuse the class.

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14593>

          looking forward to support for those

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14596>

          there's only one code path where updateMaps is true. It could be simplified a bit.

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14595>

          Logger

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14597>

          they should not be public

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14598>

          refactor DataByteArray to provide this?

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14599>

          Same comment, the String serialization should be shared with the regular Tuple.

          Also we could keep the a UTF8 in the SchemaTuple to reduce memory usage and convert lazilyto String (possibly caching the result)

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14600>

          same comment as before.
          The same logic should be use with regular tuples.
          The same logic should be shared across types.

          It could also use a variable size int format.
          https://developers.google.com/protocol-buffers/docs/encoding#types

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14601>

          is (null, null, null) the same as null ?

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14609>

          this.set(t) ?

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14610>

          DefaultTuple replaces null with ""

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14611>

          We could have a list implementation that just wraps a tuple.

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14612>

          yep, sounds like it could be generated and fall back to this if the other is a regular tuple

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14614>

          you may want to log compilation warnings/errors

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14613>

          Class.forName(className) uses the classLoader of the current class.

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14615>

          if it does not need to be public then yes

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14616>

          same comment as before about stripped aliases

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14617>

          the parameter t is thrown away ?

          they don't need to be public.

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14618>

          depending on how those are used, java already does auto un-boxing.
          The following works so you may not need them:
          long unbox(Long v)

          { return v; }

          they don't need to be public

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14620>

          same comment, the following works:
          Float box(float v)

          { return v; }

          they don't need to be public

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14622>

          this file has 1500 lines
          I think those classes have earned their own package.

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14625>

          this could be a method in the parent class

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14626>

          this could be a method in the parent class

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14635>

          yes, an array is an object, so reference + sizeof(array)

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14638>

          CheckIfNullString.masks
          maybe they should be in a Masks class then, along with the masking code generation (" & (byte)+...)

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14643>

          This feels weird.
          what about:
          "SchemaTuple.unbox(("getBoxedTypeName()")appendGet(diff))"

          getBoxedTypeName() {
          switch (type) {
          case (DataType.INTEGER): return "Integer";
          ...

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14621>

          use http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/StringBuilder.html for all those string concatenations.

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment14619>

          what does isObjNotTup mean? What is this case?

          • Julien

          On 2012-04-05 01:31:00, Jonathan Coveney wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/4651/

          -----------------------------------------------------------

          (Updated 2012-04-05 01:31:00)

          Review request for Julien Le Dem.

          Summary

          -------

          This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge.

          Need to clean up the code and add tests.

          Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge.

          Needs tests and comments, but I want the code to settle a bit.

          This addresses bug PIG-2632.

          https://issues.apache.org/jira/browse/PIG-2632

          Diffs

          -----

          trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1309628

          trunk/src/org/apache/pig/data/BinInterSedes.java 1309628

          trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628

          trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION

          trunk/src/org/apache/pig/data/TupleFactory.java 1309628

          trunk/src/org/apache/pig/impl/PigContext.java 1309628

          trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1309628

          Diff: https://reviews.apache.org/r/4651/diff

          Testing

          -------

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/#review6722 ----------------------------------------------------------- This is an incredible piece of work! See my comments bellow trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java < https://reviews.apache.org/r/4651/#comment14556 > it would be nice if Pig called outputSchema() once during the init and that was saved somewhere. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java < https://reviews.apache.org/r/4651/#comment14557 > So the side effect is that the Tuple class get generated here ? This should be encapsulated in a more obvious method here. like: generateTupleForSchema(tmpS) and not necessarily create an instance of a tuple. also that would remove duplication here. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java < https://reviews.apache.org/r/4651/#comment14558 > those are the inputSchema and the outputSchema. The fields should probably called just that and have boolean flag to know if we use generated Tuples? trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java < https://reviews.apache.org/r/4651/#comment14559 > I would make it: TupleFactory.geInstanceForSchema().newTuple() trunk/src/org/apache/pig/data/BinInterSedes.java < https://reviews.apache.org/r/4651/#comment14560 > should SchemaTuple extend TypeAwareTuple ? trunk/src/org/apache/pig/data/PrimitiveTuple.java < https://reviews.apache.org/r/4651/#comment14564 > BytesKey ? trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14573 > statics on top trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14574 > largestSetValue should already be correct. Why does largestSetValue stays at the size of the schema? trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14575 > isAppendNull() Should the append...() methods be public? trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14576 > the result would be the same, right? trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14580 > setAppend() trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14578 > http://www.ibm.com/developerworks/java/library/j-codetoheap/index.html trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14631 > if it is always overridden, it should be abstract trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14579 > getAppendType() trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14584 > set? stream? trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14583 > getSizeNoAppend() ? trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14585 > redundant with the 2 others? idClassMap.get(schemaIdMap.get(schema)) trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14586 > could be computed statically by the generated class if you make a static getSchema(schemaString). trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14587 > setAll() ? trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14588 > setAll() ? trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14589 > do we need all of set(Tuple), set(List), setEqualTo(List), setEqualTo(Tuple) ? trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14591 > you could add a properties file to the jar with this info. Or a generated factory that knows all the classes generated. You mainly need the mapping schema -> id you could also use true. Although that would not be THE answer to THE question. trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14592 > As we have schema aware Tuples, in the future we could add get(String fieldName). We may not want to strip the aliases to reuse the class. trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14593 > looking forward to support for those trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14596 > there's only one code path where updateMaps is true. It could be simplified a bit. trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14595 > Logger trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14597 > they should not be public trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14598 > refactor DataByteArray to provide this? trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14599 > Same comment, the String serialization should be shared with the regular Tuple. Also we could keep the a UTF8 in the SchemaTuple to reduce memory usage and convert lazilyto String (possibly caching the result) trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14600 > same comment as before. The same logic should be use with regular tuples. The same logic should be shared across types. It could also use a variable size int format. https://developers.google.com/protocol-buffers/docs/encoding#types trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14601 > is (null, null, null) the same as null ? trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14609 > this.set(t) ? trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14610 > DefaultTuple replaces null with "" trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14611 > We could have a list implementation that just wraps a tuple. trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14612 > yep, sounds like it could be generated and fall back to this if the other is a regular tuple trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14614 > you may want to log compilation warnings/errors trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14613 > Class.forName(className) uses the classLoader of the current class. trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14615 > if it does not need to be public then yes trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14616 > same comment as before about stripped aliases trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14617 > the parameter t is thrown away ? they don't need to be public. trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14618 > depending on how those are used, java already does auto un-boxing. The following works so you may not need them: long unbox(Long v) { return v; } they don't need to be public trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14620 > same comment, the following works: Float box(float v) { return v; } they don't need to be public trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14622 > this file has 1500 lines I think those classes have earned their own package. trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14625 > this could be a method in the parent class trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14626 > this could be a method in the parent class trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14635 > yes, an array is an object, so reference + sizeof(array) trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14638 > CheckIfNullString.masks maybe they should be in a Masks class then, along with the masking code generation (" & (byte)+...) trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14643 > This feels weird. what about: "SchemaTuple.unbox((" getBoxedTypeName() ")appendGet(diff))" getBoxedTypeName() { switch (type) { case (DataType.INTEGER): return "Integer"; ... trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14621 > use http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/StringBuilder.html for all those string concatenations. trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment14619 > what does isObjNotTup mean? What is this case? Julien On 2012-04-05 01:31:00, Jonathan Coveney wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/ ----------------------------------------------------------- (Updated 2012-04-05 01:31:00) Review request for Julien Le Dem. Summary ------- This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge. Need to clean up the code and add tests. Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge. Needs tests and comments, but I want the code to settle a bit. This addresses bug PIG-2632 . https://issues.apache.org/jira/browse/PIG-2632 Diffs ----- trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1309628 trunk/src/org/apache/pig/data/BinInterSedes.java 1309628 trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628 trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION trunk/src/org/apache/pig/data/TupleFactory.java 1309628 trunk/src/org/apache/pig/impl/PigContext.java 1309628 trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1309628 Diff: https://reviews.apache.org/r/4651/diff Testing ------- Thanks, Jonathan
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, line 133

          > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line133>

          >

          > it would be nice if Pig called outputSchema() once during the init and that was saved somewhere.

          It calls it many times, alas. I feel like changing that is a job for another patch, but I could tackle that here. But given there is already the expectation that it will be called an arbitrary number of times, I don't think that calling it once more on the front end will be an issue?

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, lines 136-145

          > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line136>

          >

          > So the side effect is that the Tuple class get generated here ? This should be encapsulated in a more obvious method here.

          > like: generateTupleForSchema(tmpS) and not necessarily create an instance of a tuple.

          > also that would remove duplication here.

          I was trying to leverage Dmitriy's newTupleForSchema, but I think you're right.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, lines 151-152

          > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line151>

          >

          > those are the inputSchema and the outputSchema. The fields should probably called just that and have boolean flag to know if we use generated Tuples?

          Well, we need to know what the tuple to generate is. I will change it save the ID that serves as the base, but as is the Schema info isn't saved and it isn't available.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, line 188

          > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line188>

          >

          > I would make it:

          > TupleFactory.geInstanceForSchema().newTuple()

          I think it's worth thinking about how we want to do this. Dmitriy implemented newTupleForSchema for the PrimitiveTuple work, and he suggested I use that for this... however I agree, I think there should be a "SchemaTupleFactory" that generates SchemaTuples of a given SchemaTuple. At the same time, that might be a lot of scaffolding for not a lot of gain: once you generate the code for a given SchemaTuple, there's not a lot of work to be done (though this could encapsulate some of the static maps that come later). One other factor is that the TupleFactory interface doesn't really lend itself very nicely to the SchemaTuple, but I do think it could be extended and be made to work. Would appreciate more of your thoughts on this.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/BinInterSedes.java, lines 601-602

          > <https://reviews.apache.org/r/4651/diff/1/?file=100076#file100076line601>

          >

          > should SchemaTuple extend TypeAwareTuple ?

          Agreed, and my working version has that update.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/PrimitiveTuple.java, line 213

          > <https://reviews.apache.org/r/4651/diff/1/?file=100077#file100077line213>

          >

          > BytesKey ?

          good call

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 46-48

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line46>

          >

          > statics on top

          I really need to clean up where everything goes, noted.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 70

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line70>

          >

          > largestSetValue should already be correct. Why does largestSetValue stays at the size of the schema?

          I guess once it is at the value of the SchemaTuple (excluding appends), there's no need to update it.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 77

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line77>

          >

          > isAppendNull()

          > Should the append...() methods be public?

          I think a lot of these methods should be protected, agreed

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 81

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line81>

          >

          > the result would be the same, right?

          yeah, you're right

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 94

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line94>

          >

          > setAppend()

          yeah need to rename all of these. naming is what I am worst at :S

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 105

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line105>

          >

          > if it is always overridden, it should be abstract

          It's overriden, but the overriden version calls super. Right now it returns 0 because I don't want to take the time to calculate it out while the class is still evolving.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 119

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line119>

          >

          > set? stream?

          This originally meant "Schema Tuple Element Write" but yeah, it should be more descriptive.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 125

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line125>

          >

          > redundant with the 2 others?

          > idClassMap.get(schemaIdMap.get(schema))

          Perhaps this is what a SchemaTupleFactory would make cleaner, but my desire is for Tuple creation to be as fast as possible. This is why there are 5 maps... my thought is that they are all static, so the extra cost of caching the extra info is worth avoiding two gets in a row to link the info together when we need to make Tuples. Am open to thoughts.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 130-142

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line130>

          >

          > could be computed statically by the generated class if you make a static getSchema(schemaString).

          Well, my thought is that a static cache wouldn't take up much memory, and it would avoid having to recreate the Schema object every time getSchema() is called. I don't forsee it being called a ton, but possibly enough that recreating the object a bunch would be wasteful. Perhaps this is premature optimization?

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 144

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line144>

          >

          > setAll() ?

          It's not really a setAll imho. It's setting it equal to. Maybe that's what setAll means, but it seems to convey something different.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 159

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line159>

          >

          > do we need all of set(Tuple), set(List), setEqualTo(List), setEqualTo(Tuple) ?

          My working version adds even more. I think a lot of it should be made protected, or at least, I should be more thoughtful about what it should be. I think "set" is probably enough, but perhaps it should just be the void version? I guess this is where working with ruby where everything generally returns "this" by default, thus allowing for nice chaining of methods, is the norm.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 191

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line191>

          >

          > you could add a properties file to the jar with this info. Or a generated factory that knows all the classes generated. You mainly need the mapping schema -> id

          >

          > you could also use true. Although that would not be THE answer to THE question.

          I think the jar properties is the way to go, at least as the ultimate source of the classes that were generated. Alternately, it'd be nice if there was an easy way to browse a jar for certain class files, but I couldn't find a clean one? If there was, we could just search for SchemaTuple_*.class and keep track of which files went across. Do you know if there is an easier way to search jars like that? I can dig around as well.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 209

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line209>

          >

          > As we have schema aware Tuples, in the future we could add get(String fieldName). We may not want to strip the aliases to reuse the class.

          Hmm, it seems pretty silly that two Schemas that are essentially "int,int,long" except with different names would necessitate a different class. More importantly, however, I ran into cases where schema names were different due to the way that names were generated by different pieces of code, and so the lookup was thwarted if you were strict about names. Hmm hmm hmm. Do you think the win from get(String) is worth the potential complication? It could be convenient...

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 270-273

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line270>

          >

          > Logger

          woops, I usually mark these //remove and whack them before the patch goes out...

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 268

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line268>

          >

          > there's only one code path where updateMaps is true. It could be simplified a bit.

          agreed

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 310-321

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line310>

          >

          > refactor DataByteArray to provide this?

          So here is the thing about this. Currently, a ton of the useful methods in BinInterSedes (where this logic was taken from) is private. Also, they aren't static (though there's no real reason why they shouldn't be?) This snippet (and other logic like it) should probably be made into a protected static method of BinInterSedes. Thoughts? It could also make sense for classes to have a static method to serialize themselves, but I'd argue that the BinInterSedes approach is probably more consistent with how pig is laid out (though I have no idea why most of the methods there are private instead of protected static).

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 325-333

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line325>

          >

          > Same comment, the String serialization should be shared with the regular Tuple.

          >

          > Also we could keep the a UTF8 in the SchemaTuple to reduce memory usage and convert lazilyto String (possibly caching the result)

          This logic is stolen directly from BinInterSedes (see comment above). Can you flesh out what you mean about UTF8? If we change it here we should probably change it there as well (and make that the canonical protected static source of all your string serialization needs).

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 344-353

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line344>

          >

          > same comment as before.

          > The same logic should be use with regular tuples.

          > The same logic should be shared across types.

          >

          > It could also use a variable size int format.

          > https://developers.google.com/protocol-buffers/docs/encoding#types

          varints is a good idea. general question of where this logic should live still holds as well, as this logic is also directly ripped from BinInterSedes.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 452

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line452>

          >

          > is (null, null, null) the same as null ?

          In DefaultTuple, this is deprecated and not really used. Given that, I use this to basically mean "has any information been written to this." So in your case, both would be null. The more important question is to make sure I use it consistently throughout. Honestly, the whole null thing is a pain, and I need to really comb over things to make sure I incorporated it properly.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 474

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line474>

          >

          > this.set(t) ?

          reference is this weird Franken-method that isn't used anywhere in the codebase. I don't know that I want to implement it when the semantics of what it should do don't seem clear at all. Open to thoughts on this though. The original intent of it doesn't make much sense for a SchemaTuple since the underlying structures are primitives, not objects.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 479

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line479>

          >

          > DefaultTuple replaces null with ""

          Good idea

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 484

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line484>

          >

          > We could have a list implementation that just wraps a tuple.

          Hmm, I think this is an intriguing idea, though it might get sills, because that means you'd have a list which wraps a tuple which wraps a list.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 495

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line495>

          >

          > yep, sounds like it could be generated and fall back to this if the other is a regular tuple

          My working copy almost has this finished. there are three comparison functions: compareTo(Object), compareTo(SchemaTuple), and compareToSpecific(SchemaTuple_). It will check (unless you tell it not to) if you were actually given a SchemaTuple_X (or just a schemaTuple) and do progressively faster comparisons. I also need to generate a RawComparator but I'd rather clean everything else up first (especially the layout of the API and where pieces live).

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 535

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line535>

          >

          > you may want to log compilation warnings/errors

          Is there a need to log it if the runtime exception is going to be thrown anyway? Because those errors get written to stderr. But I suppose I could pipe them through the Logger, if that's what you mean?

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 545

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line545>

          >

          > Class.forName(className) uses the classLoader of the current class.

          nice!

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 585

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line585>

          >

          > if it does not need to be public then yes

          it depends heavily on how we end up forcing people to make these things.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 609-635

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line609>

          >

          > the parameter t is thrown away ?

          >

          > they don't need to be public.

          This is more for conciseness in the code that generates the SchemaTuple_ code. It allows me to make a bunch of if then elses smaller. I suppose there are other ways to do it that don't involve a method call... I go back and forth on that. But the reason is so later on I can do:
          add(" setPos_X(SchemaTuple.unbox(val, pos_X));");
          which means that the right one will be based on the type of the field being filled (ie saving me a bunch of lines, but perhaps that should be trumped by cleanliness of the output code/API).

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 637-659

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line637>

          >

          > depending on how those are used, java already does auto un-boxing.

          > The following works so you may not need them:

          > long unbox(Long v) { bq. > return v; bq. > }

          >

          > they don't need to be public

          Good point. I hate auto-boxing/un-boxing as a language feature, but if it can be utilized to clean things up it is worth it.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 667-685

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line667>

          >

          > same comment, the following works:

          > Float box(float v) { bq. > return v; bq. > }

          >

          > they don't need to be public

          Yeah same as above as well. I think I can actually also leverage auto boxing/unboxing deeper in the code to remove some of the need for these functions.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 698

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line698>

          >

          > this file has 1500 lines

          > I think those classes have earned their own package.

          Yeah, I mean, you're right... but the logic is EXTREMELY narrow, which is why I don't like the idea of splitting it out. Yeah, it makes the file larger, but it makes 0 sense outside of the context of what is going on here. Perhaps that should be rectified?

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 857-860

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line857>

          >

          > this could be a method in the parent class

          do you mean in SchemaTuple, or in TypeInFunctionStringOut?

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1138

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1138>

          >

          > CheckIfNullString.masks

          > maybe they should be in a Masks class then, along with the masking code generation (" & (byte)+...)

          I don't know how worthwhile it is to split all of this stuff out into classes. Maybe it is. A lot of stuff is used once or twice in disparate places, and I fear that splitting it out more and more will just make people jump through 50 classes to understand what a line of code gen is doing. On the other hand, it would make fixing bugs much cleaner. What probably needs to be done is some sort of cleaner code generation framework... would appreciate more thoughts here.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1283

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1283>

          >

          > This feels weird.

          > what about:

          > "SchemaTuple.unbox(("getBoxedTypeName()")appendGet(diff))"

          >

          > getBoxedTypeName() {

          > switch (type) {

          > case (DataType.INTEGER): return "Integer";

          > ...

          I actually was able to replace it with:
          add(" case ("fieldNum"): return "fs.type";");
          sometimes I miss the forest for the trees.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1413

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1413>

          >

          > use http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/StringBuilder.html for all those string concatenations.

          yeahh this is a good idea. I doubt the code gen takes up much time at all but given the gen'd files are potentially large, it's probably good practice

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 1522-1524

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1522>

          >

          > what does isObjNotTup mean? What is this case?

          it was an obscure case that was obsolete. I removed it

          • Jonathan

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4651/#review6722
          -----------------------------------------------------------

          On 2012-04-05 01:31:00, Jonathan Coveney wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/4651/

          -----------------------------------------------------------

          (Updated 2012-04-05 01:31:00)

          Review request for Julien Le Dem.

          Summary

          -------

          This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge.

          Need to clean up the code and add tests.

          Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge.

          Needs tests and comments, but I want the code to settle a bit.

          This addresses bug PIG-2632.

          https://issues.apache.org/jira/browse/PIG-2632

          Diffs

          -----

          trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1309628

          trunk/src/org/apache/pig/data/BinInterSedes.java 1309628

          trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628

          trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION

          trunk/src/org/apache/pig/data/TupleFactory.java 1309628

          trunk/src/org/apache/pig/impl/PigContext.java 1309628

          trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1309628

          Diff: https://reviews.apache.org/r/4651/diff

          Testing

          -------

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, line 133 > < https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line133 > > > it would be nice if Pig called outputSchema() once during the init and that was saved somewhere. It calls it many times, alas. I feel like changing that is a job for another patch, but I could tackle that here. But given there is already the expectation that it will be called an arbitrary number of times, I don't think that calling it once more on the front end will be an issue? On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, lines 136-145 > < https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line136 > > > So the side effect is that the Tuple class get generated here ? This should be encapsulated in a more obvious method here. > like: generateTupleForSchema(tmpS) and not necessarily create an instance of a tuple. > also that would remove duplication here. I was trying to leverage Dmitriy's newTupleForSchema, but I think you're right. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, lines 151-152 > < https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line151 > > > those are the inputSchema and the outputSchema. The fields should probably called just that and have boolean flag to know if we use generated Tuples? Well, we need to know what the tuple to generate is. I will change it save the ID that serves as the base, but as is the Schema info isn't saved and it isn't available. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, line 188 > < https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line188 > > > I would make it: > TupleFactory.geInstanceForSchema().newTuple() I think it's worth thinking about how we want to do this. Dmitriy implemented newTupleForSchema for the PrimitiveTuple work, and he suggested I use that for this... however I agree, I think there should be a "SchemaTupleFactory" that generates SchemaTuples of a given SchemaTuple. At the same time, that might be a lot of scaffolding for not a lot of gain: once you generate the code for a given SchemaTuple, there's not a lot of work to be done (though this could encapsulate some of the static maps that come later). One other factor is that the TupleFactory interface doesn't really lend itself very nicely to the SchemaTuple, but I do think it could be extended and be made to work. Would appreciate more of your thoughts on this. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/BinInterSedes.java, lines 601-602 > < https://reviews.apache.org/r/4651/diff/1/?file=100076#file100076line601 > > > should SchemaTuple extend TypeAwareTuple ? Agreed, and my working version has that update. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/PrimitiveTuple.java, line 213 > < https://reviews.apache.org/r/4651/diff/1/?file=100077#file100077line213 > > > BytesKey ? good call On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 46-48 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line46 > > > statics on top I really need to clean up where everything goes, noted. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 70 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line70 > > > largestSetValue should already be correct. Why does largestSetValue stays at the size of the schema? I guess once it is at the value of the SchemaTuple (excluding appends), there's no need to update it. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 77 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line77 > > > isAppendNull() > Should the append...() methods be public? I think a lot of these methods should be protected, agreed On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 81 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line81 > > > the result would be the same, right? yeah, you're right On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 94 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line94 > > > setAppend() yeah need to rename all of these. naming is what I am worst at :S On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 105 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line105 > > > if it is always overridden, it should be abstract It's overriden, but the overriden version calls super. Right now it returns 0 because I don't want to take the time to calculate it out while the class is still evolving. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 119 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line119 > > > set? stream? This originally meant "Schema Tuple Element Write" but yeah, it should be more descriptive. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 125 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line125 > > > redundant with the 2 others? > idClassMap.get(schemaIdMap.get(schema)) Perhaps this is what a SchemaTupleFactory would make cleaner, but my desire is for Tuple creation to be as fast as possible. This is why there are 5 maps... my thought is that they are all static, so the extra cost of caching the extra info is worth avoiding two gets in a row to link the info together when we need to make Tuples. Am open to thoughts. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 130-142 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line130 > > > could be computed statically by the generated class if you make a static getSchema(schemaString). Well, my thought is that a static cache wouldn't take up much memory, and it would avoid having to recreate the Schema object every time getSchema() is called. I don't forsee it being called a ton, but possibly enough that recreating the object a bunch would be wasteful. Perhaps this is premature optimization? On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 144 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line144 > > > setAll() ? It's not really a setAll imho. It's setting it equal to. Maybe that's what setAll means, but it seems to convey something different. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 159 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line159 > > > do we need all of set(Tuple), set(List), setEqualTo(List), setEqualTo(Tuple) ? My working version adds even more. I think a lot of it should be made protected, or at least, I should be more thoughtful about what it should be. I think "set" is probably enough, but perhaps it should just be the void version? I guess this is where working with ruby where everything generally returns "this" by default, thus allowing for nice chaining of methods, is the norm. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 191 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line191 > > > you could add a properties file to the jar with this info. Or a generated factory that knows all the classes generated. You mainly need the mapping schema -> id > > you could also use true. Although that would not be THE answer to THE question. I think the jar properties is the way to go, at least as the ultimate source of the classes that were generated. Alternately, it'd be nice if there was an easy way to browse a jar for certain class files, but I couldn't find a clean one? If there was, we could just search for SchemaTuple_*.class and keep track of which files went across. Do you know if there is an easier way to search jars like that? I can dig around as well. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 209 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line209 > > > As we have schema aware Tuples, in the future we could add get(String fieldName). We may not want to strip the aliases to reuse the class. Hmm, it seems pretty silly that two Schemas that are essentially "int,int,long" except with different names would necessitate a different class. More importantly, however, I ran into cases where schema names were different due to the way that names were generated by different pieces of code, and so the lookup was thwarted if you were strict about names. Hmm hmm hmm. Do you think the win from get(String) is worth the potential complication? It could be convenient... On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 270-273 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line270 > > > Logger woops, I usually mark these //remove and whack them before the patch goes out... On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 268 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line268 > > > there's only one code path where updateMaps is true. It could be simplified a bit. agreed On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 310-321 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line310 > > > refactor DataByteArray to provide this? So here is the thing about this. Currently, a ton of the useful methods in BinInterSedes (where this logic was taken from) is private. Also, they aren't static (though there's no real reason why they shouldn't be?) This snippet (and other logic like it) should probably be made into a protected static method of BinInterSedes. Thoughts? It could also make sense for classes to have a static method to serialize themselves, but I'd argue that the BinInterSedes approach is probably more consistent with how pig is laid out (though I have no idea why most of the methods there are private instead of protected static). On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 325-333 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line325 > > > Same comment, the String serialization should be shared with the regular Tuple. > > Also we could keep the a UTF8 in the SchemaTuple to reduce memory usage and convert lazilyto String (possibly caching the result) This logic is stolen directly from BinInterSedes (see comment above). Can you flesh out what you mean about UTF8? If we change it here we should probably change it there as well (and make that the canonical protected static source of all your string serialization needs). On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 344-353 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line344 > > > same comment as before. > The same logic should be use with regular tuples. > The same logic should be shared across types. > > It could also use a variable size int format. > https://developers.google.com/protocol-buffers/docs/encoding#types varints is a good idea. general question of where this logic should live still holds as well, as this logic is also directly ripped from BinInterSedes. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 452 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line452 > > > is (null, null, null) the same as null ? In DefaultTuple, this is deprecated and not really used. Given that, I use this to basically mean "has any information been written to this." So in your case, both would be null. The more important question is to make sure I use it consistently throughout. Honestly, the whole null thing is a pain, and I need to really comb over things to make sure I incorporated it properly. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 474 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line474 > > > this.set(t) ? reference is this weird Franken-method that isn't used anywhere in the codebase. I don't know that I want to implement it when the semantics of what it should do don't seem clear at all. Open to thoughts on this though. The original intent of it doesn't make much sense for a SchemaTuple since the underlying structures are primitives, not objects. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 479 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line479 > > > DefaultTuple replaces null with "" Good idea On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 484 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line484 > > > We could have a list implementation that just wraps a tuple. Hmm, I think this is an intriguing idea, though it might get sills, because that means you'd have a list which wraps a tuple which wraps a list. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 495 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line495 > > > yep, sounds like it could be generated and fall back to this if the other is a regular tuple My working copy almost has this finished. there are three comparison functions: compareTo(Object), compareTo(SchemaTuple), and compareToSpecific(SchemaTuple_). It will check (unless you tell it not to) if you were actually given a SchemaTuple_X (or just a schemaTuple) and do progressively faster comparisons. I also need to generate a RawComparator but I'd rather clean everything else up first (especially the layout of the API and where pieces live). On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 535 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line535 > > > you may want to log compilation warnings/errors Is there a need to log it if the runtime exception is going to be thrown anyway? Because those errors get written to stderr. But I suppose I could pipe them through the Logger, if that's what you mean? On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 545 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line545 > > > Class.forName(className) uses the classLoader of the current class. nice! On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 585 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line585 > > > if it does not need to be public then yes it depends heavily on how we end up forcing people to make these things. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 609-635 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line609 > > > the parameter t is thrown away ? > > they don't need to be public. This is more for conciseness in the code that generates the SchemaTuple_ code. It allows me to make a bunch of if then elses smaller. I suppose there are other ways to do it that don't involve a method call... I go back and forth on that. But the reason is so later on I can do: add(" setPos_X(SchemaTuple.unbox(val, pos_X));"); which means that the right one will be based on the type of the field being filled (ie saving me a bunch of lines, but perhaps that should be trumped by cleanliness of the output code/API). On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 637-659 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line637 > > > depending on how those are used, java already does auto un-boxing. > The following works so you may not need them: > long unbox(Long v) { bq. > return v; bq. > } > > they don't need to be public Good point. I hate auto-boxing/un-boxing as a language feature, but if it can be utilized to clean things up it is worth it. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 667-685 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line667 > > > same comment, the following works: > Float box(float v) { bq. > return v; bq. > } > > they don't need to be public Yeah same as above as well. I think I can actually also leverage auto boxing/unboxing deeper in the code to remove some of the need for these functions. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 698 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line698 > > > this file has 1500 lines > I think those classes have earned their own package. Yeah, I mean, you're right... but the logic is EXTREMELY narrow, which is why I don't like the idea of splitting it out. Yeah, it makes the file larger, but it makes 0 sense outside of the context of what is going on here. Perhaps that should be rectified? On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 857-860 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line857 > > > this could be a method in the parent class do you mean in SchemaTuple, or in TypeInFunctionStringOut? On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1138 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1138 > > > CheckIfNullString.masks > maybe they should be in a Masks class then, along with the masking code generation (" & (byte)+...) I don't know how worthwhile it is to split all of this stuff out into classes. Maybe it is. A lot of stuff is used once or twice in disparate places, and I fear that splitting it out more and more will just make people jump through 50 classes to understand what a line of code gen is doing. On the other hand, it would make fixing bugs much cleaner. What probably needs to be done is some sort of cleaner code generation framework... would appreciate more thoughts here. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1283 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1283 > > > This feels weird. > what about: > "SchemaTuple.unbox((" getBoxedTypeName() ")appendGet(diff))" > > getBoxedTypeName() { > switch (type) { > case (DataType.INTEGER): return "Integer"; > ... I actually was able to replace it with: add(" case (" fieldNum "): return " fs.type ";"); sometimes I miss the forest for the trees. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1413 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1413 > > > use http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/StringBuilder.html for all those string concatenations. yeahh this is a good idea. I doubt the code gen takes up much time at all but given the gen'd files are potentially large, it's probably good practice On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 1522-1524 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1522 > > > what does isObjNotTup mean? What is this case? it was an obscure case that was obsolete. I removed it Jonathan ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/#review6722 ----------------------------------------------------------- On 2012-04-05 01:31:00, Jonathan Coveney wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/ ----------------------------------------------------------- (Updated 2012-04-05 01:31:00) Review request for Julien Le Dem. Summary ------- This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge. Need to clean up the code and add tests. Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge. Needs tests and comments, but I want the code to settle a bit. This addresses bug PIG-2632 . https://issues.apache.org/jira/browse/PIG-2632 Diffs ----- trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1309628 trunk/src/org/apache/pig/data/BinInterSedes.java 1309628 trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628 trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION trunk/src/org/apache/pig/data/TupleFactory.java 1309628 trunk/src/org/apache/pig/impl/PigContext.java 1309628 trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1309628 Diff: https://reviews.apache.org/r/4651/diff Testing ------- Thanks, Jonathan
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, line 133

          > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line133>

          >

          > it would be nice if Pig called outputSchema() once during the init and that was saved somewhere.

          Jonathan Coveney wrote:

          It calls it many times, alas. I feel like changing that is a job for another patch, but I could tackle that here. But given there is already the expectation that it will be called an arbitrary number of times, I don't think that calling it once more on the front end will be an issue?

          I agree,let's not add to this already big patch. This should be dealt with in a separate patch.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, lines 151-152

          > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line151>

          >

          > those are the inputSchema and the outputSchema. The fields should probably called just that and have boolean flag to know if we use generated Tuples?

          Jonathan Coveney wrote:

          Well, we need to know what the tuple to generate is. I will change it save the ID that serves as the base, but as is the Schema info isn't saved and it isn't available.

          I was more commenting on the naming. Those are the input and output schema independently of code generation.

          Knowing if you can generate or not could be saved with a boolean.

          Possibly you don't even need to save the information than can generate the tuples or not. It could fall back to the DefaultTuple.

          (Anyway, this one is not very important)

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, line 188

          > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line188>

          >

          > I would make it:

          > TupleFactory.geInstanceForSchema().newTuple()

          Jonathan Coveney wrote:

          I think it's worth thinking about how we want to do this. Dmitriy implemented newTupleForSchema for the PrimitiveTuple work, and he suggested I use that for this... however I agree, I think there should be a "SchemaTupleFactory" that generates SchemaTuples of a given SchemaTuple. At the same time, that might be a lot of scaffolding for not a lot of gain: once you generate the code for a given SchemaTuple, there's not a lot of work to be done (though this could encapsulate some of the static maps that come later). One other factor is that the TupleFactory interface doesn't really lend itself very nicely to the SchemaTuple, but I do think it could be extended and be made to work. Would appreciate more of your thoughts on this.

          I meant:
          TupleFactory.geInstanceForSchema(inputSchemaForGen).newTuple()

          The change is relatively small, it is just moving this from one class to the other and getting rid of the Maps. PrimitiveTuple should follow the same logic.

          Asking for the schema in the factory initialization will simplify the code. As you say, you won't need to constantly look up the schema. Also pig is about writing a lot of tuples of the same type.

          The block of code (see 2 comments above) that triggers the code generation would just get the factories instead of actually generating Tuples. This is caused by the api asking for the schema in the wrong place.

          TupleFactory is an abstract class so it should be doable to add methods without breaking compatibility.

          If needed, backard compatibility can be maintained by having newTupleForSchema(inputSchemaForGen) call TupleFactory.geInstanceForSchema(inputSchemaForGen).newTuple()

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 70

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line70>

          >

          > largestSetValue should already be correct. Why does largestSetValue stays at the size of the schema?

          Jonathan Coveney wrote:

          I guess once it is at the value of the SchemaTuple (excluding appends), there's no need to update it.

          Correct but know you have to check this every time you update. I was thinking it would simplify the code a little if it was just the actual size of the tuple (including past the original size).

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 105

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line105>

          >

          > if it is always overridden, it should be abstract

          Jonathan Coveney wrote:

          It's overriden, but the overriden version calls super. Right now it returns 0 because I don't want to take the time to calculate it out while the class is still evolving.

          Ok, let's make it abstract once we get to this.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 125

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line125>

          >

          > redundant with the 2 others?

          > idClassMap.get(schemaIdMap.get(schema))

          Jonathan Coveney wrote:

          Perhaps this is what a SchemaTupleFactory would make cleaner, but my desire is for Tuple creation to be as fast as possible. This is why there are 5 maps... my thought is that they are all static, so the extra cost of caching the extra info is worth avoiding two gets in a row to link the info together when we need to make Tuples. Am open to thoughts.

          I like the SchemaTupleFactory approach. It would alleviate the need for this.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 130-142

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line130>

          >

          > could be computed statically by the generated class if you make a static getSchema(schemaString).

          Jonathan Coveney wrote:

          Well, my thought is that a static cache wouldn't take up much memory, and it would avoid having to recreate the Schema object every time getSchema() is called. I don't forsee it being called a ton, but possibly enough that recreating the object a bunch would be wasteful. Perhaps this is premature optimization?

          caching is fine.
          I meant that you could have the Schema object "cached" in a static field of the generated class
          instead of a runtime generation + cache

          in the generated class:

          static schemaString = "....";
          static Schema schema = Utils.getSchemaFromString(schemaString);

          public Schema getSchema()

          { return schema; }

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 144

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line144>

          >

          > setAll() ?

          Jonathan Coveney wrote:

          It's not really a setAll imho. It's setting it equal to. Maybe that's what setAll means, but it seems to convey something different.

          It sounds like setEqualTo() is what set() generally means. when you set something to a value then you expect it to be equal to the value. What do you think?

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 159

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line159>

          >

          > do we need all of set(Tuple), set(List), setEqualTo(List), setEqualTo(Tuple) ?

          Jonathan Coveney wrote:

          My working version adds even more. I think a lot of it should be made protected, or at least, I should be more thoughtful about what it should be. I think "set" is probably enough, but perhaps it should just be the void version? I guess this is where working with ruby where everything generally returns "this" by default, thus allowing for nice chaining of methods, is the norm.

          I'm fine with the chaining mechanism.
          We should try to avoid having both as it makes the code harder to read. You cant set and ignore the return value;

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 191

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line191>

          >

          > you could add a properties file to the jar with this info. Or a generated factory that knows all the classes generated. You mainly need the mapping schema -> id

          >

          > you could also use true. Although that would not be THE answer to THE question.

          Jonathan Coveney wrote:

          I think the jar properties is the way to go, at least as the ultimate source of the classes that were generated. Alternately, it'd be nice if there was an easy way to browse a jar for certain class files, but I couldn't find a clean one? If there was, we could just search for SchemaTuple_*.class and keep track of which files went across. Do you know if there is an easier way to search jars like that? I can dig around as well.

          tl;dr: As you know what classes you generate I would go for a mechanism where you provide the list of classes to load like a properties file in the jar.

          I don't think there is a clean way of scanning a package out of the box. Usually dependency injection containers (like Spring) have a mechanism for scanning the classpath for classes with a given annotation.
          example:
          http://static.springsource.org/spring/docs/3.0.0.M3/spring-framework-reference/html/ch04s12.html
          (I'm not saying we should add dependencies to Pig! Just FYI)

          You could have a simple scanning mechanism using getResource():
          http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/ClassLoader.html#getResource%28java.lang.String%29
          this.getClass().getClassLoader().getResource("my/package/path")
          will return a URL and you can scan yourself the location of the URL.
          If the package is not in a jar and on the local file system, you will get a file:/... url that you can easily make a File of.
          If the package is in a jar, then you will need to use a JarFile to scan the content of the package in the jar.
          The issue here is that the classloading mechanism is extensible: The URL could be a HTTP url or some other that do not provide a listing option. There is no way to have a fully generic scanning mechanism if you don't know the name of the classes you want.

          A search for "enumerate classes in a package" returns examples of people doing this.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 209

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line209>

          >

          > As we have schema aware Tuples, in the future we could add get(String fieldName). We may not want to strip the aliases to reuse the class.

          Jonathan Coveney wrote:

          Hmm, it seems pretty silly that two Schemas that are essentially "int,int,long" except with different names would necessitate a different class. More importantly, however, I ran into cases where schema names were different due to the way that names were generated by different pieces of code, and so the lookup was thwarted if you were strict about names. Hmm hmm hmm. Do you think the win from get(String) is worth the potential complication? It could be convenient...

          I would need to look more into it. I would be interested in seeing the problems you see when not stripping the names.

          This seems changeable later if needed, so let's keep it that way for now.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 270-273

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line270>

          >

          > Logger

          Jonathan Coveney wrote:

          woops, I usually mark these //remove and whack them before the patch goes out...

          no worries. Usually it can be useful to keep debug log statements for future modification.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 310-321

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line310>

          >

          > refactor DataByteArray to provide this?

          Jonathan Coveney wrote:

          So here is the thing about this. Currently, a ton of the useful methods in BinInterSedes (where this logic was taken from) is private. Also, they aren't static (though there's no real reason why they shouldn't be?) This snippet (and other logic like it) should probably be made into a protected static method of BinInterSedes. Thoughts? It could also make sense for classes to have a static method to serialize themselves, but I'd argue that the BinInterSedes approach is probably more consistent with how pig is laid out (though I have no idea why most of the methods there are private instead of protected static).

          private stuff can be safely moved around and refactored (from a compatibility point of view)
          Let's think of a way to have the same code used in all cases. There used to be one type of tuple so what made sense before may need to changee.

          • parent class for both ?
          • TupleSerializer class ?
          • static helpers?

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 325-333

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line325>

          >

          > Same comment, the String serialization should be shared with the regular Tuple.

          >

          > Also we could keep the a UTF8 in the SchemaTuple to reduce memory usage and convert lazilyto String (possibly caching the result)

          Jonathan Coveney wrote:

          This logic is stolen directly from BinInterSedes (see comment above). Can you flesh out what you mean about UTF8? If we change it here we should probably change it there as well (and make that the canonical protected static source of all your string serialization needs).

          Following your goal of reducing memory utilization and as the serialized format is UTF8, we could keep UTF8 as the memory representation instead of String.
          Java Strings are UTF16 which is minimum 2 bytes per characters. If the data is mostly ascii, this is wasteful.
          To maintain backward compatibility the get(int) would lazily convert to String on demand, possibly caching the result in a second field to save processing, but this may be overkill.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 452

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line452>

          >

          > is (null, null, null) the same as null ?

          Jonathan Coveney wrote:

          In DefaultTuple, this is deprecated and not really used. Given that, I use this to basically mean "has any information been written to this." So in your case, both would be null. The more important question is to make sure I use it consistently throughout. Honestly, the whole null thing is a pain, and I need to really comb over things to make sure I incorporated it properly.

          agreed, we should just make sure this is the same behavior as the DefaultTuple

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 474

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line474>

          >

          > this.set(t) ?

          Jonathan Coveney wrote:

          reference is this weird Franken-method that isn't used anywhere in the codebase. I don't know that I want to implement it when the semantics of what it should do don't seem clear at all. Open to thoughts on this though. The original intent of it doesn't make much sense for a SchemaTuple since the underlying structures are primitives, not objects.

          "the underlying structures are primitives, not objects"
          for now, as Bags and Maps will come.
          For backward compatibility, I think we should implement this as a set(). It seems this is intended as a cheap set, not sure if "modifying one modifies the others" behavior is expected. some UDFs could depend on this.

          Also let's mark reference() as deprecated right now so that we can remove it later.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 484

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line484>

          >

          > We could have a list implementation that just wraps a tuple.

          Jonathan Coveney wrote:

          Hmm, I think this is an intriguing idea, though it might get sills, because that means you'd have a list which wraps a tuple which wraps a list.

          this is just for saving a list allocation. A wrapper list would use less memory.

          We can punt this for now.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 495

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line495>

          >

          > yep, sounds like it could be generated and fall back to this if the other is a regular tuple

          Jonathan Coveney wrote:

          My working copy almost has this finished. there are three comparison functions: compareTo(Object), compareTo(SchemaTuple), and compareToSpecific(SchemaTuple_). It will check (unless you tell it not to) if you were actually given a SchemaTuple_X (or just a schemaTuple) and do progressively faster comparisons. I also need to generate a RawComparator but I'd rather clean everything else up first (especially the layout of the API and where pieces live).

          cool

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 535

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line535>

          >

          > you may want to log compilation warnings/errors

          Jonathan Coveney wrote:

          Is there a need to log it if the runtime exception is going to be thrown anyway? Because those errors get written to stderr. But I suppose I could pipe them through the Logger, if that's what you mean?

          Ok.
          You can also get them programatically, but stderr is good too. Let's make sure we don't swamp the output of Pig with too many messages.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 609-635

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line609>

          >

          > the parameter t is thrown away ?

          >

          > they don't need to be public.

          Jonathan Coveney wrote:

          This is more for conciseness in the code that generates the SchemaTuple_ code. It allows me to make a bunch of if then elses smaller. I suppose there are other ways to do it that don't involve a method call... I go back and forth on that. But the reason is so later on I can do:

          add(" setPos_X(SchemaTuple.unbox(val, pos_X));");

          which means that the right one will be based on the type of the field being filled (ie saving me a bunch of lines, but perhaps that should be trumped by cleanliness of the output code/API).

          Ah yeah, this comment is obsoleted by my other comment bellow. I should have removed it.
          I saw how it simplifies the generation later.
          See my proposal for this bellow.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 698

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line698>

          >

          > this file has 1500 lines

          > I think those classes have earned their own package.

          Jonathan Coveney wrote:

          Yeah, I mean, you're right... but the logic is EXTREMELY narrow, which is why I don't like the idea of splitting it out. Yeah, it makes the file larger, but it makes 0 sense outside of the context of what is going on here. Perhaps that should be rectified?

          I don't feel strongly about it.
          Maybe you want to separate the base class SchemaTuple (runtime) from the Generator ?

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 857-860

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line857>

          >

          > this could be a method in the parent class

          Jonathan Coveney wrote:

          do you mean in SchemaTuple, or in TypeInFunctionStringOut?

          Where it makes sense to you.
          based on the way you removed everything that does not need to be generated, you could factor out the error handling and just generate "default: handleDefault(fieldNum)"

          in SchemaTuple?

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1138

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1138>

          >

          > CheckIfNullString.masks

          > maybe they should be in a Masks class then, along with the masking code generation (" & (byte)+...)

          Jonathan Coveney wrote:

          I don't know how worthwhile it is to split all of this stuff out into classes. Maybe it is. A lot of stuff is used once or twice in disparate places, and I fear that splitting it out more and more will just make people jump through 50 classes to understand what a line of code gen is doing. On the other hand, it would make fixing bugs much cleaner. What probably needs to be done is some sort of cleaner code generation framework... would appreciate more thoughts here.

          My goal is to have the bit management thing implemented once so that it is easier to change/improve/bugfix.
          agreed that we should keep things simple.
          If we have a SchemaTupleClassGenerator that contains bit management helpers that could simplify things.
          Separating the runtime from the generation would reduce the amount of things you have to look at at the same time.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1283

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1283>

          >

          > This feels weird.

          > what about:

          > "SchemaTuple.unbox(("getBoxedTypeName()")appendGet(diff))"

          >

          > getBoxedTypeName() {

          > switch (type) {

          > case (DataType.INTEGER): return "Integer";

          > ...

          Jonathan Coveney wrote:

          I actually was able to replace it with:

          add(" case ("fieldNum"): return "fs.type";");

          sometimes I miss the forest for the trees.

          cool

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1413

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1413>

          >

          > use http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/StringBuilder.html for all those string concatenations.

          Jonathan Coveney wrote:

          yeahh this is a good idea. I doubt the code gen takes up much time at all but given the gen'd files are potentially large, it's probably good practice

          a concatenation copies both strings, so concatenating large strings over and over can become slow.

          Note that "a""B"+foo+bar"c" gets replaced by the compiler by new StringBuilder().append("a").append("B").append(foo).append(bar).append("c").toString()
          But you still concatenate those to other strings so will end up duplicating large strings.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 1522-1524

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1522>

          >

          > what does isObjNotTup mean? What is this case?

          Jonathan Coveney wrote:

          it was an obscure case that was obsolete. I removed it

          ok

          • Julien

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4651/#review6722
          -----------------------------------------------------------

          On 2012-04-05 01:31:00, Jonathan Coveney wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/4651/

          -----------------------------------------------------------

          (Updated 2012-04-05 01:31:00)

          Review request for Julien Le Dem.

          Summary

          -------

          This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge.

          Need to clean up the code and add tests.

          Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge.

          Needs tests and comments, but I want the code to settle a bit.

          This addresses bug PIG-2632.

          https://issues.apache.org/jira/browse/PIG-2632

          Diffs

          -----

          trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1309628

          trunk/src/org/apache/pig/data/BinInterSedes.java 1309628

          trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628

          trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION

          trunk/src/org/apache/pig/data/TupleFactory.java 1309628

          trunk/src/org/apache/pig/impl/PigContext.java 1309628

          trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1309628

          Diff: https://reviews.apache.org/r/4651/diff

          Testing

          -------

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, line 133 > < https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line133 > > > it would be nice if Pig called outputSchema() once during the init and that was saved somewhere. Jonathan Coveney wrote: It calls it many times, alas. I feel like changing that is a job for another patch, but I could tackle that here. But given there is already the expectation that it will be called an arbitrary number of times, I don't think that calling it once more on the front end will be an issue? I agree,let's not add to this already big patch. This should be dealt with in a separate patch. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, lines 151-152 > < https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line151 > > > those are the inputSchema and the outputSchema. The fields should probably called just that and have boolean flag to know if we use generated Tuples? Jonathan Coveney wrote: Well, we need to know what the tuple to generate is. I will change it save the ID that serves as the base, but as is the Schema info isn't saved and it isn't available. I was more commenting on the naming. Those are the input and output schema independently of code generation. Knowing if you can generate or not could be saved with a boolean. Possibly you don't even need to save the information than can generate the tuples or not. It could fall back to the DefaultTuple. (Anyway, this one is not very important) On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, line 188 > < https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line188 > > > I would make it: > TupleFactory.geInstanceForSchema().newTuple() Jonathan Coveney wrote: I think it's worth thinking about how we want to do this. Dmitriy implemented newTupleForSchema for the PrimitiveTuple work, and he suggested I use that for this... however I agree, I think there should be a "SchemaTupleFactory" that generates SchemaTuples of a given SchemaTuple. At the same time, that might be a lot of scaffolding for not a lot of gain: once you generate the code for a given SchemaTuple, there's not a lot of work to be done (though this could encapsulate some of the static maps that come later). One other factor is that the TupleFactory interface doesn't really lend itself very nicely to the SchemaTuple, but I do think it could be extended and be made to work. Would appreciate more of your thoughts on this. I meant: TupleFactory.geInstanceForSchema(inputSchemaForGen).newTuple() The change is relatively small, it is just moving this from one class to the other and getting rid of the Maps. PrimitiveTuple should follow the same logic. Asking for the schema in the factory initialization will simplify the code. As you say, you won't need to constantly look up the schema. Also pig is about writing a lot of tuples of the same type. The block of code (see 2 comments above) that triggers the code generation would just get the factories instead of actually generating Tuples. This is caused by the api asking for the schema in the wrong place. TupleFactory is an abstract class so it should be doable to add methods without breaking compatibility. If needed, backard compatibility can be maintained by having newTupleForSchema(inputSchemaForGen) call TupleFactory.geInstanceForSchema(inputSchemaForGen).newTuple() On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 70 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line70 > > > largestSetValue should already be correct. Why does largestSetValue stays at the size of the schema? Jonathan Coveney wrote: I guess once it is at the value of the SchemaTuple (excluding appends), there's no need to update it. Correct but know you have to check this every time you update. I was thinking it would simplify the code a little if it was just the actual size of the tuple (including past the original size). On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 105 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line105 > > > if it is always overridden, it should be abstract Jonathan Coveney wrote: It's overriden, but the overriden version calls super. Right now it returns 0 because I don't want to take the time to calculate it out while the class is still evolving. Ok, let's make it abstract once we get to this. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 125 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line125 > > > redundant with the 2 others? > idClassMap.get(schemaIdMap.get(schema)) Jonathan Coveney wrote: Perhaps this is what a SchemaTupleFactory would make cleaner, but my desire is for Tuple creation to be as fast as possible. This is why there are 5 maps... my thought is that they are all static, so the extra cost of caching the extra info is worth avoiding two gets in a row to link the info together when we need to make Tuples. Am open to thoughts. I like the SchemaTupleFactory approach. It would alleviate the need for this. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 130-142 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line130 > > > could be computed statically by the generated class if you make a static getSchema(schemaString). Jonathan Coveney wrote: Well, my thought is that a static cache wouldn't take up much memory, and it would avoid having to recreate the Schema object every time getSchema() is called. I don't forsee it being called a ton, but possibly enough that recreating the object a bunch would be wasteful. Perhaps this is premature optimization? caching is fine. I meant that you could have the Schema object "cached" in a static field of the generated class instead of a runtime generation + cache in the generated class: static schemaString = "...."; static Schema schema = Utils.getSchemaFromString(schemaString); public Schema getSchema() { return schema; } On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 144 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line144 > > > setAll() ? Jonathan Coveney wrote: It's not really a setAll imho. It's setting it equal to. Maybe that's what setAll means, but it seems to convey something different. It sounds like setEqualTo() is what set() generally means. when you set something to a value then you expect it to be equal to the value. What do you think? On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 159 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line159 > > > do we need all of set(Tuple), set(List), setEqualTo(List), setEqualTo(Tuple) ? Jonathan Coveney wrote: My working version adds even more. I think a lot of it should be made protected, or at least, I should be more thoughtful about what it should be. I think "set" is probably enough, but perhaps it should just be the void version? I guess this is where working with ruby where everything generally returns "this" by default, thus allowing for nice chaining of methods, is the norm. I'm fine with the chaining mechanism. We should try to avoid having both as it makes the code harder to read. You cant set and ignore the return value; On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 191 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line191 > > > you could add a properties file to the jar with this info. Or a generated factory that knows all the classes generated. You mainly need the mapping schema -> id > > you could also use true. Although that would not be THE answer to THE question. Jonathan Coveney wrote: I think the jar properties is the way to go, at least as the ultimate source of the classes that were generated. Alternately, it'd be nice if there was an easy way to browse a jar for certain class files, but I couldn't find a clean one? If there was, we could just search for SchemaTuple_*.class and keep track of which files went across. Do you know if there is an easier way to search jars like that? I can dig around as well. tl;dr: As you know what classes you generate I would go for a mechanism where you provide the list of classes to load like a properties file in the jar. I don't think there is a clean way of scanning a package out of the box. Usually dependency injection containers (like Spring) have a mechanism for scanning the classpath for classes with a given annotation. example: http://static.springsource.org/spring/docs/3.0.0.M3/spring-framework-reference/html/ch04s12.html (I'm not saying we should add dependencies to Pig! Just FYI) You could have a simple scanning mechanism using getResource(): http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/ClassLoader.html#getResource%28java.lang.String%29 this.getClass().getClassLoader().getResource("my/package/path") will return a URL and you can scan yourself the location of the URL. If the package is not in a jar and on the local file system, you will get a file:/ ... url that you can easily make a File of. If the package is in a jar, then you will need to use a JarFile to scan the content of the package in the jar. The issue here is that the classloading mechanism is extensible: The URL could be a HTTP url or some other that do not provide a listing option. There is no way to have a fully generic scanning mechanism if you don't know the name of the classes you want. A search for "enumerate classes in a package" returns examples of people doing this. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 209 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line209 > > > As we have schema aware Tuples, in the future we could add get(String fieldName). We may not want to strip the aliases to reuse the class. Jonathan Coveney wrote: Hmm, it seems pretty silly that two Schemas that are essentially "int,int,long" except with different names would necessitate a different class. More importantly, however, I ran into cases where schema names were different due to the way that names were generated by different pieces of code, and so the lookup was thwarted if you were strict about names. Hmm hmm hmm. Do you think the win from get(String) is worth the potential complication? It could be convenient... I would need to look more into it. I would be interested in seeing the problems you see when not stripping the names. This seems changeable later if needed, so let's keep it that way for now. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 270-273 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line270 > > > Logger Jonathan Coveney wrote: woops, I usually mark these //remove and whack them before the patch goes out... no worries. Usually it can be useful to keep debug log statements for future modification. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 310-321 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line310 > > > refactor DataByteArray to provide this? Jonathan Coveney wrote: So here is the thing about this. Currently, a ton of the useful methods in BinInterSedes (where this logic was taken from) is private. Also, they aren't static (though there's no real reason why they shouldn't be?) This snippet (and other logic like it) should probably be made into a protected static method of BinInterSedes. Thoughts? It could also make sense for classes to have a static method to serialize themselves, but I'd argue that the BinInterSedes approach is probably more consistent with how pig is laid out (though I have no idea why most of the methods there are private instead of protected static). private stuff can be safely moved around and refactored (from a compatibility point of view) Let's think of a way to have the same code used in all cases. There used to be one type of tuple so what made sense before may need to changee. parent class for both ? TupleSerializer class ? static helpers? On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 325-333 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line325 > > > Same comment, the String serialization should be shared with the regular Tuple. > > Also we could keep the a UTF8 in the SchemaTuple to reduce memory usage and convert lazilyto String (possibly caching the result) Jonathan Coveney wrote: This logic is stolen directly from BinInterSedes (see comment above). Can you flesh out what you mean about UTF8? If we change it here we should probably change it there as well (and make that the canonical protected static source of all your string serialization needs). Following your goal of reducing memory utilization and as the serialized format is UTF8, we could keep UTF8 as the memory representation instead of String. Java Strings are UTF16 which is minimum 2 bytes per characters. If the data is mostly ascii, this is wasteful. To maintain backward compatibility the get(int) would lazily convert to String on demand, possibly caching the result in a second field to save processing, but this may be overkill. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 452 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line452 > > > is (null, null, null) the same as null ? Jonathan Coveney wrote: In DefaultTuple, this is deprecated and not really used. Given that, I use this to basically mean "has any information been written to this." So in your case, both would be null. The more important question is to make sure I use it consistently throughout. Honestly, the whole null thing is a pain, and I need to really comb over things to make sure I incorporated it properly. agreed, we should just make sure this is the same behavior as the DefaultTuple On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 474 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line474 > > > this.set(t) ? Jonathan Coveney wrote: reference is this weird Franken-method that isn't used anywhere in the codebase. I don't know that I want to implement it when the semantics of what it should do don't seem clear at all. Open to thoughts on this though. The original intent of it doesn't make much sense for a SchemaTuple since the underlying structures are primitives, not objects. "the underlying structures are primitives, not objects" for now, as Bags and Maps will come. For backward compatibility, I think we should implement this as a set(). It seems this is intended as a cheap set, not sure if "modifying one modifies the others" behavior is expected. some UDFs could depend on this. Also let's mark reference() as deprecated right now so that we can remove it later. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 484 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line484 > > > We could have a list implementation that just wraps a tuple. Jonathan Coveney wrote: Hmm, I think this is an intriguing idea, though it might get sills, because that means you'd have a list which wraps a tuple which wraps a list. this is just for saving a list allocation. A wrapper list would use less memory. We can punt this for now. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 495 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line495 > > > yep, sounds like it could be generated and fall back to this if the other is a regular tuple Jonathan Coveney wrote: My working copy almost has this finished. there are three comparison functions: compareTo(Object), compareTo(SchemaTuple), and compareToSpecific(SchemaTuple_). It will check (unless you tell it not to) if you were actually given a SchemaTuple_X (or just a schemaTuple) and do progressively faster comparisons. I also need to generate a RawComparator but I'd rather clean everything else up first (especially the layout of the API and where pieces live). cool On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 535 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line535 > > > you may want to log compilation warnings/errors Jonathan Coveney wrote: Is there a need to log it if the runtime exception is going to be thrown anyway? Because those errors get written to stderr. But I suppose I could pipe them through the Logger, if that's what you mean? Ok. You can also get them programatically, but stderr is good too. Let's make sure we don't swamp the output of Pig with too many messages. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 609-635 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line609 > > > the parameter t is thrown away ? > > they don't need to be public. Jonathan Coveney wrote: This is more for conciseness in the code that generates the SchemaTuple_ code. It allows me to make a bunch of if then elses smaller. I suppose there are other ways to do it that don't involve a method call... I go back and forth on that. But the reason is so later on I can do: add(" setPos_X(SchemaTuple.unbox(val, pos_X));"); which means that the right one will be based on the type of the field being filled (ie saving me a bunch of lines, but perhaps that should be trumped by cleanliness of the output code/API). Ah yeah, this comment is obsoleted by my other comment bellow. I should have removed it. I saw how it simplifies the generation later. See my proposal for this bellow. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 698 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line698 > > > this file has 1500 lines > I think those classes have earned their own package. Jonathan Coveney wrote: Yeah, I mean, you're right... but the logic is EXTREMELY narrow, which is why I don't like the idea of splitting it out. Yeah, it makes the file larger, but it makes 0 sense outside of the context of what is going on here. Perhaps that should be rectified? I don't feel strongly about it. Maybe you want to separate the base class SchemaTuple (runtime) from the Generator ? On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 857-860 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line857 > > > this could be a method in the parent class Jonathan Coveney wrote: do you mean in SchemaTuple, or in TypeInFunctionStringOut? Where it makes sense to you. based on the way you removed everything that does not need to be generated, you could factor out the error handling and just generate "default: handleDefault(fieldNum)" in SchemaTuple? On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1138 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1138 > > > CheckIfNullString.masks > maybe they should be in a Masks class then, along with the masking code generation (" & (byte)+...) Jonathan Coveney wrote: I don't know how worthwhile it is to split all of this stuff out into classes. Maybe it is. A lot of stuff is used once or twice in disparate places, and I fear that splitting it out more and more will just make people jump through 50 classes to understand what a line of code gen is doing. On the other hand, it would make fixing bugs much cleaner. What probably needs to be done is some sort of cleaner code generation framework... would appreciate more thoughts here. My goal is to have the bit management thing implemented once so that it is easier to change/improve/bugfix. agreed that we should keep things simple. If we have a SchemaTupleClassGenerator that contains bit management helpers that could simplify things. Separating the runtime from the generation would reduce the amount of things you have to look at at the same time. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1283 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1283 > > > This feels weird. > what about: > "SchemaTuple.unbox((" getBoxedTypeName() ")appendGet(diff))" > > getBoxedTypeName() { > switch (type) { > case (DataType.INTEGER): return "Integer"; > ... Jonathan Coveney wrote: I actually was able to replace it with: add(" case (" fieldNum "): return " fs.type ";"); sometimes I miss the forest for the trees. cool On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1413 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1413 > > > use http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/StringBuilder.html for all those string concatenations. Jonathan Coveney wrote: yeahh this is a good idea. I doubt the code gen takes up much time at all but given the gen'd files are potentially large, it's probably good practice a concatenation copies both strings, so concatenating large strings over and over can become slow. Note that "a" "B"+foo+bar "c" gets replaced by the compiler by new StringBuilder().append("a").append("B").append(foo).append(bar).append("c").toString() But you still concatenate those to other strings so will end up duplicating large strings. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 1522-1524 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1522 > > > what does isObjNotTup mean? What is this case? Jonathan Coveney wrote: it was an obscure case that was obsolete. I removed it ok Julien ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/#review6722 ----------------------------------------------------------- On 2012-04-05 01:31:00, Jonathan Coveney wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/ ----------------------------------------------------------- (Updated 2012-04-05 01:31:00) Review request for Julien Le Dem. Summary ------- This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge. Need to clean up the code and add tests. Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge. Needs tests and comments, but I want the code to settle a bit. This addresses bug PIG-2632 . https://issues.apache.org/jira/browse/PIG-2632 Diffs ----- trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1309628 trunk/src/org/apache/pig/data/BinInterSedes.java 1309628 trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628 trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION trunk/src/org/apache/pig/data/TupleFactory.java 1309628 trunk/src/org/apache/pig/impl/PigContext.java 1309628 trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1309628 Diff: https://reviews.apache.org/r/4651/diff Testing ------- Thanks, Jonathan
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4651/
          -----------------------------------------------------------

          (Updated 2012-04-06 17:20:20.566618)

          Review request for pig and Julien Le Dem.

          Summary
          -------

          This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge.

          Need to clean up the code and add tests.

          Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge.

          Needs tests and comments, but I want the code to settle a bit.

          This addresses bug PIG-2632.
          https://issues.apache.org/jira/browse/PIG-2632

          Diffs


          trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628
          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1309628
          trunk/src/org/apache/pig/data/BinInterSedes.java 1309628
          trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628
          trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION
          trunk/src/org/apache/pig/data/TupleFactory.java 1309628
          trunk/src/org/apache/pig/impl/PigContext.java 1309628
          trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1309628

          Diff: https://reviews.apache.org/r/4651/diff

          Testing
          -------

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/ ----------------------------------------------------------- (Updated 2012-04-06 17:20:20.566618) Review request for pig and Julien Le Dem. Summary ------- This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge. Need to clean up the code and add tests. Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge. Needs tests and comments, but I want the code to settle a bit. This addresses bug PIG-2632 . https://issues.apache.org/jira/browse/PIG-2632 Diffs trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1309628 trunk/src/org/apache/pig/data/BinInterSedes.java 1309628 trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628 trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION trunk/src/org/apache/pig/data/TupleFactory.java 1309628 trunk/src/org/apache/pig/impl/PigContext.java 1309628 trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1309628 Diff: https://reviews.apache.org/r/4651/diff Testing ------- Thanks, Jonathan
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, line 188

          > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line188>

          >

          > I would make it:

          > TupleFactory.geInstanceForSchema().newTuple()

          Jonathan Coveney wrote:

          I think it's worth thinking about how we want to do this. Dmitriy implemented newTupleForSchema for the PrimitiveTuple work, and he suggested I use that for this... however I agree, I think there should be a "SchemaTupleFactory" that generates SchemaTuples of a given SchemaTuple. At the same time, that might be a lot of scaffolding for not a lot of gain: once you generate the code for a given SchemaTuple, there's not a lot of work to be done (though this could encapsulate some of the static maps that come later). One other factor is that the TupleFactory interface doesn't really lend itself very nicely to the SchemaTuple, but I do think it could be extended and be made to work. Would appreciate more of your thoughts on this.

          Julien Le Dem wrote:

          I meant:

          TupleFactory.geInstanceForSchema(inputSchemaForGen).newTuple()

          The change is relatively small, it is just moving this from one class to the other and getting rid of the Maps. PrimitiveTuple should follow the same logic.

          Asking for the schema in the factory initialization will simplify the code. As you say, you won't need to constantly look up the schema. Also pig is about writing a lot of tuples of the same type.

          The block of code (see 2 comments above) that triggers the code generation would just get the factories instead of actually generating Tuples. This is caused by the api asking for the schema in the wrong place.

          TupleFactory is an abstract class so it should be doable to add methods without breaking compatibility.

          If needed, backard compatibility can be maintained by having newTupleForSchema(inputSchemaForGen) call TupleFactory.geInstanceForSchema(inputSchemaForGen).newTuple()

          I'd like your thoughts fleshing out the TupleFactory interface for a SchemaTupleFactory (which imho at this point makes the most sense – TupleFactory.getInstanceForSchema(inputSchema); returns a SchemaTupleFactory which extends TupleFactory).

          newTuple();
          easy

          newTuple(int size);
          meaningless, throw an error?

          newTuple(List c);
          uses set(List object), throws error otherwise?

          newTuple(Object datum);
          meaningless?

          newTupleNoCopy(List list);
          same as newTuple(list);

          If that sounds reasonable, I like this approach

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 70

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line70>

          >

          > largestSetValue should already be correct. Why does largestSetValue stays at the size of the schema?

          Jonathan Coveney wrote:

          I guess once it is at the value of the SchemaTuple (excluding appends), there's no need to update it.

          Julien Le Dem wrote:

          Correct but know you have to check this every time you update. I was thinking it would simplify the code a little if it was just the actual size of the tuple (including past the original size).

          Well, I guess that depends how we define the size of the tuple. To me, the size = number of fields (independently of whether they are set) + number of append fields. The reason we need this field, then, is to keep track of whether or not we've set one of the generated fields. We could redefine the size to be the largest non-null field, but that's going to make it slower for minimal gain. Thoughts?

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 130-142

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line130>

          >

          > could be computed statically by the generated class if you make a static getSchema(schemaString).

          Jonathan Coveney wrote:

          Well, my thought is that a static cache wouldn't take up much memory, and it would avoid having to recreate the Schema object every time getSchema() is called. I don't forsee it being called a ton, but possibly enough that recreating the object a bunch would be wasteful. Perhaps this is premature optimization?

          Julien Le Dem wrote:

          caching is fine.

          I meant that you could have the Schema object "cached" in a static field of the generated class

          instead of a runtime generation + cache

          in the generated class:

          static schemaString = "....";

          static Schema schema = Utils.getSchemaFromString(schemaString);

          public Schema getSchema() { bq. return schema; bq. }

          Ah, very good call

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 159

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line159>

          >

          > do we need all of set(Tuple), set(List), setEqualTo(List), setEqualTo(Tuple) ?

          Jonathan Coveney wrote:

          My working version adds even more. I think a lot of it should be made protected, or at least, I should be more thoughtful about what it should be. I think "set" is probably enough, but perhaps it should just be the void version? I guess this is where working with ruby where everything generally returns "this" by default, thus allowing for nice chaining of methods, is the norm.

          Julien Le Dem wrote:

          I'm fine with the chaining mechanism.

          We should try to avoid having both as it makes the code harder to read. You cant set and ignore the return value;

          good call

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 310-321

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line310>

          >

          > refactor DataByteArray to provide this?

          Jonathan Coveney wrote:

          So here is the thing about this. Currently, a ton of the useful methods in BinInterSedes (where this logic was taken from) is private. Also, they aren't static (though there's no real reason why they shouldn't be?) This snippet (and other logic like it) should probably be made into a protected static method of BinInterSedes. Thoughts? It could also make sense for classes to have a static method to serialize themselves, but I'd argue that the BinInterSedes approach is probably more consistent with how pig is laid out (though I have no idea why most of the methods there are private instead of protected static).

          Julien Le Dem wrote:

          private stuff can be safely moved around and refactored (from a compatibility point of view)

          Let's think of a way to have the same code used in all cases. There used to be one type of tuple so what made sense before may need to changee.

          - parent class for both ?

          - TupleSerializer class ?

          - static helpers?

          per your suggestion and dmitriy's go ahead on the list serv, I'm moving a bit of the shared logic to a SedesHelper class which both BinInterSedes and SchemaTuple will use

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 325-333

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line325>

          >

          > Same comment, the String serialization should be shared with the regular Tuple.

          >

          > Also we could keep the a UTF8 in the SchemaTuple to reduce memory usage and convert lazilyto String (possibly caching the result)

          Jonathan Coveney wrote:

          This logic is stolen directly from BinInterSedes (see comment above). Can you flesh out what you mean about UTF8? If we change it here we should probably change it there as well (and make that the canonical protected static source of all your string serialization needs).

          Julien Le Dem wrote:

          Following your goal of reducing memory utilization and as the serialized format is UTF8, we could keep UTF8 as the memory representation instead of String.

          Java Strings are UTF16 which is minimum 2 bytes per characters. If the data is mostly ascii, this is wasteful.

          To maintain backward compatibility the get(int) would lazily convert to String on demand, possibly caching the result in a second field to save processing, but this may be overkill.

          What is the latency on utf8 -> utf16 conversion? That's all I'd be worried about but your suggestion is a good one.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 452

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line452>

          >

          > is (null, null, null) the same as null ?

          Jonathan Coveney wrote:

          In DefaultTuple, this is deprecated and not really used. Given that, I use this to basically mean "has any information been written to this." So in your case, both would be null. The more important question is to make sure I use it consistently throughout. Honestly, the whole null thing is a pain, and I need to really comb over things to make sure I incorporated it properly.

          Julien Le Dem wrote:

          agreed, we should just make sure this is the same behavior as the DefaultTuple

          DefaultTuple doesn't implement isNull(). It's just a dummy null implementation. But the difference is that I have fields that can be "null" but aren't actually null (ie primitives), so it's a bit more useful.

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 474

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line474>

          >

          > this.set(t) ?

          Jonathan Coveney wrote:

          reference is this weird Franken-method that isn't used anywhere in the codebase. I don't know that I want to implement it when the semantics of what it should do don't seem clear at all. Open to thoughts on this though. The original intent of it doesn't make much sense for a SchemaTuple since the underlying structures are primitives, not objects.

          Julien Le Dem wrote:

          "the underlying structures are primitives, not objects"

          for now, as Bags and Maps will come.

          For backward compatibility, I think we should implement this as a set(). It seems this is intended as a cheap set, not sure if "modifying one modifies the others" behavior is expected. some UDFs could depend on this.

          Also let's mark reference() as deprecated right now so that we can remove it later.

          I have no problem making this a set. The nice part about doing that is it means you can essentially get the functionality of a set without having to cast it. The downside being the weird semantics around it. Probably a net win. Agree on deprecating reference

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 698

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line698>

          >

          > this file has 1500 lines

          > I think those classes have earned their own package.

          Jonathan Coveney wrote:

          Yeah, I mean, you're right... but the logic is EXTREMELY narrow, which is why I don't like the idea of splitting it out. Yeah, it makes the file larger, but it makes 0 sense outside of the context of what is going on here. Perhaps that should be rectified?

          Julien Le Dem wrote:

          I don't feel strongly about it.

          Maybe you want to separate the base class SchemaTuple (runtime) from the Generator ?

          Agreed, very good suggestion

          On 2012-04-06 03:30:08, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1138

          > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1138>

          >

          > CheckIfNullString.masks

          > maybe they should be in a Masks class then, along with the masking code generation (" & (byte)+...)

          Jonathan Coveney wrote:

          I don't know how worthwhile it is to split all of this stuff out into classes. Maybe it is. A lot of stuff is used once or twice in disparate places, and I fear that splitting it out more and more will just make people jump through 50 classes to understand what a line of code gen is doing. On the other hand, it would make fixing bugs much cleaner. What probably needs to be done is some sort of cleaner code generation framework... would appreciate more thoughts here.

          Julien Le Dem wrote:

          My goal is to have the bit management thing implemented once so that it is easier to change/improve/bugfix.

          agreed that we should keep things simple.

          If we have a SchemaTupleClassGenerator that contains bit management helpers that could simplify things.

          Separating the runtime from the generation would reduce the amount of things you have to look at at the same time.

          +1

          • Jonathan

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4651/#review6722
          -----------------------------------------------------------

          On 2012-04-06 17:20:20, Jonathan Coveney wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/4651/

          -----------------------------------------------------------

          (Updated 2012-04-06 17:20:20)

          Review request for pig and Julien Le Dem.

          Summary

          -------

          This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge.

          Need to clean up the code and add tests.

          Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge.

          Needs tests and comments, but I want the code to settle a bit.

          This addresses bug PIG-2632.

          https://issues.apache.org/jira/browse/PIG-2632

          Diffs

          -----

          trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1309628

          trunk/src/org/apache/pig/data/BinInterSedes.java 1309628

          trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628

          trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION

          trunk/src/org/apache/pig/data/TupleFactory.java 1309628

          trunk/src/org/apache/pig/impl/PigContext.java 1309628

          trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1309628

          Diff: https://reviews.apache.org/r/4651/diff

          Testing

          -------

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, line 188 > < https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line188 > > > I would make it: > TupleFactory.geInstanceForSchema().newTuple() Jonathan Coveney wrote: I think it's worth thinking about how we want to do this. Dmitriy implemented newTupleForSchema for the PrimitiveTuple work, and he suggested I use that for this... however I agree, I think there should be a "SchemaTupleFactory" that generates SchemaTuples of a given SchemaTuple. At the same time, that might be a lot of scaffolding for not a lot of gain: once you generate the code for a given SchemaTuple, there's not a lot of work to be done (though this could encapsulate some of the static maps that come later). One other factor is that the TupleFactory interface doesn't really lend itself very nicely to the SchemaTuple, but I do think it could be extended and be made to work. Would appreciate more of your thoughts on this. Julien Le Dem wrote: I meant: TupleFactory.geInstanceForSchema(inputSchemaForGen).newTuple() The change is relatively small, it is just moving this from one class to the other and getting rid of the Maps. PrimitiveTuple should follow the same logic. Asking for the schema in the factory initialization will simplify the code. As you say, you won't need to constantly look up the schema. Also pig is about writing a lot of tuples of the same type. The block of code (see 2 comments above) that triggers the code generation would just get the factories instead of actually generating Tuples. This is caused by the api asking for the schema in the wrong place. TupleFactory is an abstract class so it should be doable to add methods without breaking compatibility. If needed, backard compatibility can be maintained by having newTupleForSchema(inputSchemaForGen) call TupleFactory.geInstanceForSchema(inputSchemaForGen).newTuple() I'd like your thoughts fleshing out the TupleFactory interface for a SchemaTupleFactory (which imho at this point makes the most sense – TupleFactory.getInstanceForSchema(inputSchema); returns a SchemaTupleFactory which extends TupleFactory). newTuple(); easy newTuple(int size); meaningless, throw an error? newTuple(List c); uses set(List object), throws error otherwise? newTuple(Object datum); meaningless? newTupleNoCopy(List list); same as newTuple(list); If that sounds reasonable, I like this approach On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 70 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line70 > > > largestSetValue should already be correct. Why does largestSetValue stays at the size of the schema? Jonathan Coveney wrote: I guess once it is at the value of the SchemaTuple (excluding appends), there's no need to update it. Julien Le Dem wrote: Correct but know you have to check this every time you update. I was thinking it would simplify the code a little if it was just the actual size of the tuple (including past the original size). Well, I guess that depends how we define the size of the tuple. To me, the size = number of fields (independently of whether they are set) + number of append fields. The reason we need this field, then, is to keep track of whether or not we've set one of the generated fields. We could redefine the size to be the largest non-null field, but that's going to make it slower for minimal gain. Thoughts? On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 130-142 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line130 > > > could be computed statically by the generated class if you make a static getSchema(schemaString). Jonathan Coveney wrote: Well, my thought is that a static cache wouldn't take up much memory, and it would avoid having to recreate the Schema object every time getSchema() is called. I don't forsee it being called a ton, but possibly enough that recreating the object a bunch would be wasteful. Perhaps this is premature optimization? Julien Le Dem wrote: caching is fine. I meant that you could have the Schema object "cached" in a static field of the generated class instead of a runtime generation + cache in the generated class: static schemaString = "...."; static Schema schema = Utils.getSchemaFromString(schemaString); public Schema getSchema() { bq. return schema; bq. } Ah, very good call On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 159 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line159 > > > do we need all of set(Tuple), set(List), setEqualTo(List), setEqualTo(Tuple) ? Jonathan Coveney wrote: My working version adds even more. I think a lot of it should be made protected, or at least, I should be more thoughtful about what it should be. I think "set" is probably enough, but perhaps it should just be the void version? I guess this is where working with ruby where everything generally returns "this" by default, thus allowing for nice chaining of methods, is the norm. Julien Le Dem wrote: I'm fine with the chaining mechanism. We should try to avoid having both as it makes the code harder to read. You cant set and ignore the return value; good call On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 310-321 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line310 > > > refactor DataByteArray to provide this? Jonathan Coveney wrote: So here is the thing about this. Currently, a ton of the useful methods in BinInterSedes (where this logic was taken from) is private. Also, they aren't static (though there's no real reason why they shouldn't be?) This snippet (and other logic like it) should probably be made into a protected static method of BinInterSedes. Thoughts? It could also make sense for classes to have a static method to serialize themselves, but I'd argue that the BinInterSedes approach is probably more consistent with how pig is laid out (though I have no idea why most of the methods there are private instead of protected static). Julien Le Dem wrote: private stuff can be safely moved around and refactored (from a compatibility point of view) Let's think of a way to have the same code used in all cases. There used to be one type of tuple so what made sense before may need to changee. - parent class for both ? - TupleSerializer class ? - static helpers? per your suggestion and dmitriy's go ahead on the list serv, I'm moving a bit of the shared logic to a SedesHelper class which both BinInterSedes and SchemaTuple will use On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 325-333 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line325 > > > Same comment, the String serialization should be shared with the regular Tuple. > > Also we could keep the a UTF8 in the SchemaTuple to reduce memory usage and convert lazilyto String (possibly caching the result) Jonathan Coveney wrote: This logic is stolen directly from BinInterSedes (see comment above). Can you flesh out what you mean about UTF8? If we change it here we should probably change it there as well (and make that the canonical protected static source of all your string serialization needs). Julien Le Dem wrote: Following your goal of reducing memory utilization and as the serialized format is UTF8, we could keep UTF8 as the memory representation instead of String. Java Strings are UTF16 which is minimum 2 bytes per characters. If the data is mostly ascii, this is wasteful. To maintain backward compatibility the get(int) would lazily convert to String on demand, possibly caching the result in a second field to save processing, but this may be overkill. What is the latency on utf8 -> utf16 conversion? That's all I'd be worried about but your suggestion is a good one. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 452 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line452 > > > is (null, null, null) the same as null ? Jonathan Coveney wrote: In DefaultTuple, this is deprecated and not really used. Given that, I use this to basically mean "has any information been written to this." So in your case, both would be null. The more important question is to make sure I use it consistently throughout. Honestly, the whole null thing is a pain, and I need to really comb over things to make sure I incorporated it properly. Julien Le Dem wrote: agreed, we should just make sure this is the same behavior as the DefaultTuple DefaultTuple doesn't implement isNull(). It's just a dummy null implementation. But the difference is that I have fields that can be "null" but aren't actually null (ie primitives), so it's a bit more useful. On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 474 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line474 > > > this.set(t) ? Jonathan Coveney wrote: reference is this weird Franken-method that isn't used anywhere in the codebase. I don't know that I want to implement it when the semantics of what it should do don't seem clear at all. Open to thoughts on this though. The original intent of it doesn't make much sense for a SchemaTuple since the underlying structures are primitives, not objects. Julien Le Dem wrote: "the underlying structures are primitives, not objects" for now, as Bags and Maps will come. For backward compatibility, I think we should implement this as a set(). It seems this is intended as a cheap set, not sure if "modifying one modifies the others" behavior is expected. some UDFs could depend on this. Also let's mark reference() as deprecated right now so that we can remove it later. I have no problem making this a set. The nice part about doing that is it means you can essentially get the functionality of a set without having to cast it. The downside being the weird semantics around it. Probably a net win. Agree on deprecating reference On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 698 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line698 > > > this file has 1500 lines > I think those classes have earned their own package. Jonathan Coveney wrote: Yeah, I mean, you're right... but the logic is EXTREMELY narrow, which is why I don't like the idea of splitting it out. Yeah, it makes the file larger, but it makes 0 sense outside of the context of what is going on here. Perhaps that should be rectified? Julien Le Dem wrote: I don't feel strongly about it. Maybe you want to separate the base class SchemaTuple (runtime) from the Generator ? Agreed, very good suggestion On 2012-04-06 03:30:08, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1138 > < https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1138 > > > CheckIfNullString.masks > maybe they should be in a Masks class then, along with the masking code generation (" & (byte)+...) Jonathan Coveney wrote: I don't know how worthwhile it is to split all of this stuff out into classes. Maybe it is. A lot of stuff is used once or twice in disparate places, and I fear that splitting it out more and more will just make people jump through 50 classes to understand what a line of code gen is doing. On the other hand, it would make fixing bugs much cleaner. What probably needs to be done is some sort of cleaner code generation framework... would appreciate more thoughts here. Julien Le Dem wrote: My goal is to have the bit management thing implemented once so that it is easier to change/improve/bugfix. agreed that we should keep things simple. If we have a SchemaTupleClassGenerator that contains bit management helpers that could simplify things. Separating the runtime from the generation would reduce the amount of things you have to look at at the same time. +1 Jonathan ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/#review6722 ----------------------------------------------------------- On 2012-04-06 17:20:20, Jonathan Coveney wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/ ----------------------------------------------------------- (Updated 2012-04-06 17:20:20) Review request for pig and Julien Le Dem. Summary ------- This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge. Need to clean up the code and add tests. Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge. Needs tests and comments, but I want the code to settle a bit. This addresses bug PIG-2632 . https://issues.apache.org/jira/browse/PIG-2632 Diffs ----- trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1309628 trunk/src/org/apache/pig/data/BinInterSedes.java 1309628 trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628 trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION trunk/src/org/apache/pig/data/TupleFactory.java 1309628 trunk/src/org/apache/pig/impl/PigContext.java 1309628 trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1309628 Diff: https://reviews.apache.org/r/4651/diff Testing ------- Thanks, Jonathan
          Hide
          Scott Carey added a comment -

          varints is a good idea. general question of where this logic should live still holds as well, as this logic is also directly ripped from BinInterSedes.

          If you go that far, it may be better to simply replace BinInterSedes with Avro, which also provides about half of the stuff in this patch for free but is missing a bunch of other things you'd need. If the schema is known, it can be mapped to an Avro schema, and then all the serialization and deserialization details would be free. Code gen could be done with a custom velocity template (an avro extensibility feature) instead of the manual string manipulation here. You'd need to have a custom Avro DatumWriter/DatumReader that handles these schema aware tuples and also handles the pig Tuple contracts.

          This also seems like it requires a JDK and not just a JRE, because you are using javax.tools.JavaCompiler. There are alternative approaches to creating classes at runtime with tools like ASM, CGLIB, or java's dynamic proxies to extend a class without generating Java strings first, but those have more of a learning curve. See also Jackson's MrBean feature http://jackson.codehaus.org/1.9.3/javadoc/org/codehaus/jackson/mrbean/BeanBuilder.html and its source for some interesting examples.

          I am very interested in this but do not have time to help out in the near term.

          Class.forName(className) uses the classLoader of the current class

          Be careful, this approach is dangerous and not OSGi friendly. You may want to consider using the thread local context class loader, or passing in a ClassLoader as a parameter instead of the static binding which ClassLoader is used with a reference to a specific class. Even Main.class could have two copies in two different class loaders. If you know for sure a single class that must be in the same ClassLoader as the classes you are instantiating, it may make sense to get that one and cache it for all your uses.

          Show
          Scott Carey added a comment - varints is a good idea. general question of where this logic should live still holds as well, as this logic is also directly ripped from BinInterSedes. If you go that far, it may be better to simply replace BinInterSedes with Avro, which also provides about half of the stuff in this patch for free but is missing a bunch of other things you'd need. If the schema is known, it can be mapped to an Avro schema, and then all the serialization and deserialization details would be free. Code gen could be done with a custom velocity template (an avro extensibility feature) instead of the manual string manipulation here. You'd need to have a custom Avro DatumWriter/DatumReader that handles these schema aware tuples and also handles the pig Tuple contracts. This also seems like it requires a JDK and not just a JRE, because you are using javax.tools.JavaCompiler. There are alternative approaches to creating classes at runtime with tools like ASM, CGLIB, or java's dynamic proxies to extend a class without generating Java strings first, but those have more of a learning curve. See also Jackson's MrBean feature http://jackson.codehaus.org/1.9.3/javadoc/org/codehaus/jackson/mrbean/BeanBuilder.html and its source for some interesting examples. I am very interested in this but do not have time to help out in the near term. Class.forName(className) uses the classLoader of the current class Be careful, this approach is dangerous and not OSGi friendly. You may want to consider using the thread local context class loader, or passing in a ClassLoader as a parameter instead of the static binding which ClassLoader is used with a reference to a specific class. Even Main.class could have two copies in two different class loaders. If you know for sure a single class that must be in the same ClassLoader as the classes you are instantiating, it may make sense to get that one and cache it for all your uses.
          Hide
          Jonathan Coveney added a comment -

          Scott, I'm cool doing the heavy lifting to get out 1.0, and comments like this are greatly appreciated. We talked about using Avro as well...I think it's a good idea. I think it's probably a good idea for the next gen of this patch?

          I'm more worried in the short term about your point about requiring the JDK... this has been brought up before. Do you think that would be prohibitive for people? I guess it could be an option people could set... but that seems annoying. Going the route of bytecode generation is probably where it should go, but...oy vey. I think that Avro+Bytecode would be a really awesome version 2.0.

          As far as the classloader issue, this sort of thing is where I am not as strong. Do you know of any good resources to read up on these sorts of issues?

          Show
          Jonathan Coveney added a comment - Scott, I'm cool doing the heavy lifting to get out 1.0, and comments like this are greatly appreciated. We talked about using Avro as well...I think it's a good idea. I think it's probably a good idea for the next gen of this patch? I'm more worried in the short term about your point about requiring the JDK... this has been brought up before. Do you think that would be prohibitive for people? I guess it could be an option people could set... but that seems annoying. Going the route of bytecode generation is probably where it should go, but...oy vey. I think that Avro+Bytecode would be a really awesome version 2.0. As far as the classloader issue, this sort of thing is where I am not as strong. Do you know of any good resources to read up on these sorts of issues?
          Hide
          Scott Carey added a comment -

          I completely agree, for a 1.0 version, it doesn't need to be super slick.

          This is a big move in the right direction for making Pig significantly more efficient, and this and Dmitry's prior work have exposed some warts that make these more difficult. I really think the best way to move this forward is to get something out there that just works (although there will be a need to be able to turn it off). We do want to make sure that it is as hidden as possible so that the implementation details can change going forward.

          I am hoping to have time to add some features to Avro that make using it for these type of use cases easier. For example, Avro could operate directly on Pig Schemas without explicit translation if it had some rules on how to interpret them as a schema, and a framework for understanding such rules.

          The JRE/JDK thing is not a big deal for me at all, but is one of those things that tends to get someone somewhere complaining. If the feature can be turned off easily, then that may be good enough.

          (aside)
          If anyone has time to learn how to use ASM and dynamically generate classes for use cases like this, I'm sure it would be useful to them in getting a job in the future. Supply/demand and all that

          Show
          Scott Carey added a comment - I completely agree, for a 1.0 version, it doesn't need to be super slick. This is a big move in the right direction for making Pig significantly more efficient, and this and Dmitry's prior work have exposed some warts that make these more difficult. I really think the best way to move this forward is to get something out there that just works (although there will be a need to be able to turn it off). We do want to make sure that it is as hidden as possible so that the implementation details can change going forward. I am hoping to have time to add some features to Avro that make using it for these type of use cases easier. For example, Avro could operate directly on Pig Schemas without explicit translation if it had some rules on how to interpret them as a schema, and a framework for understanding such rules. The JRE/JDK thing is not a big deal for me at all, but is one of those things that tends to get someone somewhere complaining. If the feature can be turned off easily, then that may be good enough. (aside) If anyone has time to learn how to use ASM and dynamically generate classes for use cases like this, I'm sure it would be useful to them in getting a job in the future. Supply/demand and all that
          Hide
          Julien Le Dem added a comment -
          • Class loading:
            The main issue issue I see is for long running process that execute many Pig queries. In that case Loading classes in the application ClassLoader would be a memory leak and they could possibly fill up the perm space. We don't really need the classes on the FrontEnd for the Map/Reduce execution mode, so the class definition can just be added to the job jar. If we want to use the generated tuples on the Frontend (for example for local mode) they can be generated in a location outside of the classpath (like a temporary folder). Then we can use a URLClassLoader pointing to this location. Discarding the classloader after the execution would let the garbage collector free up this memory. We can even extend ClassLoader so that bytes don't even have to be written to disk.
          • ASM:
            I looked into ASM and the corresponding eclipse plugin. It is pretty cool. You can take the class that was generated and ask the plugin to generate the ASM code that would do the same (not just the bytecode). That should make it relatively easy to move from java source generation to directly code generation.
          • evolution of the generation and fail safe:
            This should be hidden behing a factory so that it can be changed easily. Also if anything goes wrong with generation it should fall back to regular tuple.
            The data storage format being modified is the intermediary format in between Pig jobs or for spills, so we don't need to maintain backward compatibility. Correct?
            If we stick with javacode gen for the first version, it should be easy to check if javax.tools.JavaCompiler is present at runtime and fall back to regular tuples.
          Show
          Julien Le Dem added a comment - Class loading: The main issue issue I see is for long running process that execute many Pig queries. In that case Loading classes in the application ClassLoader would be a memory leak and they could possibly fill up the perm space. We don't really need the classes on the FrontEnd for the Map/Reduce execution mode, so the class definition can just be added to the job jar. If we want to use the generated tuples on the Frontend (for example for local mode) they can be generated in a location outside of the classpath (like a temporary folder). Then we can use a URLClassLoader pointing to this location. Discarding the classloader after the execution would let the garbage collector free up this memory. We can even extend ClassLoader so that bytes don't even have to be written to disk. ASM: I looked into ASM and the corresponding eclipse plugin. It is pretty cool. You can take the class that was generated and ask the plugin to generate the ASM code that would do the same (not just the bytecode). That should make it relatively easy to move from java source generation to directly code generation. evolution of the generation and fail safe: This should be hidden behing a factory so that it can be changed easily. Also if anything goes wrong with generation it should fall back to regular tuple. The data storage format being modified is the intermediary format in between Pig jobs or for spills, so we don't need to maintain backward compatibility. Correct? If we stick with javacode gen for the first version, it should be easy to check if javax.tools.JavaCompiler is present at runtime and fall back to regular tuples.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4651/
          -----------------------------------------------------------

          (Updated 2012-04-08 07:19:58.096859)

          Review request for pig and Julien Le Dem.

          Changes
          -------

          Julien, this incorporates many of your comments, but not all. Mainly, it has the refactoring of the code. A couple existant issues:

          • The classloading is still janky. I'm not quite sure what the best approach is
          • I need to figure out how to register the classes I generate in the jar manifest
          • because of the way the code is generated, protected fields don't quite work. The code doesn't have a package, so only public methods are available. I marked the classes it dependended on as private, but I don't know if that is enough. If it's a big issue, I guess the next thing to do is to figure out how to generate code in a specific package of my choice, and ideally, how to generate the class in memory and add to the jar.
          • And of course some finer points: I need to implement a raw comparator, etc

          But I'd like to know if the general new structure works. Of course it's definitely a big time work in progress, but the comments really help.

          Lastly, I'd like to know how this should interact with PrimitiveTuples. I still think there is a place for them (since SchemaTuples have to be generated on the front end but PrimitiveTuples do not), but the whole TupleFactory.newTupleForSchema thing is weird... I went with a TupleFactory.getInstanceForSchema(Schema) approach and liked it a lot more. another question is what to do when the Schema can't be generated... one option is to just return a tuple, and another is to fail out. IMHO we should fail out, and require people to ensure it's generatable, but I can see the argument otherwise. In general, for things like this, I think it's better to fail early and explicitly than to let people think they have a special Tuple when they don't. Philosophies may differ.

          Summary
          -------

          This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge.

          Need to clean up the code and add tests.

          Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge.

          Needs tests and comments, but I want the code to settle a bit.

          This addresses bug PIG-2632.
          https://issues.apache.org/jira/browse/PIG-2632

          Diffs (updated)


          trunk/bin/pig 1310666
          trunk/build.xml 1310666
          trunk/ivy.xml 1310666
          trunk/ivy/libraries.properties 1310666
          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1310666
          trunk/src/org/apache/pig/data/BinInterSedes.java 1310666
          trunk/src/org/apache/pig/data/FieldIsNullException.java PRE-CREATION
          trunk/src/org/apache/pig/data/PrimitiveTuple.java 1310666
          trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION
          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java PRE-CREATION
          trunk/src/org/apache/pig/data/SchemaTupleFactory.java PRE-CREATION
          trunk/src/org/apache/pig/data/Tuple.java 1310666
          trunk/src/org/apache/pig/data/TupleFactory.java 1310666
          trunk/src/org/apache/pig/data/TypeAwareTuple.java 1310666
          trunk/src/org/apache/pig/data/utils/SedesHelper.java PRE-CREATION
          trunk/src/org/apache/pig/impl/PigContext.java 1310666
          trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1310666

          Diff: https://reviews.apache.org/r/4651/diff

          Testing
          -------

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/ ----------------------------------------------------------- (Updated 2012-04-08 07:19:58.096859) Review request for pig and Julien Le Dem. Changes ------- Julien, this incorporates many of your comments, but not all. Mainly, it has the refactoring of the code. A couple existant issues: The classloading is still janky. I'm not quite sure what the best approach is I need to figure out how to register the classes I generate in the jar manifest because of the way the code is generated, protected fields don't quite work. The code doesn't have a package, so only public methods are available. I marked the classes it dependended on as private, but I don't know if that is enough. If it's a big issue, I guess the next thing to do is to figure out how to generate code in a specific package of my choice, and ideally, how to generate the class in memory and add to the jar. And of course some finer points: I need to implement a raw comparator, etc But I'd like to know if the general new structure works. Of course it's definitely a big time work in progress, but the comments really help. Lastly, I'd like to know how this should interact with PrimitiveTuples. I still think there is a place for them (since SchemaTuples have to be generated on the front end but PrimitiveTuples do not), but the whole TupleFactory.newTupleForSchema thing is weird... I went with a TupleFactory.getInstanceForSchema(Schema) approach and liked it a lot more. another question is what to do when the Schema can't be generated... one option is to just return a tuple, and another is to fail out. IMHO we should fail out, and require people to ensure it's generatable, but I can see the argument otherwise. In general, for things like this, I think it's better to fail early and explicitly than to let people think they have a special Tuple when they don't. Philosophies may differ. Summary ------- This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge. Need to clean up the code and add tests. Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge. Needs tests and comments, but I want the code to settle a bit. This addresses bug PIG-2632 . https://issues.apache.org/jira/browse/PIG-2632 Diffs (updated) trunk/bin/pig 1310666 trunk/build.xml 1310666 trunk/ivy.xml 1310666 trunk/ivy/libraries.properties 1310666 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1310666 trunk/src/org/apache/pig/data/BinInterSedes.java 1310666 trunk/src/org/apache/pig/data/FieldIsNullException.java PRE-CREATION trunk/src/org/apache/pig/data/PrimitiveTuple.java 1310666 trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java PRE-CREATION trunk/src/org/apache/pig/data/SchemaTupleFactory.java PRE-CREATION trunk/src/org/apache/pig/data/Tuple.java 1310666 trunk/src/org/apache/pig/data/TupleFactory.java 1310666 trunk/src/org/apache/pig/data/TypeAwareTuple.java 1310666 trunk/src/org/apache/pig/data/utils/SedesHelper.java PRE-CREATION trunk/src/org/apache/pig/impl/PigContext.java 1310666 trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1310666 Diff: https://reviews.apache.org/r/4651/diff Testing ------- Thanks, Jonathan
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4651/
          -----------------------------------------------------------

          (Updated 2012-04-08 22:26:29.313675)

          Review request for pig and Julien Le Dem.

          Changes
          -------

          Attached is a version that passes ant test-commit. Nothing too huge, mainly some fixes in the generated code and some other cleanups.

          Summary
          -------

          This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge.

          Need to clean up the code and add tests.

          Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge.

          Needs tests and comments, but I want the code to settle a bit.

          This addresses bug PIG-2632.
          https://issues.apache.org/jira/browse/PIG-2632

          Diffs (updated)


          trunk/bin/pig 1310666
          trunk/build.xml 1310666
          trunk/ivy.xml 1310666
          trunk/ivy/libraries.properties 1310666
          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1310666
          trunk/src/org/apache/pig/data/BinInterSedes.java 1310666
          trunk/src/org/apache/pig/data/FieldIsNullException.java PRE-CREATION
          trunk/src/org/apache/pig/data/PrimitiveTuple.java 1310666
          trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION
          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java PRE-CREATION
          trunk/src/org/apache/pig/data/SchemaTupleFactory.java PRE-CREATION
          trunk/src/org/apache/pig/data/Tuple.java 1310666
          trunk/src/org/apache/pig/data/TupleFactory.java 1310666
          trunk/src/org/apache/pig/data/TypeAwareTuple.java 1310666
          trunk/src/org/apache/pig/data/utils/SedesHelper.java PRE-CREATION
          trunk/src/org/apache/pig/impl/PigContext.java 1310666
          trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1310666

          Diff: https://reviews.apache.org/r/4651/diff

          Testing
          -------

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/ ----------------------------------------------------------- (Updated 2012-04-08 22:26:29.313675) Review request for pig and Julien Le Dem. Changes ------- Attached is a version that passes ant test-commit. Nothing too huge, mainly some fixes in the generated code and some other cleanups. Summary ------- This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge. Need to clean up the code and add tests. Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge. Needs tests and comments, but I want the code to settle a bit. This addresses bug PIG-2632 . https://issues.apache.org/jira/browse/PIG-2632 Diffs (updated) trunk/bin/pig 1310666 trunk/build.xml 1310666 trunk/ivy.xml 1310666 trunk/ivy/libraries.properties 1310666 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1310666 trunk/src/org/apache/pig/data/BinInterSedes.java 1310666 trunk/src/org/apache/pig/data/FieldIsNullException.java PRE-CREATION trunk/src/org/apache/pig/data/PrimitiveTuple.java 1310666 trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java PRE-CREATION trunk/src/org/apache/pig/data/SchemaTupleFactory.java PRE-CREATION trunk/src/org/apache/pig/data/Tuple.java 1310666 trunk/src/org/apache/pig/data/TupleFactory.java 1310666 trunk/src/org/apache/pig/data/TypeAwareTuple.java 1310666 trunk/src/org/apache/pig/data/utils/SedesHelper.java PRE-CREATION trunk/src/org/apache/pig/impl/PigContext.java 1310666 trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1310666 Diff: https://reviews.apache.org/r/4651/diff Testing ------- Thanks, Jonathan
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4651/#review6784
          -----------------------------------------------------------

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
          <https://reviews.apache.org/r/4651/#comment15009>

          instead of checking here if the tuple is generatable, the factory could default to the regular TupleFactory if the generation fails.

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment15010>

          How complex is this class ? Not sure it is worth pulling the whole mahout just for this.

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment15011>

          append...() methods should not be null
          (part of your TODO list?)

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment15013>

          handle the case where other == null
          Here you get NullPointerException

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment15014>

          same comment

          trunk/src/org/apache/pig/data/SchemaTuple.java
          <https://reviews.apache.org/r/4651/#comment15015>

          If you override equals(), you should override hashCode(). 2 object that are equal must return the same hashcode

          trunk/src/org/apache/pig/data/TupleFactory.java
          <https://reviews.apache.org/r/4651/#comment15052>

          that's where we need a proper classloader

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15016>

          try avoiding accessing the PigContext statically. Add it as a parameter and see if it can be passed from somewhere it is actually known.

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15017>

          possibly we want to rename this and/or add something else for this file.

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15018>

          ?
          either get rid of the empty catch block or document why it is Ok.

          Here I don't see why it would be Ok that we can not instantiate that class

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15019>

          "org.apache.pig.generated." + classname ?

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15020>

          "org/apache/pig/generated/" + classname + ".class" ?

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15021>

          we should probably not compile in the current dir.
          Put it in the temp dir intead

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15022>

          this could be calling a method in the parent

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15023>

          this could be in the parent

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15024>

          stuff which is not calling a generated field directly should be pulled up.

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15025>

          same comment

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15026>

          same

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15027>

          this is wasteful. Maybe DataByteArray could provide a static compare(byte[], byte[]) ?

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15028>

          duplicated code

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15029>

          this can be pulled up

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15030>

          the exception handling can be pulled up

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15031>

          "setPos_" + fieldPos + "((Boolean)val);"

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15032>

          same with a mapping for the java object for the type

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15033>

          can be pulled up

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15034>

          you can just remove the call to box, it should work

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15035>

          this can be pulled up

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15036>

          can be pulled up

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15037>

          can be pulled up

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15038>

          pull up

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15039>

          pull up

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15040>

          pull up

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15041>

          pull up

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java
          <https://reviews.apache.org/r/4651/#comment15042>

          pull up

          trunk/src/org/apache/pig/data/SchemaTupleFactory.java
          <https://reviews.apache.org/r/4651/#comment15043>

          Ok, temporary

          trunk/src/org/apache/pig/data/SchemaTupleFactory.java
          <https://reviews.apache.org/r/4651/#comment15044>

          do we need this. Let's just fall back to regular tuple when generation fails, so that we don't need to keep those in sync

          trunk/src/org/apache/pig/data/SchemaTupleFactory.java
          <https://reviews.apache.org/r/4651/#comment15045>

          The factory could be generated as well so that we don't need to use reflection here.
          Reflection is slower to create new instances.

          trunk/src/org/apache/pig/data/SchemaTupleFactory.java
          <https://reviews.apache.org/r/4651/#comment15046>

          The parameter type is enough to differentiate.
          getSchemaTupleFactory(Schema schema)

          trunk/src/org/apache/pig/data/SchemaTupleFactory.java
          <https://reviews.apache.org/r/4651/#comment15047>

          same

          trunk/src/org/apache/pig/data/SchemaTupleFactory.java
          <https://reviews.apache.org/r/4651/#comment15048>

          getTupleClass()

          trunk/src/org/apache/pig/data/SchemaTupleFactory.java
          <https://reviews.apache.org/r/4651/#comment15049>

          getTupleClass()

          trunk/src/org/apache/pig/data/SchemaTupleFactory.java
          <https://reviews.apache.org/r/4651/#comment15050>

          getTupleClass()

          trunk/src/org/apache/pig/data/Tuple.java
          <https://reviews.apache.org/r/4651/#comment15051>

          typo

          • Julien

          On 2012-04-08 22:26:29, Jonathan Coveney wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/4651/

          -----------------------------------------------------------

          (Updated 2012-04-08 22:26:29)

          Review request for pig and Julien Le Dem.

          Summary

          -------

          This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge.

          Need to clean up the code and add tests.

          Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge.

          Needs tests and comments, but I want the code to settle a bit.

          This addresses bug PIG-2632.

          https://issues.apache.org/jira/browse/PIG-2632

          Diffs

          -----

          trunk/bin/pig 1310666

          trunk/build.xml 1310666

          trunk/ivy.xml 1310666

          trunk/ivy/libraries.properties 1310666

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1310666

          trunk/src/org/apache/pig/data/BinInterSedes.java 1310666

          trunk/src/org/apache/pig/data/FieldIsNullException.java PRE-CREATION

          trunk/src/org/apache/pig/data/PrimitiveTuple.java 1310666

          trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java PRE-CREATION

          trunk/src/org/apache/pig/data/SchemaTupleFactory.java PRE-CREATION

          trunk/src/org/apache/pig/data/Tuple.java 1310666

          trunk/src/org/apache/pig/data/TupleFactory.java 1310666

          trunk/src/org/apache/pig/data/TypeAwareTuple.java 1310666

          trunk/src/org/apache/pig/data/utils/SedesHelper.java PRE-CREATION

          trunk/src/org/apache/pig/impl/PigContext.java 1310666

          trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1310666

          Diff: https://reviews.apache.org/r/4651/diff

          Testing

          -------

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/#review6784 ----------------------------------------------------------- trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java < https://reviews.apache.org/r/4651/#comment15009 > instead of checking here if the tuple is generatable, the factory could default to the regular TupleFactory if the generation fails. trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment15010 > How complex is this class ? Not sure it is worth pulling the whole mahout just for this. trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment15011 > append...() methods should not be null (part of your TODO list?) trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment15013 > handle the case where other == null Here you get NullPointerException trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment15014 > same comment trunk/src/org/apache/pig/data/SchemaTuple.java < https://reviews.apache.org/r/4651/#comment15015 > If you override equals(), you should override hashCode(). 2 object that are equal must return the same hashcode trunk/src/org/apache/pig/data/TupleFactory.java < https://reviews.apache.org/r/4651/#comment15052 > that's where we need a proper classloader trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15016 > try avoiding accessing the PigContext statically. Add it as a parameter and see if it can be passed from somewhere it is actually known. trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15017 > possibly we want to rename this and/or add something else for this file. trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15018 > ? either get rid of the empty catch block or document why it is Ok. Here I don't see why it would be Ok that we can not instantiate that class trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15019 > "org.apache.pig.generated." + classname ? trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15020 > "org/apache/pig/generated/" + classname + ".class" ? trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15021 > we should probably not compile in the current dir. Put it in the temp dir intead trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15022 > this could be calling a method in the parent trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15023 > this could be in the parent trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15024 > stuff which is not calling a generated field directly should be pulled up. trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15025 > same comment trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15026 > same trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15027 > this is wasteful. Maybe DataByteArray could provide a static compare(byte[], byte[]) ? trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15028 > duplicated code trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15029 > this can be pulled up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15030 > the exception handling can be pulled up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15031 > "setPos_" + fieldPos + "((Boolean)val);" trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15032 > same with a mapping for the java object for the type trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15033 > can be pulled up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15034 > you can just remove the call to box, it should work trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15035 > this can be pulled up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15036 > can be pulled up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15037 > can be pulled up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15038 > pull up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15039 > pull up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15040 > pull up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15041 > pull up trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java < https://reviews.apache.org/r/4651/#comment15042 > pull up trunk/src/org/apache/pig/data/SchemaTupleFactory.java < https://reviews.apache.org/r/4651/#comment15043 > Ok, temporary trunk/src/org/apache/pig/data/SchemaTupleFactory.java < https://reviews.apache.org/r/4651/#comment15044 > do we need this. Let's just fall back to regular tuple when generation fails, so that we don't need to keep those in sync trunk/src/org/apache/pig/data/SchemaTupleFactory.java < https://reviews.apache.org/r/4651/#comment15045 > The factory could be generated as well so that we don't need to use reflection here. Reflection is slower to create new instances. trunk/src/org/apache/pig/data/SchemaTupleFactory.java < https://reviews.apache.org/r/4651/#comment15046 > The parameter type is enough to differentiate. getSchemaTupleFactory(Schema schema) trunk/src/org/apache/pig/data/SchemaTupleFactory.java < https://reviews.apache.org/r/4651/#comment15047 > same trunk/src/org/apache/pig/data/SchemaTupleFactory.java < https://reviews.apache.org/r/4651/#comment15048 > getTupleClass() trunk/src/org/apache/pig/data/SchemaTupleFactory.java < https://reviews.apache.org/r/4651/#comment15049 > getTupleClass() trunk/src/org/apache/pig/data/SchemaTupleFactory.java < https://reviews.apache.org/r/4651/#comment15050 > getTupleClass() trunk/src/org/apache/pig/data/Tuple.java < https://reviews.apache.org/r/4651/#comment15051 > typo Julien On 2012-04-08 22:26:29, Jonathan Coveney wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/ ----------------------------------------------------------- (Updated 2012-04-08 22:26:29) Review request for pig and Julien Le Dem. Summary ------- This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge. Need to clean up the code and add tests. Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge. Needs tests and comments, but I want the code to settle a bit. This addresses bug PIG-2632 . https://issues.apache.org/jira/browse/PIG-2632 Diffs ----- trunk/bin/pig 1310666 trunk/build.xml 1310666 trunk/ivy.xml 1310666 trunk/ivy/libraries.properties 1310666 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1310666 trunk/src/org/apache/pig/data/BinInterSedes.java 1310666 trunk/src/org/apache/pig/data/FieldIsNullException.java PRE-CREATION trunk/src/org/apache/pig/data/PrimitiveTuple.java 1310666 trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java PRE-CREATION trunk/src/org/apache/pig/data/SchemaTupleFactory.java PRE-CREATION trunk/src/org/apache/pig/data/Tuple.java 1310666 trunk/src/org/apache/pig/data/TupleFactory.java 1310666 trunk/src/org/apache/pig/data/TypeAwareTuple.java 1310666 trunk/src/org/apache/pig/data/utils/SedesHelper.java PRE-CREATION trunk/src/org/apache/pig/impl/PigContext.java 1310666 trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1310666 Diff: https://reviews.apache.org/r/4651/diff Testing ------- Thanks, Jonathan
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          >

          Thanks again for taking a look Julien! I think we are close to having the structure where we want it, there are just a couple of outstanding higher level points:

          • How should PigContext be referenced?
          • How should the intermediate compiled data be handle?
          • What should the semantics of getting a SchemaTupleFactory be?
          • Generating the SchemaTupleFactory so it can directly instantiate the given SchemaTuple
          • What should and shouldn't be in the generated code

          And probably some more. I responded below, and hopefully we can keep honing in on something that looks good

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, lines 135-136

          > <https://reviews.apache.org/r/4651/diff/1-3/?file=100075#file100075line135>

          >

          > instead of checking here if the tuple is generatable, the factory could default to the regular TupleFactory if the generation fails.

          I don't know that I like that semantic though. In this case, in the client code it's clear that they know that they either are or are not getting a SchemaTupleFactory... if we default to a regular TupleFactory, then we might think we are getting a SchemaTuple (and all the speed and memory benefits) when we actually aren't. There are two cases where we would ask for a SchemaTupleFactory and it wouldn't be available: a Pig bug in our implementation, or user error. Shouldn't pig fail out intelligently instead of giving them a Factory that doesn't do what they think?

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 31

          > <https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line31>

          >

          > How complex is this class ? Not sure it is worth pulling the whole mahout just for this.

          It's very not complicated. I'll put the logic into SedesHelper.

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 84

          > <https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line84>

          >

          > append...() methods should not be null

          > (part of your TODO list?)

          Not sure what you mean here

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 550

          > <https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line550>

          >

          > handle the case where other == null

          > Here you get NullPointerException

          good call, will fix everywhere

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 662

          > <https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line662>

          >

          > If you override equals(), you should override hashCode(). 2 object that are equal must return the same hashcode

          whoops, I thought I had. will do

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 64

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line64>

          >

          > try avoiding accessing the PigContext statically. Add it as a parameter and see if it can be passed from somewhere it is actually known.

          In the pig code that I looked at, this seemed by far the more common idiom (not that I love it, but it's pretty hard to get the PigContext in there). I can't really find a single example of code in this sort of path that doesn't get the PigContext statically. I could rewire Pig to try and make this possible, but do you think that it is worth it/do you see any examples in the code base that I missed where PigContext isn't called statically in this sort of case? There are some examples of more purely front end things, and that's how I'd pipe it in I suppose, but it seems like a big change for something that barely touches the codebase. I do think a JIRA that tackles this issue in pig is reasonable though... trying to decrease the amount of statically referenced jank. Await your thoughts.

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 71

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line71>

          >

          > possibly we want to rename this and/or add something else for this file.

          Question about class files: does a class file have to have the same name as the class it contains? I know this is true for .java files with the java compiler, but not sure if it is true of class files. If it is true of class files, then we can't muck with the name. If it isn't, then ok.

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, lines 86-87

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line86>

          >

          > ?

          > either get rid of the empty catch block or document why it is Ok.

          >

          > Here I don't see why it would be Ok that we can not instantiate that class

          Hmm, the idea was that if you couldn't instantiate you'd regenerate, but looking at it now you're right; it is odd that if we think we have the class, that we cannot instantiate it. I'll throw a runtime error.

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 127

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line127>

          >

          > "org.apache.pig.generated." + classname ?

          Java question: if we're managing the compilation, does the compiled class have to be in org/apache/pig/generated (or whatever) in the file directory?

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 138

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line138>

          >

          > "org/apache/pig/generated/" + classname + ".class" ?

          What's the win here? Or just general cleanliness?

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 140

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line140>

          >

          > we should probably not compile in the current dir.

          > Put it in the temp dir intead

          I will see if this is possible. Do you know if Pig has an official local temp directory? I found an hdfs temp directory flag, but couldn't find a local one.

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 322

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line322>

          >

          > this is wasteful. Maybe DataByteArray could provide a static compare(byte[], byte[]) ?

          Lol, I checked and it exists. Will use.

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, lines 265-269

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line265>

          >

          > stuff which is not calling a generated field directly should be pulled up.

          I don't know that I agree with this. Then we're going to have a bunch of really random methods that have no context getting called for no real reason (I guess so that the compiler will run it's checks instead of getting an error at compile time)? Some of the logic makes more sense to do this with (the comparison of appends above, for example), but do you think the win for highly specific pieces of code like this is worth it?

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, lines 392-396

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line392>

          >

          > the exception handling can be pulled up

          Same as above. Should it really be our goal to move everything humanly possible into SchemaTuple?

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 518

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line518>

          >

          > same with a mapping for the java object for the type

          Not sure what you want the code to look like here. If you want to explicitly do all the casts, it will make the generated code bigger, and would seem to contradict your suggestion to call as much from SchemaTuple as possible?

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 559

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line559>

          >

          > you can just remove the call to box, it should work

          Good call

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleFactory.java, line 42

          > <https://reviews.apache.org/r/4651/diff/3/?file=100953#file100953line42>

          >

          > do we need this. Let's just fall back to regular tuple when generation fails, so that we don't need to keep those in sync

          See above.

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleFactory.java, line 59

          > <https://reviews.apache.org/r/4651/diff/3/?file=100953#file100953line59>

          >

          > The factory could be generated as well so that we don't need to use reflection here.

          > Reflection is slower to create new instances.

          I didn't realize that doing Class.newInstance was slower than direct creation. I'll generate it.

          • Jonathan

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4651/#review6784
          -----------------------------------------------------------

          On 2012-04-08 22:26:29, Jonathan Coveney wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/4651/

          -----------------------------------------------------------

          (Updated 2012-04-08 22:26:29)

          Review request for pig and Julien Le Dem.

          Summary

          -------

          This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge.

          Need to clean up the code and add tests.

          Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge.

          Needs tests and comments, but I want the code to settle a bit.

          This addresses bug PIG-2632.

          https://issues.apache.org/jira/browse/PIG-2632

          Diffs

          -----

          trunk/bin/pig 1310666

          trunk/build.xml 1310666

          trunk/ivy.xml 1310666

          trunk/ivy/libraries.properties 1310666

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1310666

          trunk/src/org/apache/pig/data/BinInterSedes.java 1310666

          trunk/src/org/apache/pig/data/FieldIsNullException.java PRE-CREATION

          trunk/src/org/apache/pig/data/PrimitiveTuple.java 1310666

          trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java PRE-CREATION

          trunk/src/org/apache/pig/data/SchemaTupleFactory.java PRE-CREATION

          trunk/src/org/apache/pig/data/Tuple.java 1310666

          trunk/src/org/apache/pig/data/TupleFactory.java 1310666

          trunk/src/org/apache/pig/data/TypeAwareTuple.java 1310666

          trunk/src/org/apache/pig/data/utils/SedesHelper.java PRE-CREATION

          trunk/src/org/apache/pig/impl/PigContext.java 1310666

          trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1310666

          Diff: https://reviews.apache.org/r/4651/diff

          Testing

          -------

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - On 2012-04-09 05:34:39, Julien Le Dem wrote: > Thanks again for taking a look Julien! I think we are close to having the structure where we want it, there are just a couple of outstanding higher level points: How should PigContext be referenced? How should the intermediate compiled data be handle? What should the semantics of getting a SchemaTupleFactory be? Generating the SchemaTupleFactory so it can directly instantiate the given SchemaTuple What should and shouldn't be in the generated code And probably some more. I responded below, and hopefully we can keep honing in on something that looks good On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, lines 135-136 > < https://reviews.apache.org/r/4651/diff/1-3/?file=100075#file100075line135 > > > instead of checking here if the tuple is generatable, the factory could default to the regular TupleFactory if the generation fails. I don't know that I like that semantic though. In this case, in the client code it's clear that they know that they either are or are not getting a SchemaTupleFactory... if we default to a regular TupleFactory, then we might think we are getting a SchemaTuple (and all the speed and memory benefits) when we actually aren't. There are two cases where we would ask for a SchemaTupleFactory and it wouldn't be available: a Pig bug in our implementation, or user error. Shouldn't pig fail out intelligently instead of giving them a Factory that doesn't do what they think? On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 31 > < https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line31 > > > How complex is this class ? Not sure it is worth pulling the whole mahout just for this. It's very not complicated. I'll put the logic into SedesHelper. On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 84 > < https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line84 > > > append...() methods should not be null > (part of your TODO list?) Not sure what you mean here On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 550 > < https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line550 > > > handle the case where other == null > Here you get NullPointerException good call, will fix everywhere On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 662 > < https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line662 > > > If you override equals(), you should override hashCode(). 2 object that are equal must return the same hashcode whoops, I thought I had. will do On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 64 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line64 > > > try avoiding accessing the PigContext statically. Add it as a parameter and see if it can be passed from somewhere it is actually known. In the pig code that I looked at, this seemed by far the more common idiom (not that I love it, but it's pretty hard to get the PigContext in there). I can't really find a single example of code in this sort of path that doesn't get the PigContext statically. I could rewire Pig to try and make this possible, but do you think that it is worth it/do you see any examples in the code base that I missed where PigContext isn't called statically in this sort of case? There are some examples of more purely front end things, and that's how I'd pipe it in I suppose, but it seems like a big change for something that barely touches the codebase. I do think a JIRA that tackles this issue in pig is reasonable though... trying to decrease the amount of statically referenced jank. Await your thoughts. On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 71 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line71 > > > possibly we want to rename this and/or add something else for this file. Question about class files: does a class file have to have the same name as the class it contains? I know this is true for .java files with the java compiler, but not sure if it is true of class files. If it is true of class files, then we can't muck with the name. If it isn't, then ok. On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, lines 86-87 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line86 > > > ? > either get rid of the empty catch block or document why it is Ok. > > Here I don't see why it would be Ok that we can not instantiate that class Hmm, the idea was that if you couldn't instantiate you'd regenerate, but looking at it now you're right; it is odd that if we think we have the class, that we cannot instantiate it. I'll throw a runtime error. On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 127 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line127 > > > "org.apache.pig.generated." + classname ? Java question: if we're managing the compilation, does the compiled class have to be in org/apache/pig/generated (or whatever) in the file directory? On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 138 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line138 > > > "org/apache/pig/generated/" + classname + ".class" ? What's the win here? Or just general cleanliness? On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 140 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line140 > > > we should probably not compile in the current dir. > Put it in the temp dir intead I will see if this is possible. Do you know if Pig has an official local temp directory? I found an hdfs temp directory flag, but couldn't find a local one. On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 322 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line322 > > > this is wasteful. Maybe DataByteArray could provide a static compare(byte[], byte[]) ? Lol, I checked and it exists. Will use. On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, lines 265-269 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line265 > > > stuff which is not calling a generated field directly should be pulled up. I don't know that I agree with this. Then we're going to have a bunch of really random methods that have no context getting called for no real reason (I guess so that the compiler will run it's checks instead of getting an error at compile time)? Some of the logic makes more sense to do this with (the comparison of appends above, for example), but do you think the win for highly specific pieces of code like this is worth it? On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, lines 392-396 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line392 > > > the exception handling can be pulled up Same as above. Should it really be our goal to move everything humanly possible into SchemaTuple? On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 518 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line518 > > > same with a mapping for the java object for the type Not sure what you want the code to look like here. If you want to explicitly do all the casts, it will make the generated code bigger, and would seem to contradict your suggestion to call as much from SchemaTuple as possible? On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 559 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line559 > > > you can just remove the call to box, it should work Good call On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleFactory.java, line 42 > < https://reviews.apache.org/r/4651/diff/3/?file=100953#file100953line42 > > > do we need this. Let's just fall back to regular tuple when generation fails, so that we don't need to keep those in sync See above. On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleFactory.java, line 59 > < https://reviews.apache.org/r/4651/diff/3/?file=100953#file100953line59 > > > The factory could be generated as well so that we don't need to use reflection here. > Reflection is slower to create new instances. I didn't realize that doing Class.newInstance was slower than direct creation. I'll generate it. Jonathan ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/#review6784 ----------------------------------------------------------- On 2012-04-08 22:26:29, Jonathan Coveney wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/ ----------------------------------------------------------- (Updated 2012-04-08 22:26:29) Review request for pig and Julien Le Dem. Summary ------- This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge. Need to clean up the code and add tests. Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge. Needs tests and comments, but I want the code to settle a bit. This addresses bug PIG-2632 . https://issues.apache.org/jira/browse/PIG-2632 Diffs ----- trunk/bin/pig 1310666 trunk/build.xml 1310666 trunk/ivy.xml 1310666 trunk/ivy/libraries.properties 1310666 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1310666 trunk/src/org/apache/pig/data/BinInterSedes.java 1310666 trunk/src/org/apache/pig/data/FieldIsNullException.java PRE-CREATION trunk/src/org/apache/pig/data/PrimitiveTuple.java 1310666 trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java PRE-CREATION trunk/src/org/apache/pig/data/SchemaTupleFactory.java PRE-CREATION trunk/src/org/apache/pig/data/Tuple.java 1310666 trunk/src/org/apache/pig/data/TupleFactory.java 1310666 trunk/src/org/apache/pig/data/TypeAwareTuple.java 1310666 trunk/src/org/apache/pig/data/utils/SedesHelper.java PRE-CREATION trunk/src/org/apache/pig/impl/PigContext.java 1310666 trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1310666 Diff: https://reviews.apache.org/r/4651/diff Testing ------- Thanks, Jonathan
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          >

          Jonathan Coveney wrote:

          Thanks again for taking a look Julien! I think we are close to having the structure where we want it, there are just a couple of outstanding higher level points:

          - How should PigContext be referenced?

          - How should the intermediate compiled data be handle?

          - What should the semantics of getting a SchemaTupleFactory be?

          - Generating the SchemaTupleFactory so it can directly instantiate the given SchemaTuple

          - What should and shouldn't be in the generated code

          And probably some more. I responded below, and hopefully we can keep honing in on something that looks good

          • How should PigContext be referenced?
            We can leave it as is for now. It sounds like it's too big a refactor to be done in this patch
          • How should the intermediate compiled data be handle?
            For generated java code, I would say that the classes are generated in a temp directory and we use a URLClassLoader on the Frontend to access them. On the backend the class are just added to the job jar.
            Once we move to generated bytecode then don't need to write the bytecode to disk and can just define() classes in a custom classloader.
            In both cases if we don't hold a reference to the classloader or the classes they will be collected once we are done using them.
          • What should the semantics of getting a SchemaTupleFactory be?
            There are multiple places this could be used:
          • LoadFunc and EvalFunc: possibly through a getOutputTupleFactory() added to the base classes.
          • Spill of bags
          • Pig intermediary storage.
          • Generating the SchemaTupleFactory so it can directly instantiate the given SchemaTuple
            The only thing you need is a subclass with an instantiate() implementation that directly calls the constructor.
          • What should and shouldn't be in the generated code
            The generated code should only have the parts that depend on "field_"+i and related methods and switch statements. Everything else should be in the parent. the default: cases of the switch statements can call super.set...() to handle the default case with the "append" tuple.
            This will make it easier to switch to bytecode generation.

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, lines 135-136

          > <https://reviews.apache.org/r/4651/diff/1-3/?file=100075#file100075line135>

          >

          > instead of checking here if the tuple is generatable, the factory could default to the regular TupleFactory if the generation fails.

          Jonathan Coveney wrote:

          I don't know that I like that semantic though. In this case, in the client code it's clear that they know that they either are or are not getting a SchemaTupleFactory... if we default to a regular TupleFactory, then we might think we are getting a SchemaTuple (and all the speed and memory benefits) when we actually aren't. There are two cases where we would ask for a SchemaTupleFactory and it wouldn't be available: a Pig bug in our implementation, or user error. Shouldn't pig fail out intelligently instead of giving them a Factory that doesn't do what they think?

          It depends how the generated factory gets used. My idea was to make sure we don't break existing scripts and to have a fallback solution in case of problems. But if you have to explicitly use the new factory then maybe it is fine.

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTuple.java, line 84

          > <https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line84>

          >

          > append...() methods should not be null

          > (part of your TODO list?)

          Jonathan Coveney wrote:

          Not sure what you mean here

          Sorry, I meant: should not be public

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 64

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line64>

          >

          > try avoiding accessing the PigContext statically. Add it as a parameter and see if it can be passed from somewhere it is actually known.

          Jonathan Coveney wrote:

          In the pig code that I looked at, this seemed by far the more common idiom (not that I love it, but it's pretty hard to get the PigContext in there). I can't really find a single example of code in this sort of path that doesn't get the PigContext statically. I could rewire Pig to try and make this possible, but do you think that it is worth it/do you see any examples in the code base that I missed where PigContext isn't called statically in this sort of case? There are some examples of more purely front end things, and that's how I'd pipe it in I suppose, but it seems like a big change for something that barely touches the codebase. I do think a JIRA that tackles this issue in pig is reasonable though... trying to decrease the amount of statically referenced jank. Await your thoughts.

          If it requires too much refactoring let's not do it in this patch

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 71

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line71>

          >

          > possibly we want to rename this and/or add something else for this file.

          Jonathan Coveney wrote:

          Question about class files: does a class file have to have the same name as the class it contains? I know this is true for .java files with the java compiler, but not sure if it is true of class files. If it is true of class files, then we can't muck with the name. If it isn't, then ok.

          yes the .class file should have the same name as the class it contains and the package name should match the path to the file.

          if addScriptFile() is not specific to scripts it should be renamed.

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 127

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line127>

          >

          > "org.apache.pig.generated." + classname ?

          Jonathan Coveney wrote:

          Java question: if we're managing the compilation, does the compiled class have to be in org/apache/pig/generated (or whatever) in the file directory?

          yes, the relative path to the class from the root given to the classloader (path or jar) should match the package of the class.

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 138

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line138>

          >

          > "org/apache/pig/generated/" + classname + ".class" ?

          Jonathan Coveney wrote:

          What's the win here? Or just general cleanliness?

          So that you have your own namespace to avoid conflicts and so that package protected members are visible.
          Actually, this should be the same package as your base class.

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 140

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line140>

          >

          > we should probably not compile in the current dir.

          > Put it in the temp dir intead

          Jonathan Coveney wrote:

          I will see if this is possible. Do you know if Pig has an official local temp directory? I found an hdfs temp directory flag, but couldn't find a local one.

          You can ask java to give you temporary files. You can reuse the path for a temporary directory.

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, lines 265-269

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line265>

          >

          > stuff which is not calling a generated field directly should be pulled up.

          Jonathan Coveney wrote:

          I don't know that I agree with this. Then we're going to have a bunch of really random methods that have no context getting called for no real reason (I guess so that the compiler will run it's checks instead of getting an error at compile time)? Some of the logic makes more sense to do this with (the comparison of appends above, for example), but do you think the win for highly specific pieces of code like this is worth it?

          The idea is to make it a simple as possible for when we switch to bytecode generation. Agreed they should have meaningful names so that could be on a case by case basis. Use your best judgement

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, lines 392-396

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line392>

          >

          > the exception handling can be pulled up

          Jonathan Coveney wrote:

          Same as above. Should it really be our goal to move everything humanly possible into SchemaTuple?

          It will make bytecode generation easier. If it gets a little too extreme we can make exceptions.

          On 2012-04-09 05:34:39, Julien Le Dem wrote:

          > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 518

          > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line518>

          >

          > same with a mapping for the java object for the type

          Jonathan Coveney wrote:

          Not sure what you want the code to look like here. If you want to explicitly do all the casts, it will make the generated code bigger, and would seem to contradict your suggestion to call as much from SchemaTuple as possible?

          The unbox trick will not work in bytecode generation as you will have to explicitly provide the type.

          • Julien

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/4651/#review6784
          -----------------------------------------------------------

          On 2012-04-08 22:26:29, Jonathan Coveney wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/4651/

          -----------------------------------------------------------

          (Updated 2012-04-08 22:26:29)

          Review request for pig and Julien Le Dem.

          Summary

          -------

          This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge.

          Need to clean up the code and add tests.

          Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge.

          Needs tests and comments, but I want the code to settle a bit.

          This addresses bug PIG-2632.

          https://issues.apache.org/jira/browse/PIG-2632

          Diffs

          -----

          trunk/bin/pig 1310666

          trunk/build.xml 1310666

          trunk/ivy.xml 1310666

          trunk/ivy/libraries.properties 1310666

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1310666

          trunk/src/org/apache/pig/data/BinInterSedes.java 1310666

          trunk/src/org/apache/pig/data/FieldIsNullException.java PRE-CREATION

          trunk/src/org/apache/pig/data/PrimitiveTuple.java 1310666

          trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION

          trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java PRE-CREATION

          trunk/src/org/apache/pig/data/SchemaTupleFactory.java PRE-CREATION

          trunk/src/org/apache/pig/data/Tuple.java 1310666

          trunk/src/org/apache/pig/data/TupleFactory.java 1310666

          trunk/src/org/apache/pig/data/TypeAwareTuple.java 1310666

          trunk/src/org/apache/pig/data/utils/SedesHelper.java PRE-CREATION

          trunk/src/org/apache/pig/impl/PigContext.java 1310666

          trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1310666

          Diff: https://reviews.apache.org/r/4651/diff

          Testing

          -------

          Thanks,

          Jonathan

          Show
          jiraposter@reviews.apache.org added a comment - On 2012-04-09 05:34:39, Julien Le Dem wrote: > Jonathan Coveney wrote: Thanks again for taking a look Julien! I think we are close to having the structure where we want it, there are just a couple of outstanding higher level points: - How should PigContext be referenced? - How should the intermediate compiled data be handle? - What should the semantics of getting a SchemaTupleFactory be? - Generating the SchemaTupleFactory so it can directly instantiate the given SchemaTuple - What should and shouldn't be in the generated code And probably some more. I responded below, and hopefully we can keep honing in on something that looks good How should PigContext be referenced? We can leave it as is for now. It sounds like it's too big a refactor to be done in this patch How should the intermediate compiled data be handle? For generated java code, I would say that the classes are generated in a temp directory and we use a URLClassLoader on the Frontend to access them. On the backend the class are just added to the job jar. Once we move to generated bytecode then don't need to write the bytecode to disk and can just define() classes in a custom classloader. In both cases if we don't hold a reference to the classloader or the classes they will be collected once we are done using them. What should the semantics of getting a SchemaTupleFactory be? There are multiple places this could be used: LoadFunc and EvalFunc: possibly through a getOutputTupleFactory() added to the base classes. Spill of bags Pig intermediary storage. Generating the SchemaTupleFactory so it can directly instantiate the given SchemaTuple The only thing you need is a subclass with an instantiate() implementation that directly calls the constructor. What should and shouldn't be in the generated code The generated code should only have the parts that depend on "field_"+i and related methods and switch statements. Everything else should be in the parent. the default: cases of the switch statements can call super.set...() to handle the default case with the "append" tuple. This will make it easier to switch to bytecode generation. On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java, lines 135-136 > < https://reviews.apache.org/r/4651/diff/1-3/?file=100075#file100075line135 > > > instead of checking here if the tuple is generatable, the factory could default to the regular TupleFactory if the generation fails. Jonathan Coveney wrote: I don't know that I like that semantic though. In this case, in the client code it's clear that they know that they either are or are not getting a SchemaTupleFactory... if we default to a regular TupleFactory, then we might think we are getting a SchemaTuple (and all the speed and memory benefits) when we actually aren't. There are two cases where we would ask for a SchemaTupleFactory and it wouldn't be available: a Pig bug in our implementation, or user error. Shouldn't pig fail out intelligently instead of giving them a Factory that doesn't do what they think? It depends how the generated factory gets used. My idea was to make sure we don't break existing scripts and to have a fallback solution in case of problems. But if you have to explicitly use the new factory then maybe it is fine. On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTuple.java, line 84 > < https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line84 > > > append...() methods should not be null > (part of your TODO list?) Jonathan Coveney wrote: Not sure what you mean here Sorry, I meant: should not be public On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 64 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line64 > > > try avoiding accessing the PigContext statically. Add it as a parameter and see if it can be passed from somewhere it is actually known. Jonathan Coveney wrote: In the pig code that I looked at, this seemed by far the more common idiom (not that I love it, but it's pretty hard to get the PigContext in there). I can't really find a single example of code in this sort of path that doesn't get the PigContext statically. I could rewire Pig to try and make this possible, but do you think that it is worth it/do you see any examples in the code base that I missed where PigContext isn't called statically in this sort of case? There are some examples of more purely front end things, and that's how I'd pipe it in I suppose, but it seems like a big change for something that barely touches the codebase. I do think a JIRA that tackles this issue in pig is reasonable though... trying to decrease the amount of statically referenced jank. Await your thoughts. If it requires too much refactoring let's not do it in this patch On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 71 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line71 > > > possibly we want to rename this and/or add something else for this file. Jonathan Coveney wrote: Question about class files: does a class file have to have the same name as the class it contains? I know this is true for .java files with the java compiler, but not sure if it is true of class files. If it is true of class files, then we can't muck with the name. If it isn't, then ok. yes the .class file should have the same name as the class it contains and the package name should match the path to the file. if addScriptFile() is not specific to scripts it should be renamed. On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 127 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line127 > > > "org.apache.pig.generated." + classname ? Jonathan Coveney wrote: Java question: if we're managing the compilation, does the compiled class have to be in org/apache/pig/generated (or whatever) in the file directory? yes, the relative path to the class from the root given to the classloader (path or jar) should match the package of the class. On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 138 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line138 > > > "org/apache/pig/generated/" + classname + ".class" ? Jonathan Coveney wrote: What's the win here? Or just general cleanliness? So that you have your own namespace to avoid conflicts and so that package protected members are visible. Actually, this should be the same package as your base class. On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 140 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line140 > > > we should probably not compile in the current dir. > Put it in the temp dir intead Jonathan Coveney wrote: I will see if this is possible. Do you know if Pig has an official local temp directory? I found an hdfs temp directory flag, but couldn't find a local one. You can ask java to give you temporary files. You can reuse the path for a temporary directory. On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, lines 265-269 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line265 > > > stuff which is not calling a generated field directly should be pulled up. Jonathan Coveney wrote: I don't know that I agree with this. Then we're going to have a bunch of really random methods that have no context getting called for no real reason (I guess so that the compiler will run it's checks instead of getting an error at compile time)? Some of the logic makes more sense to do this with (the comparison of appends above, for example), but do you think the win for highly specific pieces of code like this is worth it? The idea is to make it a simple as possible for when we switch to bytecode generation. Agreed they should have meaningful names so that could be on a case by case basis. Use your best judgement On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, lines 392-396 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line392 > > > the exception handling can be pulled up Jonathan Coveney wrote: Same as above. Should it really be our goal to move everything humanly possible into SchemaTuple? It will make bytecode generation easier. If it gets a little too extreme we can make exceptions. On 2012-04-09 05:34:39, Julien Le Dem wrote: > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 518 > < https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line518 > > > same with a mapping for the java object for the type Jonathan Coveney wrote: Not sure what you want the code to look like here. If you want to explicitly do all the casts, it will make the generated code bigger, and would seem to contradict your suggestion to call as much from SchemaTuple as possible? The unbox trick will not work in bytecode generation as you will have to explicitly provide the type. Julien ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/#review6784 ----------------------------------------------------------- On 2012-04-08 22:26:29, Jonathan Coveney wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4651/ ----------------------------------------------------------- (Updated 2012-04-08 22:26:29) Review request for pig and Julien Le Dem. Summary ------- This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge. Need to clean up the code and add tests. Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge. Needs tests and comments, but I want the code to settle a bit. This addresses bug PIG-2632 . https://issues.apache.org/jira/browse/PIG-2632 Diffs ----- trunk/bin/pig 1310666 trunk/build.xml 1310666 trunk/ivy.xml 1310666 trunk/ivy/libraries.properties 1310666 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1310666 trunk/src/org/apache/pig/data/BinInterSedes.java 1310666 trunk/src/org/apache/pig/data/FieldIsNullException.java PRE-CREATION trunk/src/org/apache/pig/data/PrimitiveTuple.java 1310666 trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java PRE-CREATION trunk/src/org/apache/pig/data/SchemaTupleFactory.java PRE-CREATION trunk/src/org/apache/pig/data/Tuple.java 1310666 trunk/src/org/apache/pig/data/TupleFactory.java 1310666 trunk/src/org/apache/pig/data/TypeAwareTuple.java 1310666 trunk/src/org/apache/pig/data/utils/SedesHelper.java PRE-CREATION trunk/src/org/apache/pig/impl/PigContext.java 1310666 trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1310666 Diff: https://reviews.apache.org/r/4651/diff Testing ------- Thanks, Jonathan
          Hide
          Dmitriy V. Ryaboy added a comment -

          you know guys,
          at this point, I am starting to think we might as well generate auxillary methods like getFieldByName(String fieldName) .. and the whole thing starts looking a lot like protocol buffers.

          Remind me – why aren't we just using dynamic protocol buffers?

          Show
          Dmitriy V. Ryaboy added a comment - you know guys, at this point, I am starting to think we might as well generate auxillary methods like getFieldByName(String fieldName) .. and the whole thing starts looking a lot like protocol buffers. Remind me – why aren't we just using dynamic protocol buffers?
          Hide
          Dmitriy V. Ryaboy added a comment -

          Before Jon kills me: one reason we aren't doing dynamic protobufs is that they might be slower, not having structures optimized for the represented schemas. Generated protobufs might be as efficient, or more so, but need the protoc binary installed, which is clearly too heavy a requirement.

          Show
          Dmitriy V. Ryaboy added a comment - Before Jon kills me: one reason we aren't doing dynamic protobufs is that they might be slower, not having structures optimized for the represented schemas. Generated protobufs might be as efficient, or more so, but need the protoc binary installed, which is clearly too heavy a requirement.
          Hide
          Jonathan Coveney added a comment -

          Haha, I won't kill anyone if we find a better approach. I've learned a ton in the implementation, and I think that it has gotten some momentum in order to help optimize pig.

          I assume protoc has some whacky license that makes it unreasonable? Or is it just a jar size thing?

          Show
          Jonathan Coveney added a comment - Haha, I won't kill anyone if we find a better approach. I've learned a ton in the implementation, and I think that it has gotten some momentum in order to help optimize pig. I assume protoc has some whacky license that makes it unreasonable? Or is it just a jar size thing?
          Hide
          Jonathan Coveney added a comment -

          One last note: implementing getfield(String) would be trivial at this point. The only reason I haven't done so yet is because right now there is a debate about how strict we should be w.r.t. Schema -> SchemaTuple (ie should "a:int,b:int" really generate a different class than "x:int,y:int"? maybe!), but it would be very easy to add.

          But I'm open to suggestions about using another route.

          Show
          Jonathan Coveney added a comment - One last note: implementing getfield(String) would be trivial at this point. The only reason I haven't done so yet is because right now there is a debate about how strict we should be w.r.t. Schema -> SchemaTuple (ie should "a:int,b:int" really generate a different class than "x:int,y:int"? maybe!), but it would be very easy to add. But I'm open to suggestions about using another route.
          Hide
          Dmitriy V. Ryaboy added a comment -

          that's a good point, actually. What would we get from not making them different classes? I mean, if we were actually writing these classes and maintaining them in the codebase, the gain is obvious. But when they are generated and discarded on the fly.. why not?

          Show
          Dmitriy V. Ryaboy added a comment - that's a good point, actually. What would we get from not making them different classes? I mean, if we were actually writing these classes and maintaining them in the codebase, the gain is obvious. But when they are generated and discarded on the fly.. why not?
          Hide
          Scott Carey added a comment -

          should "a:int,b:int" really generate a different class than "x:int,y:int"? maybe!

          They could be the same class, but both support getField(String) if the constructor took a reference to the Schema that has the field name details or a datastructure that maps fields names to indexes. That isn't trivial though, and again its something that Avro could give you for free.

          I think ironing out the kinks here before going for extra features is the way to go, we can leverage other tools for advanced features in a later version.

          I will review more late this week or next week.

          import org.apache.mahout.math.Varint;
          

          I am not a fan of requiring another library on my already crowded hadoop classpath that might have interesting version conflicts. Is mahout already required by Pig? Doesn't hadoop have a variable length int already for some of its file formats?

          Show
          Scott Carey added a comment - should "a:int,b:int" really generate a different class than "x:int,y:int"? maybe! They could be the same class, but both support getField(String) if the constructor took a reference to the Schema that has the field name details or a datastructure that maps fields names to indexes. That isn't trivial though, and again its something that Avro could give you for free. I think ironing out the kinks here before going for extra features is the way to go, we can leverage other tools for advanced features in a later version. I will review more late this week or next week. import org.apache.mahout.math.Varint; I am not a fan of requiring another library on my already crowded hadoop classpath that might have interesting version conflicts. Is mahout already required by Pig? Doesn't hadoop have a variable length int already for some of its file formats?
          Hide
          Jonathan Coveney added a comment -

          I agree 100% that it's time to iron out kinks. Just trying to figure out what those kinks are, realy.

          As far as Varint, I went ahead and just included the logic directly. It' 8 ~3 line functions, not worth the dependency.

          Show
          Jonathan Coveney added a comment - I agree 100% that it's time to iron out kinks. Just trying to figure out what those kinks are, realy. As far as Varint, I went ahead and just included the logic directly. It' 8 ~3 line functions, not worth the dependency.
          Hide
          Dmitriy V. Ryaboy added a comment -

          remember to attribute the code to mahout project by adding a notice to NOTICE.txt

          Scott, does Avro have a java generator or is it c-based, like protobufs and thrift (Jon – the reason protoc is a heavy dependency is that it's a c binary, not a jar).

          Show
          Dmitriy V. Ryaboy added a comment - remember to attribute the code to mahout project by adding a notice to NOTICE.txt Scott, does Avro have a java generator or is it c-based, like protobufs and thrift (Jon – the reason protoc is a heavy dependency is that it's a c binary, not a jar).
          Hide
          Jonathan Coveney added a comment -

          Ah, gotcha.

          As far as the NOTICE.txt, I don't see any other apache products in there? Is it necessary to include Apache projects in the NOTICE? I didn't think that it is necessary. This FAQ seems to agree but I am deeply ignorant on licensing issues:
          http://www.apache.org/dev/release.html#notice-content

          I guess the question is if Mahout took it from somewhere else?

          Show
          Jonathan Coveney added a comment - Ah, gotcha. As far as the NOTICE.txt, I don't see any other apache products in there? Is it necessary to include Apache projects in the NOTICE? I didn't think that it is necessary. This FAQ seems to agree but I am deeply ignorant on licensing issues: http://www.apache.org/dev/release.html#notice-content I guess the question is if Mahout took it from somewhere else?
          Hide
          Julien Le Dem added a comment -

          By design Avro defines the binary format and the schema as the reference; not the code generator.
          There is a code generator in java.
          Avro also supports "dynamic" serialization/deserialization so that you can read/write without generating the code.

          Show
          Julien Le Dem added a comment - By design Avro defines the binary format and the schema as the reference; not the code generator. There is a code generator in java. Avro also supports "dynamic" serialization/deserialization so that you can read/write without generating the code.
          Hide
          Scott Carey added a comment -

          Avro has a pure Java code generator that uses the Velocity template engine. The bulit-in templates produce SpecificRecord objects with builder-pattern semantics. Custom templates can be loaded from a classpath, so you can convert a schema to whatever you wish – java source, yaml, c#, etc.

          The challenge with Pig and Avro for this use case is transforming a o.a.pig.Schema into an o.a.a.Schema. However, there is source code that does this at: https://issues.apache.org/jira/browse/AVRO-592

          Any future bytecode generation would resemble the template approach – a schema would decorate a (source code template / byte code fragments) to produce an output. Both would call pre-existing methods on a superclass that does most of the work. Refining a template to be as small and simple as possible corresponds with refining bytecode generation to do the minimum necessary.

          Show
          Scott Carey added a comment - Avro has a pure Java code generator that uses the Velocity template engine. The bulit-in templates produce SpecificRecord objects with builder-pattern semantics. Custom templates can be loaded from a classpath, so you can convert a schema to whatever you wish – java source, yaml, c#, etc. The challenge with Pig and Avro for this use case is transforming a o.a.pig.Schema into an o.a.a.Schema. However, there is source code that does this at: https://issues.apache.org/jira/browse/AVRO-592 Any future bytecode generation would resemble the template approach – a schema would decorate a (source code template / byte code fragments) to produce an output. Both would call pre-existing methods on a superclass that does most of the work. Refining a template to be as small and simple as possible corresponds with refining bytecode generation to do the minimum necessary.
          Hide
          Julien Le Dem added a comment -

          @Jonathan: regarding bit manipulation in the patch, you might want to use java.util.BitSet

          Show
          Julien Le Dem added a comment - @Jonathan: regarding bit manipulation in the patch, you might want to use java.util.BitSet
          Hide
          Jonathan Coveney added a comment -

          Powerpoint?! What is this, Wall Street? I know, I know. But we thought it would be instructive to benchmark SchemaTuple versus the existing Tuple implementations (namely Tuple and PrimitiveTuple) to see what sort of gains are possible.

          If the schema type wasn't mentioned, it's a long. In general it was all done with longs (for ease), except in the case of serialization where the long/int difference made a pretty big difference.

          Some results:

          • It takes a depressingly long time to make a tuple with a given size. Well, depressing being on a pretty small order, but it's because with a new Tuple of a given size it nulls out those values. We could alleviate this, but I don't know if the code complexity/slightly increased memory footprint would be worth it.
          • In general, the PrimitiveTuple performance is poor (though it does have a decreased memory footprint). There are some enhancements that would make it faster, but I think that SchemaTuple will end up making it completely obsolete.
          • The values that were set or serialized started at 0 and went up depending on how many values google calipers gave it. This was especially important for the size on disk of serialized values: SchemaTuple uses Varint, so obviously for smaller values it's going to be more compact. However, more of note, is that Tuple storage for longs is really really large. We can optimize it (I have a patch that does, but need to test the speed implications). After SchemaTuple will probably come SchemaBag, but after that will come some changes to serialization at the suggestion of Scott Carey that could be really huge.

          But basically, SchemaTuple is better than Tuples in every way (that it applies). The current patch uses it where it is possible with UDF's, but patch 1 will probably just be their existence (and perhaps using it with UDF's where the Schema is known), and then I'll incrementally add it (first candidate would be joins or anything internal to pig…the memory and speed benefits should be very beneficial).

          Show
          Jonathan Coveney added a comment - Powerpoint?! What is this, Wall Street? I know, I know. But we thought it would be instructive to benchmark SchemaTuple versus the existing Tuple implementations (namely Tuple and PrimitiveTuple) to see what sort of gains are possible. If the schema type wasn't mentioned, it's a long. In general it was all done with longs (for ease), except in the case of serialization where the long/int difference made a pretty big difference. Some results: It takes a depressingly long time to make a tuple with a given size. Well, depressing being on a pretty small order, but it's because with a new Tuple of a given size it nulls out those values. We could alleviate this, but I don't know if the code complexity/slightly increased memory footprint would be worth it. In general, the PrimitiveTuple performance is poor (though it does have a decreased memory footprint). There are some enhancements that would make it faster, but I think that SchemaTuple will end up making it completely obsolete. The values that were set or serialized started at 0 and went up depending on how many values google calipers gave it. This was especially important for the size on disk of serialized values: SchemaTuple uses Varint, so obviously for smaller values it's going to be more compact. However, more of note, is that Tuple storage for longs is really really large. We can optimize it (I have a patch that does, but need to test the speed implications). After SchemaTuple will probably come SchemaBag, but after that will come some changes to serialization at the suggestion of Scott Carey that could be really huge. But basically, SchemaTuple is better than Tuples in every way (that it applies). The current patch uses it where it is possible with UDF's, but patch 1 will probably just be their existence (and perhaps using it with UDF's where the Schema is known), and then I'll incrementally add it (first candidate would be joins or anything internal to pig…the memory and speed benefits should be very beneficial).
          Hide
          Jonathan Coveney added a comment -

          Hmm, so I was testing my benchmarks, and the Varint/varlong CPU cost is higher than the benchmark was capturing. For large longs, the it can be even 3-4x slower (this came out of work for PIG-2638, and in that case I came up with a method that should give the same benefit and be more performant, but it won't apply to this case). I may just switch to simple "store the whole long" and hope intermediate compression is turned on and effective, but that seems unsatisfying to me. Will ruminate on that. Perhaps this is the part where Scott says Pig should use Avro for the intermediate serialization again

          Show
          Jonathan Coveney added a comment - Hmm, so I was testing my benchmarks, and the Varint/varlong CPU cost is higher than the benchmark was capturing. For large longs, the it can be even 3-4x slower (this came out of work for PIG-2638 , and in that case I came up with a method that should give the same benefit and be more performant, but it won't apply to this case). I may just switch to simple "store the whole long" and hope intermediate compression is turned on and effective, but that seems unsatisfying to me. Will ruminate on that. Perhaps this is the part where Scott says Pig should use Avro for the intermediate serialization again
          Hide
          Julien Le Dem added a comment -

          Hi Jonathan,
          This is all great, I would tend to leave the varint aside in separate JIRA for now.
          There is enough work in this one already.

          Show
          Julien Le Dem added a comment - Hi Jonathan, This is all great, I would tend to leave the varint aside in separate JIRA for now. There is enough work in this one already.
          Hide
          Jonathan Coveney added a comment -

          Oh snap! An update! Indeed. The code and logic are much cleaner, and it works. Well, mostly. In order to turn it on, you need to set the key "pig.schematuple" to be true. In distributed mode, it distributes generated code via the distributed cache. I have some documentation, but will work on adding more. I think it's at a point where it could use some eyes.

          One big issue is that you currently can't group on a SchemaTuple. This is actually a known limitation, and there are comments in Pig that UDF's have to emit a Tuple. Frustratingly, even though TupleFactory as a "tupleRawComparatorClass," it only works with default Tuples, and this assumption is baked into the code. There are a couple of ways to deal with this...

          1) Change the default Tuple comparator so that it works with any sort of tuple.
          2) Make it so that Tuples can return an instance of the factory that generated them, which could then be used to get the proper comparator... or something like that. The general idea being better (and by better I mean not-nonexistent) support for custom implementations of core types.

          I think 2 is the way to go because as things are, 1 will not be easy to do and would share a lot in common with 2. 2 is nontrivial, but will open the door to the big one I'm leading up to: a bag that is schema aware. That is when the gains go from great to massive.

          Another next step is to get joins to work with this, but I want to deal with the above issues.

          Show
          Jonathan Coveney added a comment - Oh snap! An update! Indeed. The code and logic are much cleaner, and it works . Well, mostly. In order to turn it on, you need to set the key "pig.schematuple" to be true. In distributed mode, it distributes generated code via the distributed cache. I have some documentation, but will work on adding more. I think it's at a point where it could use some eyes. One big issue is that you currently can't group on a SchemaTuple. This is actually a known limitation, and there are comments in Pig that UDF's have to emit a Tuple. Frustratingly, even though TupleFactory as a "tupleRawComparatorClass," it only works with default Tuples, and this assumption is baked into the code. There are a couple of ways to deal with this... 1) Change the default Tuple comparator so that it works with any sort of tuple. 2) Make it so that Tuples can return an instance of the factory that generated them, which could then be used to get the proper comparator... or something like that. The general idea being better (and by better I mean not-nonexistent) support for custom implementations of core types. I think 2 is the way to go because as things are, 1 will not be easy to do and would share a lot in common with 2. 2 is nontrivial, but will open the door to the big one I'm leading up to: a bag that is schema aware. That is when the gains go from great to massive. Another next step is to get joins to work with this, but I want to deal with the above issues.
          Hide
          Dmitriy V. Ryaboy added a comment -

          RB?

          Show
          Dmitriy V. Ryaboy added a comment - RB?
          Hide
          Jonathan Coveney added a comment -

          Ok, nix the kvetching. With some relatively minor fixes I was able to get support for grouping! Also, I had forgotten to roll back changes to TestPigServer which I had made for personal testing.

          One day the comparator could still be much faster, but this should be at least as performant as the current implementation (probably moreso since deserialization should be faster), and you get the memory gains.

          Show
          Jonathan Coveney added a comment - Ok, nix the kvetching. With some relatively minor fixes I was able to get support for grouping! Also, I had forgotten to roll back changes to TestPigServer which I had made for personal testing. One day the comparator could still be much faster, but this should be at least as performant as the current implementation (probably moreso since deserialization should be faster), and you get the memory gains.
          Hide
          Jonathan Coveney added a comment -

          dmitriy: reviews.apache.org is down for me...has been all day :/

          Show
          Jonathan Coveney added a comment - dmitriy: reviews.apache.org is down for me...has been all day :/
          Hide
          Jonathan Coveney added a comment -

          I added the apache headers to new files.

          Show
          Jonathan Coveney added a comment - I added the apache headers to new files.
          Hide
          Jonathan Coveney added a comment -

          Guys, reviews.apache.org has been down for a while it looks like, and I don't know how long it will take for it to come back up, so I made a pull request against the apache pig github. Obviously it won't be committed via github, but it provides an easy to see and markdown diff.

          https://github.com/apache/pig/pull/3/

          Show
          Jonathan Coveney added a comment - Guys, reviews.apache.org has been down for a while it looks like, and I don't know how long it will take for it to come back up, so I made a pull request against the apache pig github. Obviously it won't be committed via github, but it provides an easy to see and markdown diff. https://github.com/apache/pig/pull/3/
          Hide
          Jonathan Coveney added a comment -

          whoops, here is a version without the whitespace changes:

          https://github.com/apache/pig/pull/4

          Show
          Jonathan Coveney added a comment - whoops, here is a version without the whitespace changes: https://github.com/apache/pig/pull/4
          Hide
          Jonathan Coveney added a comment -

          I updated the pull request with a minor change, attaching patch.

          Show
          Jonathan Coveney added a comment - I updated the pull request with a minor change, attaching patch.
          Hide
          Scott Carey added a comment -

          I can't seem to be able to view the powerpoint. On a mac, all slides display as blank. Importing it into google docs also shows them blank. I was hoping to export to pdf format from there. Can you save it as a different format? Older powerpoint formats or pdf will probably work.

          Show
          Scott Carey added a comment - I can't seem to be able to view the powerpoint. On a mac, all slides display as blank. Importing it into google docs also shows them blank. I was hoping to export to pdf format from there. Can you save it as a different format? Older powerpoint formats or pdf will probably work.
          Hide
          Jonathan Coveney added a comment -

          Scott, I attached a pdf. Note: I think the serializing times (if they're still in there) aren't representative, as I didn't use the FSDataOutputStream. I haven't bothered to redo because I know the gain is significant, and would rather focus on shipping.

          Would love your thoughts.

          Show
          Jonathan Coveney added a comment - Scott, I attached a pdf. Note: I think the serializing times (if they're still in there) aren't representative, as I didn't use the FSDataOutputStream. I haven't bothered to redo because I know the gain is significant, and would rather focus on shipping. Would love your thoughts.
          Hide
          Scott Carey added a comment -

          Thanks, I can read the pdf fine.
          Yeah, optimizing the serialization performance and size can be done in another ticket. Refactoring that helps separate that from other concerns would help people like myself contribute to that with minimal entanglement as well.
          Code gen by itself is big enough for this ticket

          Show
          Scott Carey added a comment - Thanks, I can read the pdf fine. Yeah, optimizing the serialization performance and size can be done in another ticket. Refactoring that helps separate that from other concerns would help people like myself contribute to that with minimal entanglement as well. Code gen by itself is big enough for this ticket
          Hide
          Jonathan Coveney added a comment -

          see reviewboard as well

          Show
          Jonathan Coveney added a comment - see reviewboard as well
          Hide
          Jonathan Coveney added a comment -

          Scott: totally agree.

          In general: what do we think needs to be done before this can be committed? Would be very helpful.

          Show
          Jonathan Coveney added a comment - Scott: totally agree. In general: what do we think needs to be done before this can be committed? Would be very helpful.
          Hide
          Jonathan Coveney added a comment -

          Updated patch, and I updated reviewboard.

          Show
          Jonathan Coveney added a comment - Updated patch, and I updated reviewboard.
          Hide
          Daniel Dai added a comment -

          Does not compile. Seems you miss TupleMaker.java?

          Show
          Daniel Dai added a comment - Does not compile. Seems you miss TupleMaker.java?
          Hide
          Jonathan Coveney added a comment -

          Argh. Good catch, Daniel. the currently git -> reviewboard is very frustrating, hopefully they'll respond to the INFRA patch soon! Thanks for the heads up

          Show
          Jonathan Coveney added a comment - Argh. Good catch, Daniel. the currently git -> reviewboard is very frustrating, hopefully they'll respond to the INFRA patch soon! Thanks for the heads up
          Hide
          Jonathan Coveney added a comment -

          Julien +1'd on reviewboard (didn't +1 here because JIRA has been down for people). Revision is: r1356921. I will add more documentation in a separate patch. This is TURNED OFF by default so should be invisible to existing jobs.

          Show
          Jonathan Coveney added a comment - Julien +1'd on reviewboard (didn't +1 here because JIRA has been down for people). Revision is: r1356921. I will add more documentation in a separate patch. This is TURNED OFF by default so should be invisible to existing jobs.
          Hide
          Daniel Dai added a comment -

          Thinking of several possible enhancements:
          1. Use SchemaTuple in POForEach when schema is known
          2. Combine Dmitriy's lazy tuple implementation

          Show
          Daniel Dai added a comment - Thinking of several possible enhancements: 1. Use SchemaTuple in POForEach when schema is known 2. Combine Dmitriy's lazy tuple implementation
          Hide
          Daniel Dai added a comment -

          Several unit tests are broken by this patch:
          TestEmptyInputDir
          TestFRJoin
          TestFRJoinNullValue
          TestMergeJoin
          TestNumberOfReducers
          TestPhyPatternMatch
          TestScriptLanguage

          Jonathan, do you have time to take a look?

          Show
          Daniel Dai added a comment - Several unit tests are broken by this patch: TestEmptyInputDir TestFRJoin TestFRJoinNullValue TestMergeJoin TestNumberOfReducers TestPhyPatternMatch TestScriptLanguage Jonathan, do you have time to take a look?
          Hide
          Jonathan Coveney added a comment -

          Daniel,

          1) I agree that there are a lot of great places to use this. Next on my plate is using it with LoadFuncs and Foreaches, and then ideally Bag support (which I do not think would be difficult at all, just need some time). I hadn't thought about lazy tuples – need to take a look at his code.
          2) I submitted a patch fixing the MergeJoin errors, and have time to look at the rest. Do you know if any of the others were fixed by that fix? I hate the flakiness of the full test suite, hard to know what is and isn't a false positive!

          Show
          Jonathan Coveney added a comment - Daniel, 1) I agree that there are a lot of great places to use this. Next on my plate is using it with LoadFuncs and Foreaches, and then ideally Bag support (which I do not think would be difficult at all, just need some time). I hadn't thought about lazy tuples – need to take a look at his code. 2) I submitted a patch fixing the MergeJoin errors, and have time to look at the rest. Do you know if any of the others were fixed by that fix? I hate the flakiness of the full test suite, hard to know what is and isn't a false positive!
          Hide
          Rohini Palaniswamy added a comment -

          TestSchemaTuple.java is broken for Hadoop 23/2.0.

          reader.initialize(is, new TaskAttemptContext(conf, taskId)); 
          

          should be

          reader.initialize(is, HadoopShims.createTaskAttemptContext(conf, taskId)); 
          
          Show
          Rohini Palaniswamy added a comment - TestSchemaTuple.java is broken for Hadoop 23/2.0. reader.initialize(is, new TaskAttemptContext(conf, taskId)); should be reader.initialize(is, HadoopShims.createTaskAttemptContext(conf, taskId));
          Hide
          Daniel Dai added a comment -

          Hi, Jonathan,
          Where is the fix for MergeJoin? I can run the tests to see how many it fixes.

          Show
          Daniel Dai added a comment - Hi, Jonathan, Where is the fix for MergeJoin? I can run the tests to see how many it fixes.
          Hide
          Jonathan Coveney added a comment -

          Daniel,

          I thought that the tests had already run. I checked in PIG-2806, but will check locally what's still broken and fix it.

          Rohini,

          Thanks. Easy fix, will update accordingly.

          Show
          Jonathan Coveney added a comment - Daniel, I thought that the tests had already run. I checked in PIG-2806 , but will check locally what's still broken and fix it. Rohini, Thanks. Easy fix, will update accordingly.

            People

            • Assignee:
              Jonathan Coveney
              Reporter:
              Jonathan Coveney
            • Votes:
              1 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development