Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.10.0
    • Component/s: data
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Hide
      Introduce boolean as first class Pig data type. You can use "boolean" anywhere Pig expecting a data type. For example:
      a = load 'input' as (a0:boolean, a1:tuple(a10:boolean, a11:int), a2);
      b = foreach a generate a0, a1, (boolean)a2;
      c = group b by a2; -- group by a boolean field

      When UTF8StorageConvert converts bytes into boolean, it expects "true" (ignore case) to be TRUE and "false" (ignore case) to be FALSE, otherwise, we get null. For example:
      a = load 'input' as (a0:boolean);

      input file:
      true -- we get TRUE
      True -- we get TRUE
      FALSE -- we get FALSE
      1 -- we get null

      We also change the interface LoadCaster/StoreCaster to include boolean type.
      Show
      Introduce boolean as first class Pig data type. You can use "boolean" anywhere Pig expecting a data type. For example: a = load 'input' as (a0:boolean, a1:tuple(a10:boolean, a11:int), a2); b = foreach a generate a0, a1, (boolean)a2; c = group b by a2; -- group by a boolean field When UTF8StorageConvert converts bytes into boolean, it expects "true" (ignore case) to be TRUE and "false" (ignore case) to be FALSE, otherwise, we get null. For example: a = load 'input' as (a0:boolean); input file: true -- we get TRUE True -- we get TRUE FALSE -- we get FALSE 1 -- we get null We also change the interface LoadCaster/StoreCaster to include boolean type.
    • Tags:
      boolean type pig

      Description

      Pig needs a Boolean data type. Pig-1097 is dependent on doing this.

      I volunteer. Is there anything beyond the work in src/org/apache/pig/data/ plus unit tests to make this work?

      This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

      1. PIG-1429_1.patch
        25 kB
        Zhijie Shen
      2. PIG-1429_2.patch
        39 kB
        Zhijie Shen
      3. PIG-1429_3.patch
        209 kB
        Zhijie Shen
      4. PIG-1429_4.patch
        122 kB
        Zhijie Shen
      5. PIG-1429_5.patch
        95 kB
        Zhijie Shen
      6. PIG-1429_6.patch
        96 kB
        Daniel Dai
      7. working_boolean.patch
        23 kB
        Russell Jurney

        Issue Links

          Activity

          Hide
          Dmitriy V. Ryaboy added a comment -

          Have at it, Sir.

          Show
          Dmitriy V. Ryaboy added a comment - Have at it, Sir.
          Hide
          Russell Jurney added a comment -

          Broken patch that adds boolean type.

          Show
          Russell Jurney added a comment - Broken patch that adds boolean type.
          Hide
          Russell Jurney added a comment -

          Did the work I think is required based on Alan's comments in PIG-1314 and help from Dmitriy. It builds - I still have to add tests (may be the only way to fix this), but I'm hoping someone can ID my problem. I keep getting the exception below. Anyone know where I should look? I've traced this through, and nothing stands out.

          -------------

          org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan.
          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:261)
          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:228)
          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
          at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
          at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
          at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
          2010-05-29 20:04:25,363 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0001
          2010-05-29 20:04:29,866 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
          2010-05-29 20:04:29,866 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map reduce job(s) failed!
          2010-05-29 20:04:29,868 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed to produce result in: "file:/tmp/temp-537038699/tmp-381529216"
          2010-05-29 20:04:29,868 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : Unable to determine number of records written
          2010-05-29 20:04:29,868 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : Unable to determine number of bytes written
          2010-05-29 20:04:29,868 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Spillable Memory Manager spill count : 0
          2010-05-29 20:04:29,869 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Proactive spill count : 0
          2010-05-29 20:04:29,869 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
          2010-05-29 20:04:29,872 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
          2010-05-29 20:04:29,876 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias A
          2010-05-29 20:04:29,876 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias A
          at org.apache.pig.PigServer.openIterator(PigServer.java:663)
          at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:598)
          at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:291)
          at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
          at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
          at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76)
          at org.apache.pig.Main.main(Main.java:410)
          Caused by: java.io.IOException: Job terminated with anomalous status FAILED
          at org.apache.pig.PigServer.openIterator(PigServer.java:657)
          ... 6 more

          Show
          Russell Jurney added a comment - Did the work I think is required based on Alan's comments in PIG-1314 and help from Dmitriy. It builds - I still have to add tests (may be the only way to fix this), but I'm hoping someone can ID my problem. I keep getting the exception below. Anyone know where I should look? I've traced this through, and nothing stands out. ------------- org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan. at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:261) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:228) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177) 2010-05-29 20:04:25,363 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0001 2010-05-29 20:04:29,866 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2010-05-29 20:04:29,866 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map reduce job(s) failed! 2010-05-29 20:04:29,868 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed to produce result in: "file:/tmp/temp-537038699/tmp-381529216" 2010-05-29 20:04:29,868 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : Unable to determine number of records written 2010-05-29 20:04:29,868 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : Unable to determine number of bytes written 2010-05-29 20:04:29,868 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Spillable Memory Manager spill count : 0 2010-05-29 20:04:29,869 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Proactive spill count : 0 2010-05-29 20:04:29,869 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed! 2010-05-29 20:04:29,872 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized 2010-05-29 20:04:29,876 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias A 2010-05-29 20:04:29,876 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias A at org.apache.pig.PigServer.openIterator(PigServer.java:663) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:598) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:291) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76) at org.apache.pig.Main.main(Main.java:410) Caused by: java.io.IOException: Job terminated with anomalous status FAILED at org.apache.pig.PigServer.openIterator(PigServer.java:657) ... 6 more
          Hide
          Russell Jurney added a comment -

          Did some more work, have a new patch... seems the problem is in PigMapBase.runPipeline:

          protected void runPipeline(PhysicalOperator leaf) throws IOException, InterruptedException {
          while(true){
          String foo = ""; String bar = "";
          Result res = leaf.getNext(DUMMYTUPLE);

          res is NULL, so it dies.

          The leaf is: (Name: A: New For Each(false,false)[bag] - 1-13 Operator Key: 1-13)

          Show
          Russell Jurney added a comment - Did some more work, have a new patch... seems the problem is in PigMapBase.runPipeline: protected void runPipeline(PhysicalOperator leaf) throws IOException, InterruptedException { while(true){ String foo = ""; String bar = ""; Result res = leaf.getNext(DUMMYTUPLE); res is NULL, so it dies. The leaf is: (Name: A: New For Each(false,false) [bag] - 1-13 Operator Key: 1-13)
          Hide
          Russell Jurney added a comment -

          Attached patch can LOAD/DUMP a boolean type I'll work on more tests, but it roughly works.

          Show
          Russell Jurney added a comment - Attached patch can LOAD/DUMP a boolean type I'll work on more tests, but it roughly works.
          Hide
          Russell Jurney added a comment -

          Some more work to be done with operators.

          Show
          Russell Jurney added a comment - Some more work to be done with operators.
          Hide
          Alan Gates added a comment -

          Is this patch ready for review or does it need more work?

          Show
          Alan Gates added a comment - Is this patch ready for review or does it need more work?
          Hide
          Russell Jurney added a comment -

          The patch needs more work. Should knock it out in the next couple weeks.

          Show
          Russell Jurney added a comment - The patch needs more work. Should knock it out in the next couple weeks.
          Hide
          Russell Jurney added a comment -

          I'll be able to wrap this up next weekend.

          Show
          Russell Jurney added a comment - I'll be able to wrap this up next weekend.
          Hide
          Olga Natkovich added a comment -

          Hi Russel, are you still planning to wrap this up before Pig 0.8.0 release?

          Show
          Olga Natkovich added a comment - Hi Russel, are you still planning to wrap this up before Pig 0.8.0 release?
          Hide
          Olga Natkovich added a comment -

          Unlinking because we are branching for release today

          Show
          Olga Natkovich added a comment - Unlinking because we are branching for release today
          Hide
          Daniel Dai added a comment -

          Hi, Russell, are you still working on this issue?

          Show
          Daniel Dai added a comment - Hi, Russell, are you still working on this issue?
          Hide
          Gianmarco De Francisci Morales added a comment -

          Hi,
          I was considering applying for this project.
          From what I know of Pig's codebase, the refactoring effort would be large.
          There will be changes throughout the whole code as a new datatype is introduced and most of the code is based on a switch pattern.

          Is the code produced up to know still synced with trunk or should the patch be started over?

          Maybe the original author of the patch is a better candidate for this project

          Show
          Gianmarco De Francisci Morales added a comment - Hi, I was considering applying for this project. From what I know of Pig's codebase, the refactoring effort would be large. There will be changes throughout the whole code as a new datatype is introduced and most of the code is based on a switch pattern. Is the code produced up to know still synced with trunk or should the patch be started over? Maybe the original author of the patch is a better candidate for this project
          Hide
          Daniel Dai added a comment -

          Hi, Gianmarco,
          We didn't hear back from original author for a while, so we assume he will not be able to continue working on this issue. That's why we put it as a GSOC project.

          As for the refactoring, IMHO, I don't see a major refactoring point, but there are code changes all around places. The hard part is not missing a place, rather than a complete refactoring.

          Show
          Daniel Dai added a comment - Hi, Gianmarco, We didn't hear back from original author for a while, so we assume he will not be able to continue working on this issue. That's why we put it as a GSOC project. As for the refactoring, IMHO, I don't see a major refactoring point, but there are code changes all around places. The hard part is not missing a place, rather than a complete refactoring.
          Hide
          Zhijie Shen added a comment -

          Does anyone have the opinion of casting DataByteArray to Boolean?

          1. DataByteArray can represent a numeric value, such that non-zero value should be converted to True.
          2. DataByteArray can also represent a string, such that the "true" string should be converted to True.

          However, these two cases conflicts to some extent. a raw DataByteArray can be simultaneously translated into a non-zero numeric or a non-"true" string. Then, it is hard to say whether it should be converted to True or False.

          Show
          Zhijie Shen added a comment - Does anyone have the opinion of casting DataByteArray to Boolean? 1. DataByteArray can represent a numeric value, such that non-zero value should be converted to True. 2. DataByteArray can also represent a string, such that the "true" string should be converted to True. However, these two cases conflicts to some extent. a raw DataByteArray can be simultaneously translated into a non-zero numeric or a non-"true" string. Then, it is hard to say whether it should be converted to True or False.
          Hide
          Daniel Dai added a comment -

          I would vote for string "true"/"false"(regardless case), otherwise null, for Utf8StorageConverter.

          Show
          Daniel Dai added a comment - I would vote for string "true"/"false"(regardless case), otherwise null, for Utf8StorageConverter.
          Hide
          Zhijie Shen added a comment -

          So how about treating DataByteArray as a string, and using Boolean.valueOf(String s) of JDK to decide True/False.

          Show
          Zhijie Shen added a comment - So how about treating DataByteArray as a string, and using Boolean.valueOf(String s) of JDK to decide True/False.
          Hide
          Dmitriy V. Ryaboy added a comment -

          We already have a public bytesToBoolean method in Utf8StorageConverter and should keep its behavior (which is precisely this). In fact, looks like DataReaderWriter (used by BinStorage) already supports booleans, also as strings.

          This does mean we are being really inefficient in storing booleans in our binary format. BinInterSedes also does this, even though it could be flipping to INTEGER_0. This would be preferable, I think, at least for the internal data formats.

          Show
          Dmitriy V. Ryaboy added a comment - We already have a public bytesToBoolean method in Utf8StorageConverter and should keep its behavior (which is precisely this). In fact, looks like DataReaderWriter (used by BinStorage) already supports booleans, also as strings. This does mean we are being really inefficient in storing booleans in our binary format. BinInterSedes also does this, even though it could be flipping to INTEGER_0. This would be preferable, I think, at least for the internal data formats.
          Hide
          Zhijie Shen added a comment -

          If we decide to treat bytes as string and then convert to a boolean value, we should convert a boolean value to the bytes of "true" and "false" on the other way. Otherwise, the casting is not reversible. Do you agree this? Though the boolean storage seems not efficient.

          Show
          Zhijie Shen added a comment - If we decide to treat bytes as string and then convert to a boolean value, we should convert a boolean value to the bytes of "true" and "false" on the other way. Otherwise, the casting is not reversible. Do you agree this? Though the boolean storage seems not efficient.
          Hide
          Zhijie Shen added a comment -

          Attached is the temporal non-working patch to update my progress. I've added boolean cast to many places in the runtime code. I still need to check I've touched every place. The test code hasn't been well explored yet. The following is the type casting rule:

          1. From Boolean to other types

          Integer Long Float Double DataByteArray String
          True 1 1L 1.0F 1.0D "true" "true"
          False 0 0L 0.0F 0.0D "false" "false"

          2. From other types to Boolean

          Integer non-0 -> True 0 -> False
          Long non-0L -> True 0L -> False
          Float non-0.0F -> True 0.0F -> False
          Double non-0.0D -> True 0.0D -> False
          DataByteArray "true" > True non"true" -> False
          String "true" > True non"true" -> False

          I still have some puzzles with the code:

          1. In PigPerformanceLoader.Caster.bytesToBag(byte[] b, ResourceFieldSchema fs), there is a piece of code as follows

          switch (b[start])

          { case 105: t.set(0, bytesToInteger(copy)); break; case 108: t.set(0, bytesToLong(copy)); break; case 102: t.set(0, bytesToFloat(copy)); break; case 100: t.set(0, bytesToDouble(copy)); break; case 115: t.set(0, bytesToCharArray(copy)); break; case 109: t.set(0, bytesToMap(copy)); break; case 98: t.set(0, bytesToBag(copy, null)); break; default: throw new RuntimeException("unknown type " + b[start]); }

          Where does the number come? What should I assign to Boolean?

          2. Similarly, in QueryLexer.tokens, what is the rule to assigne a value to a token? Again what should be assigned to Boolean?

          Show
          Zhijie Shen added a comment - Attached is the temporal non-working patch to update my progress. I've added boolean cast to many places in the runtime code. I still need to check I've touched every place. The test code hasn't been well explored yet. The following is the type casting rule: 1. From Boolean to other types Integer Long Float Double DataByteArray String True 1 1L 1.0F 1.0D "true" "true" False 0 0L 0.0F 0.0D "false" "false" 2. From other types to Boolean Integer non-0 -> True 0 -> False Long non-0L -> True 0L -> False Float non-0.0F -> True 0.0F -> False Double non-0.0D -> True 0.0D -> False DataByteArray "true" > True non "true" -> False String "true" > True non "true" -> False I still have some puzzles with the code: 1. In PigPerformanceLoader.Caster.bytesToBag(byte[] b, ResourceFieldSchema fs), there is a piece of code as follows switch (b [start] ) { case 105: t.set(0, bytesToInteger(copy)); break; case 108: t.set(0, bytesToLong(copy)); break; case 102: t.set(0, bytesToFloat(copy)); break; case 100: t.set(0, bytesToDouble(copy)); break; case 115: t.set(0, bytesToCharArray(copy)); break; case 109: t.set(0, bytesToMap(copy)); break; case 98: t.set(0, bytesToBag(copy, null)); break; default: throw new RuntimeException("unknown type " + b[start]); } Where does the number come? What should I assign to Boolean? 2. Similarly, in QueryLexer.tokens, what is the rule to assigne a value to a token? Again what should be assigned to Boolean?
          Hide
          Gianmarco De Francisci Morales added a comment -

          2. Similarly, in QueryLexer.tokens, what is the rule to assigne a value to a token? Again what should be assigned to Boolean?

          Hi Zhijie,

          QueryLexer.tokens is generated automatically by ANTLR, so you should not touch it.
          What you should do is to modify QueryLexer.g in order to include a BOOLEAN token declaration (like INTEGER or FLOAT).
          Also, you should modify QueryParser.g and all the other grammar (.g) files to accept boolean expressions.

          Show
          Gianmarco De Francisci Morales added a comment - 2. Similarly, in QueryLexer.tokens, what is the rule to assigne a value to a token? Again what should be assigned to Boolean? Hi Zhijie, QueryLexer.tokens is generated automatically by ANTLR, so you should not touch it. What you should do is to modify QueryLexer.g in order to include a BOOLEAN token declaration (like INTEGER or FLOAT). Also, you should modify QueryParser.g and all the other grammar (.g) files to accept boolean expressions.
          Hide
          Zhijie Shen added a comment -

          The boolean type nearly works with the newest patch, though I still have a number of issues to check and test cases to add.

          In this patch, I've modified the parser so that boolean expression is recognized in the pig command statements (including the constant value "ture" and "false"). Meanwhile, loading or storing boolean value into storage is supported as well. In addition, I modified the casting rules so that other types can be casted to boolean and boolean can be casted to other types.

          Since the code related to data type is widely spread in the project. I need further investigation to ensure there is no missing part.

          Show
          Zhijie Shen added a comment - The boolean type nearly works with the newest patch, though I still have a number of issues to check and test cases to add. In this patch, I've modified the parser so that boolean expression is recognized in the pig command statements (including the constant value "ture" and "false"). Meanwhile, loading or storing boolean value into storage is supported as well. In addition, I modified the casting rules so that other types can be casted to boolean and boolean can be casted to other types. Since the code related to data type is widely spread in the project. I need further investigation to ensure there is no missing part.
          Hide
          Daniel Dai added a comment -

          Hi, Zhijie, patch in generally looks good. I tested load simple boolean data, load nested boolean data, cast into boolean, cast into complex boolean, all works. Here is my review comments:
          1. Conversion from string/bytearray -> boolean, only "true" ignore case will return true, otherwise return false. This is the behavior of Boolean.valueOf. I am fine with it, not sure how others in community think of
          2. Conversion from number -> boolean. Actually I'd like to disable it. If we do allow it, we need to do consistently. You did it in caster, but TypeCheckingExpVisitor does not handle it.
          3. POCast.java:382, "Note: NOT part of the LoadCaster interface", byteToBoolean is in interface in patch
          4. CastUtil.java: 195, why do we catch NumberFormatException?
          5. POCast.java:169, typo:"Cannot determine how to convert the bytearray to int"
          6. Tests is not added, right?

          Show
          Daniel Dai added a comment - Hi, Zhijie, patch in generally looks good. I tested load simple boolean data, load nested boolean data, cast into boolean, cast into complex boolean, all works. Here is my review comments: 1. Conversion from string/bytearray -> boolean, only "true" ignore case will return true, otherwise return false. This is the behavior of Boolean.valueOf. I am fine with it, not sure how others in community think of 2. Conversion from number -> boolean. Actually I'd like to disable it. If we do allow it, we need to do consistently. You did it in caster, but TypeCheckingExpVisitor does not handle it. 3. POCast.java:382, "Note: NOT part of the LoadCaster interface", byteToBoolean is in interface in patch 4. CastUtil.java: 195, why do we catch NumberFormatException? 5. POCast.java:169, typo:"Cannot determine how to convert the bytearray to int" 6. Tests is not added, right?
          Hide
          Zhijie Shen added a comment -

          Hi Daniel, thanks for your review. I'm adding the test cases, which should be available soon.

          For the second comment, I also hesitate to support the conversion between boolean and number, because the primitive types of Java doesn't support such kind of conversion. Temporally, I just followed the choice of the reporter of this issue. I'd like to hear the community's opinions.

          For the third comment, would you please explain it again? It seems not very clear to me. Thanks!

          Show
          Zhijie Shen added a comment - Hi Daniel, thanks for your review. I'm adding the test cases, which should be available soon. For the second comment, I also hesitate to support the conversion between boolean and number, because the primitive types of Java doesn't support such kind of conversion. Temporally, I just followed the choice of the reporter of this issue. I'd like to hear the community's opinions. For the third comment, would you please explain it again? It seems not very clear to me. Thanks!
          Hide
          Zhijie Shen added a comment -

          I just found there is one more issue to deal with: sort by a boolean column.

          Show
          Zhijie Shen added a comment - I just found there is one more issue to deal with: sort by a boolean column.
          Hide
          Daniel Dai added a comment -

          Yes, you are right, you will need to add PigBooleanArrayWritableComparator, PigGroupingBooleanWritableComparator, PigBooleanRawComparator, NullableIntWritable. Fortunately raw tuple comparator (BinInterSedesTupleRawComparator, DefaultTupleRawComparator) already handle boolean. You will need to add test case to test order by boolean, group by boolean, join by boolean.

          Show
          Daniel Dai added a comment - Yes, you are right, you will need to add PigBooleanArrayWritableComparator, PigGroupingBooleanWritableComparator, PigBooleanRawComparator, NullableIntWritable. Fortunately raw tuple comparator (BinInterSedesTupleRawComparator, DefaultTupleRawComparator) already handle boolean. You will need to add test case to test order by boolean, group by boolean, join by boolean.
          Hide
          Thejas M Nair added a comment -

          My opinion on what should be the behavior for casts to/from boolean -

          • explicit cast from numeric types and chararray to boolean and vice versa should be allowed
          • implicit conversions from boolean to other types in arithmetic/boolean expressions should not be done. This is similar to the behavior in case of chararray.
          • The cast from a string (bytearray/chararray) to boolean should be false only if the string is 'false'. If the value is not 'true' or 'false' (case insensitive), I think it is better to return null, and increment the counter that keeps track of conversion errors.
          Show
          Thejas M Nair added a comment - My opinion on what should be the behavior for casts to/from boolean - explicit cast from numeric types and chararray to boolean and vice versa should be allowed implicit conversions from boolean to other types in arithmetic/boolean expressions should not be done. This is similar to the behavior in case of chararray. The cast from a string (bytearray/chararray) to boolean should be false only if the string is 'false'. If the value is not 'true' or 'false' (case insensitive), I think it is better to return null, and increment the counter that keeps track of conversion errors.
          Hide
          Olga Natkovich added a comment -

          Hi Thejas,

          Your first point makes sense. I am not clear why you want to disallow conversion the other way. BTW strings allow conversions to numeric both ways: http://pig.apache.org/docs/r0.9.0/basic.html#cast

          Show
          Olga Natkovich added a comment - Hi Thejas, Your first point makes sense. I am not clear why you want to disallow conversion the other way. BTW strings allow conversions to numeric both ways: http://pig.apache.org/docs/r0.9.0/basic.html#cast
          Hide
          Thejas M Nair added a comment -

          Olga,
          I was referring to only implicit conversions. I meant to say that implicit conversions should be not be done for boolean both ways. Implicit conversions are not done for strings.

          Show
          Thejas M Nair added a comment - Olga, I was referring to only implicit conversions. I meant to say that implicit conversions should be not be done for boolean both ways. Implicit conversions are not done for strings.
          Hide
          Olga Natkovich added a comment -

          Oh, I missed that one was explicit and one was implicit - I was looking for symmetry

          Show
          Olga Natkovich added a comment - Oh, I missed that one was explicit and one was implicit - I was looking for symmetry
          Hide
          Zhijie Shen added a comment -

          Hi Thejas,

          I agree with you on the first two points. For the third one, since conversion from number to boolean is a surjective function (Z ->

          {true, false}, R -> {true, false}

          ), would it be better to construct a surjective function (

          {all strings}

          ->

          {true, false}

          ) as well to ensure consistency?

          Usually, the string that is neither "true" nor "false" is meaningless to be converted into boolean, but it is not necessarily a error. User applications should know the context better, thus how about leaving this issue to them? If they want to use string to represent boolean type, they can only use "true" and "false" to avoid ambiguity. And Pig just follows the same behavior Boolean.valueOf().

          Show
          Zhijie Shen added a comment - Hi Thejas, I agree with you on the first two points. For the third one, since conversion from number to boolean is a surjective function (Z -> {true, false}, R -> {true, false} ), would it be better to construct a surjective function ( {all strings} -> {true, false} ) as well to ensure consistency? Usually, the string that is neither "true" nor "false" is meaningless to be converted into boolean, but it is not necessarily a error. User applications should know the context better, thus how about leaving this issue to them? If they want to use string to represent boolean type, they can only use "true" and "false" to avoid ambiguity. And Pig just follows the same behavior Boolean.valueOf().
          Hide
          Thejas M Nair added a comment -

          Zhijie,
          I doubt if there are going to be use cases where users actually want a string like 'xyz' to be converted to false. I think it is more likely to happen because of user error such as applying the cast on a wrong column. Converting it to null and incrementing the conversion error counter will help the user find such an error sooner.
          I don't think the consistency with numeric conversion is as important because the logic for numeric conversion anyway different.

          Show
          Thejas M Nair added a comment - Zhijie, I doubt if there are going to be use cases where users actually want a string like 'xyz' to be converted to false. I think it is more likely to happen because of user error such as applying the cast on a wrong column. Converting it to null and incrementing the conversion error counter will help the user find such an error sooner. I don't think the consistency with numeric conversion is as important because the logic for numeric conversion anyway different.
          Hide
          Daniel Dai added a comment -

          Hi, Zhijie,
          Discussed with Thejas, here is the things we agree on:
          1. For string->boolean: we'd like "true"=>TRUE, "false"=>FALSE (ignore case), otherwise null
          2. For numeric type->boolean: zero=>FALSE, non-zero=>TRUE
          3. We only do string<=>boolean, numerical<=>boolean in explicit cast (POCast) and LoadStoreCaster (UTF8StoreageConverter). We don't do implicit cast for this (no need extensive change in TypeCheckingExpVisitor)

          Show
          Daniel Dai added a comment - Hi, Zhijie, Discussed with Thejas, here is the things we agree on: 1. For string->boolean: we'd like "true"=>TRUE, "false"=>FALSE (ignore case), otherwise null 2. For numeric type->boolean: zero=>FALSE, non-zero=>TRUE 3. We only do string<=>boolean, numerical<=>boolean in explicit cast (POCast) and LoadStoreCaster (UTF8StoreageConverter). We don't do implicit cast for this (no need extensive change in TypeCheckingExpVisitor)
          Hide
          Zhijie Shen added a comment -

          Hi,

          The newly attached patch is ready for submission.

          In this patch, I've updated the logic of converting String to Boolean according to Thejas's and Daniel's comments, added boolean comparator for sorting by boolean column, unified the type name of Boolean to "bool", and modified some other minor issues. Additionally, I've added the test cases.

          Please have me to verify the patch. Thanks indeed!

          Show
          Zhijie Shen added a comment - Hi, The newly attached patch is ready for submission. In this patch, I've updated the logic of converting String to Boolean according to Thejas's and Daniel's comments, added boolean comparator for sorting by boolean column, unified the type name of Boolean to "bool", and modified some other minor issues. Additionally, I've added the test cases. Please have me to verify the patch. Thanks indeed!
          Hide
          Daniel Dai added a comment -

          I was impressed by the completeness of the patch. Good job, Zhijie! The comments before were properly addressed. There are several more comments:
          1. package test.org.apache.pig.test.utils.dotGraph.parser is generated by javacc, it should not be included in patch

          2. I saw in golden files, we already have boolean type in some of the nodes, except we are using "boolean" instead of "bool". I wonder if we need to change the name. We can use "boolean" instead of "bool" as the name, thus no backward compatibility issue

          3. Utf8StorageConverter.java:359, the comments here is no longer applicable

          4. I don't understand the change in contrib/zebra/src/java/org/apache/hadoop/zebra/schema/SchemaParser.jjt, why we add "t = <>"?

          5. I don't understand some of the test case you change, if you can give a brief description each testcase you changed (like TestTypeCheckingValidatorNewLP:test implicit cast, TestConversions:test bytes<=>boolean conversion), that would help. Also I don't see any end-to-end tests(something similar to TestForEachNestedPlanLocal in your nested cross patch). We shall add the following end-to-end test: read boolean input, read a tuple containing boolean type, group/order by boolean

          Show
          Daniel Dai added a comment - I was impressed by the completeness of the patch. Good job, Zhijie! The comments before were properly addressed. There are several more comments: 1. package test.org.apache.pig.test.utils.dotGraph.parser is generated by javacc, it should not be included in patch 2. I saw in golden files, we already have boolean type in some of the nodes, except we are using "boolean" instead of "bool". I wonder if we need to change the name. We can use "boolean" instead of "bool" as the name, thus no backward compatibility issue 3. Utf8StorageConverter.java:359, the comments here is no longer applicable 4. I don't understand the change in contrib/zebra/src/java/org/apache/hadoop/zebra/schema/SchemaParser.jjt, why we add "t = <>"? 5. I don't understand some of the test case you change, if you can give a brief description each testcase you changed (like TestTypeCheckingValidatorNewLP:test implicit cast, TestConversions:test bytes<=>boolean conversion), that would help. Also I don't see any end-to-end tests(something similar to TestForEachNestedPlanLocal in your nested cross patch). We shall add the following end-to-end test: read boolean input, read a tuple containing boolean type, group/order by boolean
          Hide
          Zhijie Shen added a comment -

          Hi Daniel,

          Thanks for your review, I've fixed 1, 3, 4. For 5, I'll add the comments and some end-to-end tests. For 2, if we want to stick to "boolean" as the name of Boolean Type, we'd better revise the grammar: changing from "bool" keyword to "boolean". Otherwise, Utils.getSchemaFromString() will be broken if the supplied schema string uses "bool". And the name used in Pig Latin commands should be consistent to that in the displayed plan/schema.

          Show
          Zhijie Shen added a comment - Hi Daniel, Thanks for your review, I've fixed 1, 3, 4. For 5, I'll add the comments and some end-to-end tests. For 2, if we want to stick to "boolean" as the name of Boolean Type, we'd better revise the grammar: changing from "bool" keyword to "boolean". Otherwise, Utils.getSchemaFromString() will be broken if the supplied schema string uses "bool". And the name used in Pig Latin commands should be consistent to that in the displayed plan/schema.
          Hide
          Daniel Dai added a comment -

          You are right. But I mean change the keyword to "boolean" as well.

          Show
          Daniel Dai added a comment - You are right. But I mean change the keyword to "boolean" as well.
          Hide
          Zhijie Shen added a comment -

          Hi Daniel,

          Attached is the latest version of the patch. I've tried to fix most of the your comments. In this patch, I've changed and unified the name of Boolean type to "boolean" instead of "bool". I've added some brief comments for some test cases that I created. Additionally, I've added an end-to-end cast to show sorting by the boolean value column. However, I found that two other mentioned end-to-end cases have already been part of TestDataModel. Please have a look to see whether it is ready for submit.

          Show
          Zhijie Shen added a comment - Hi Daniel, Attached is the latest version of the patch. I've tried to fix most of the your comments. In this patch, I've changed and unified the name of Boolean type to "boolean" instead of "bool". I've added some brief comments for some test cases that I created. Additionally, I've added an end-to-end cast to show sorting by the boolean value column. However, I found that two other mentioned end-to-end cases have already been part of TestDataModel. Please have a look to see whether it is ready for submit.
          Hide
          Zhijie Shen added a comment -

          I've polished the patch by eliminating unnecessary changes and excluding the unrelated files.

          Show
          Zhijie Shen added a comment - I've polished the patch by eliminating unnecessary changes and excluding the unrelated files.
          Hide
          Daniel Dai added a comment -

          PIG-1429_6.patch include some minor change in parser, and fix issues in test-patch.

          Show
          Daniel Dai added a comment - PIG-1429 _6.patch include some minor change in parser, and fix issues in test-patch.
          Hide
          Daniel Dai added a comment -

          test-patch result:
          [exec]
          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 45 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] -1 javac. The applied patch generated 968 javac compiler warnings (more than the trunk's current 954 warnings).
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] -1 release audit. The applied patch generated 444 release audit warnings (more than the trunk's current 431 warnings).

          All new javac warnings are in parser generated code. The only new file PigBooleanRawComparator.java has the proper header.

          Patch committed to trunk. Great job, Zhijie!

          Show
          Daniel Dai added a comment - test-patch result: [exec] [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 45 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] -1 javac. The applied patch generated 968 javac compiler warnings (more than the trunk's current 954 warnings). [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 444 release audit warnings (more than the trunk's current 431 warnings). All new javac warnings are in parser generated code. The only new file PigBooleanRawComparator.java has the proper header. Patch committed to trunk. Great job, Zhijie!
          Hide
          Zhijie Shen added a comment -

          Thanks for your help, Daniel!

          Show
          Zhijie Shen added a comment - Thanks for your help, Daniel!

            People

            • Assignee:
              Zhijie Shen
              Reporter:
              Russell Jurney
            • Votes:
              1 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - 8h
                8h
                Remaining:
                Remaining Estimate - 8h
                8h
                Logged:
                Time Spent - Not Specified
                Not Specified

                  Development