  Pig / PIG-3987

Unjustified cast error when performing a UNION


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.10.1
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      I ran into a very strange issue with one of my Pig scripts, which I also described in this Stack Overflow question: http://stackoverflow.com/questions/24047572/strange-cast-error-in-pig-hadoop
      Here it is:
      I have the following script:

      br = LOAD 'cfs:///somedata';
      
      SPLIT br INTO s0 IF (sp == 1), not_s0 OTHERWISE;
      SPLIT not_s0 INTO s1 IF (adp >= 1.0), not_s1 OTHERWISE;
      SPLIT not_s1 INTO s2 IF (p > 1L), not_s2 OTHERWISE;
      SPLIT not_s2 INTO s3 IF (s > 0L), s4 OTHERWISE;
      
      tmp0 = FOREACH s0 GENERATE b, 'x' as seg;
      tmp1 = FOREACH s1 GENERATE b, 'y' as seg;
      tmp2 = FOREACH s2 GENERATE b, 'z' as seg;
      tmp3 = FOREACH s3 GENERATE b, 'w' as seg;
      tmp4 = FOREACH s4 GENERATE b, 't' as seg;
      
      out = UNION ONSCHEMA tmp0, tmp1, tmp2, tmp3, tmp4;
      
      dump out;
      

      The file loaded into br was generated by a previous Pig script and has an embedded schema (a .pig_schema file):

      describe br
      br: {b: chararray,p: long,afternoon: long,ddv: long,pa: long,s: long,t0002: long,t0204: long,t0406: long,t0608: long,t0810: long,t1012: long,t1214: long,t1416: long,t1618: long,t1820: long,t2022: long,t2200: long,browser_software: chararray,first_timestamp: long,last_timestamp: long,os: chararray,platform: chararray,sp: int,adp: double}
      

      Some irrelevant fields were removed from the schema above (I can't fully disclose the nature of the data at this time).

      The script fails with the following error:

      ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR: java.lang.Integer cannot be cast to java.lang.Long
      

      However, dumping tmp0, tmp1, tmp2, tmp3, tmp4 works flawlessly.

      The Hadoop job tracker shows the following error 4 times:

      java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
      	at java.lang.Long.compareTo(Long.java:50)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.EqualToExpr.doComparison(EqualToExpr.java:116)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.EqualToExpr.getNext(EqualToExpr.java:83)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.PONot.getNext(PONot.java:71)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:148)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:233)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.getNext(POSplit.java:214)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.runPipeline(POSplit.java:254)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.processPlan(POSplit.java:236)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.getNext(POSplit.java:228)
      	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:271)
      	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:266)
      	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
      	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
      	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
      	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
      	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:415)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
      	at org.apache.hadoop.mapred.Child.main(Child.java:260)
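
      For what it's worth, my reading of the trace is that the comparison operator hands Long.compareTo a boxed value of the wrong type through the raw Comparable interface. A minimal plain-Java sketch of that failure mode (this is my interpretation, not Pig code):

```java
// Sketch of the failure mode the trace points at: comparing boxed numbers
// through the raw Comparable interface, the way a generic comparison
// operator would. Long.compareTo(Object) is a bridge method that casts
// its argument to Long, so a stray Integer triggers exactly this error.
public class RawCompareDemo {
    @SuppressWarnings({"unchecked", "rawtypes"})
    static int rawCompare(Comparable left, Comparable right) {
        return left.compareTo(right); // casts `right` to left's type at runtime
    }

    public static void main(String[] args) {
        // Matching boxed types compare normally.
        System.out.println(rawCompare(Long.valueOf(2L), Long.valueOf(1L)) > 0); // true

        // A mismatched box fails with the same ClassCastException as the job.
        try {
            rawCompare(Long.valueOf(1L), Integer.valueOf(1));
            System.out.println("no exception");
        } catch (ClassCastException e) {
            System.out.println("ClassCastException, as in the trace");
        }
    }
}
```

      If that is what's happening, it would mean the merged UNION plan feeds the SPLIT filters tuples whose field types no longer match the types the filters were compiled against.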
      

      I also tried this snippet (instead of the original dump):

      x = UNION s1,s2;
      y = FOREACH x GENERATE b;
      dump y;
      

      and I get a different (but I assume related) error:

      ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR: java.lang.Double cannot be cast to java.lang.Long
      

      with the job tracker error (repeated 4 times):

      java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Long
      	at java.lang.Long.compareTo(Long.java:50)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.GTOrEqualToExpr.doComparison(GTOrEqualToExpr.java:111)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.GTOrEqualToExpr.getNext(GTOrEqualToExpr.java:78)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.PONot.getNext(PONot.java:71)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:148)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:233)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:141)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.runPipeline(POSplit.java:254)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.processPlan(POSplit.java:236)
      	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.getNext(POSplit.java:228)
      	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:271)
      	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:266)
      	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
      	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
      	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
      	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
      	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:415)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
      	at org.apache.hadoop.mapred.Child.main(Child.java:260)
      

      I don't think this is a data quality issue. I successfully ran the following snippet (picking up after the definitions of s0, s1, ...):

      tmp0 = FOREACH s0 GENERATE *, 'x' as seg;
      tmp1 = FOREACH s1 GENERATE *, 'y' as seg;
      tmp2 = FOREACH s2 GENERATE *, 'z' as seg;
      tmp3 = FOREACH s3 GENERATE *, 'w' as seg;
      tmp4 = FOREACH s4 GENERATE *, 't' as seg;
      
      br_seg = UNION ONSCHEMA tmp0, tmp1, tmp2, tmp3, tmp4;
      
      breakdown = FOREACH( GROUP br_seg BY seg ){
        ddb = FILTER br_seg BY (ddv > 0L);
        desktop = FILTER br_seg BY (platform == 'd');
        mobile = FILTER br_seg BY (platform == 'm');
        p_br = FILTER br_seg BY (sp == 1);
        tablet = FILTER br_seg BY (platform == 't');
        GENERATE group as seg,
          COUNT(br_seg) as br,
          SUM(br_seg.p) as p,
          COUNT(ddb) as ddb,
          COUNT(desktop) as desktop,
          COUNT(mobile) as mobile,
          COUNT(p_br) as p_br,
          COUNT(tablet) as tablet,
          SUM(br_seg.ddv) as ddv,
          SUM(br_seg.pa) as pa,
          SUM(br_seg.t0002) as t0002,
          SUM(br_seg.t0204) as t0204,
          SUM(br_seg.t0406) as t0406,
          SUM(br_seg.t0608) as t0608,
          SUM(br_seg.t0810) as t0810,
          SUM(br_seg.t1012) as t1012,
          SUM(br_seg.t1214) as t1214,
          SUM(br_seg.t1416) as t1416,
          SUM(br_seg.t1618) as t1618,
          SUM(br_seg.t1820) as t1820,
          SUM(br_seg.t2022) as t2022,
          SUM(br_seg.t2200) as t2200;
      };
      dump breakdown;
      

      And got as output:

      (t,43,43,0,30,7,0,6,0,0,2,5,9,3,2,3,1,4,1,4,4,5)
      (w,17,17,0,10,3,0,4,0,0,1,1,1,0,1,2,1,6,1,1,1,1)
      (x,17,243,0,12,2,17,3,0,243,1,6,4,9,20,55,40,37,21,23,8,19)
      (y,17,108,0,14,2,0,1,0,0,7,3,5,5,3,11,29,4,16,6,13,6)
      (z,6,12,0,4,1,0,1,0,0,0,0,0,2,3,1,0,4,2,0,0,0)
      

      Is this a known bug or a new one? Is there a workaround?

      Attachments

        1. .pig_header (0.2 kB, Giovanni Botta)
        2. .pig_schema (2 kB, Giovanni Botta)
        3. part-r-00000 (14 kB, Giovanni Botta)


          People

            Assignee: Unassigned
            Reporter: Giovanni Botta (giovannibotta)
            Votes: 1
            Watchers: 2
