SPARK-16845: "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 1.6.4, 2.0.3, 2.1.1, 2.2.0
    • Component/s: SQL
    • Labels:
      None
    • Flags:
      Important

      Description

      I have a wide table (400 columns). When I try fitting the training data on all columns, the following fatal error occurs:

      ... 46 more
      Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
      at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
      at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
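
      (Note: the JVM limits the bytecode of any single method to 65,535 bytes, so Janino rejects the generated compare method once the inline per-column comparison code for hundreds of columns exceeds that cap.)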


          Activity

          jungd David Jung added a comment -

          In addition to receiving this error when attempting to call pyspark.ml.regression.RandomForestRegressor.fit() on a DataFrame with 700+ columns, we also see it when just calling DataFrame.show() on the same wide DataFrame. Our modeling requires many thousands of features, so this is a blocking issue for mllib for us.
          I'll see if digging into it further sheds any more light.

          srowen Sean Owen added a comment -

          Reversing myself: this is a very similar error, but not at exactly the same site. There is reason to believe it's not the same issue.

          dondrake Don Drake added a comment -

          I just hit this bug as well. Are there any suggested workarounds?

          srowen Sean Owen added a comment -

          Q: does setting spark.sql.codegen.wholeStage=false work around this?

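          (For reference, the setting Sean suggests can be applied when building the session. A minimal sketch, with an illustrative app name; spark.sql.codegen.wholeStage is the real config key. As the next comment shows, it does not help here, because the failing classes are generated outside the whole-stage path.)

          // Hedged sketch: start a session with whole-stage code generation disabled.
          import org.apache.spark.sql.SparkSession

          val spark = SparkSession.builder()
            .appName("wholestage-codegen-off") // illustrative name
            .config("spark.sql.codegen.wholeStage", "false")
            .getOrCreate()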
          dondrake Don Drake added a comment -

          Unfortunately, it does not work around it.

          16/10/10 18:19:47 ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
          /* 001 */ public java.lang.Object generate(Object[] references) {
          /* 002 */   return new SpecificUnsafeProjection(references);
          /* 003 */ }
          Utsumi K added a comment -

          We manually rewrote the parts that were throwing errors (StringIndexer and FeatureAssembler) using the RDD API and converted the result back to a DataFrame to run RandomForestClassifier.

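          (A hedged sketch of the shape of that RDD-based workaround, not Utsumi K's actual code: df and the column names are hypothetical. It replaces StringIndexer's transform with a manually built label-to-index map.)

          import org.apache.spark.sql.functions.{col, udf}

          // Build the label-to-index map on the RDD API...
          val labelToIndex: Map[String, Double] = df.rdd       // df: assumed input DataFrame
            .map(_.getAs[String]("category"))                  // "category": hypothetical column
            .distinct()
            .zipWithIndex()
            .map { case (label, idx) => (label, idx.toDouble) }
            .collect()
            .toMap

          // ...then apply it as a plain UDF and continue in the DataFrame API.
          val indexUdf = udf((s: String) => labelToIndex.getOrElse(s, -1.0))
          val indexed = df.withColumn("categoryIndex", indexUdf(col("category")))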
          lwlin Liwei Lin added a comment - - edited

          Don Drake, Utsumi K: could you provide a simple reproducer?

          dondrake Don Drake added a comment -

          I can't at the moment, mine is not simple.

          But this JIRA has one: https://issues.apache.org/jira/browse/SPARK-17092

          lwlin Liwei Lin added a comment -

          Thanks for the pointer, let me look into this.

          Utsumi K added a comment -

          Code and data are here as well:
          https://issues.apache.org/jira/browse/SPARK-17223
          Thanks for looking into it!

          kiszk Kazuaki Ishizaki added a comment -

          Thank you for preparing the case. I noticed that the following small code can reproduce the same exception.

          Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
          
          import org.apache.spark.sql.catalyst.dsl.expressions._   // supplies .asc on expressions
          import org.apache.spark.sql.catalyst.expressions.Literal
          import org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering

          val sortOrder = Literal("abc").asc
          GenerateOrdering.generate(Array.fill(450)(sortOrder))
          
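          (Editorial aside: the fix that eventually landed splits the generated comparisons across many small methods, since the JVM rejects any single method over 64 KB of bytecode. Below is a hedged, hand-written Scala sketch of that chunking idea; it illustrates the technique and is not the actual generated Java.)

          object ChunkedCompare {
            type Cmp = (Array[Double], Array[Double]) => Int

            // One small comparator per chunk of column indices, instead of one
            // method comparing all columns inline.
            def chunkComparators(numCols: Int, chunkSize: Int): Seq[Cmp] =
              (0 until numCols).grouped(chunkSize).map { idxs =>
                (a: Array[Double], b: Array[Double]) =>
                  idxs.iterator
                    .map(i => java.lang.Double.compare(a(i), b(i)))
                    .find(_ != 0)
                    .getOrElse(0)
              }.toSeq

            // Evaluate the chunks in order, stopping at the first non-zero result.
            def compare(a: Array[Double], b: Array[Double], chunks: Seq[Cmp]): Int =
              chunks.iterator.map(_(a, b)).find(_ != 0).getOrElse(0)

            def main(args: Array[String]): Unit = {
              val chunks = chunkComparators(numCols = 450, chunkSize = 50)
              val x = Array.fill(450)(1.0)
              val y = Array.fill(450)(1.0)
              y(449) = 2.0
              println(compare(x, y, chunks)) // -1: rows differ only in the last column
            }
          }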
          apachespark Apache Spark added a comment -

          User 'lw-lin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/15461

          proflin Liwei Lin(Inactive) added a comment -

          Thank you Kazuaki Ishizaki, this is really helpful!

          apachespark Apache Spark added a comment -

          User 'lw-lin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/15480

          harishk15 Harish added a comment -

          I have posted a scenario on Stack Overflow: http://stackoverflow.com/questions/40044779/find-mean-and-corr-of-10-000-columns-in-pyspark-dataframe. Let me know if you need any help on this. If this is already taken care of, you can ignore my comment.

          dondrake Don Drake added a comment -

          Liwei Lin I saw your PR, but noticed it's failing some tests. Just curious if you will have some time to resolve this. If I can help, please let me know.

          -Don

          lwlin Liwei Lin added a comment -

          Hi Don Drake, the latest commit (1ae9935b963b298459c09ab54b3f39f532230fef) is passing. It'd be great if you could check whether the fix works for you. Thanks!

          dondrake Don Drake added a comment -

          I compiled your branch and ran my large job and it finished successfully.

          Sorry for the confusion; I wasn't watching the PR, just this JIRA, and wasn't aware of the changes you were making.

          Can this get merged as well as backported to 2.0.x?

          Thanks so much.

          -Don

          lwlin Liwei Lin added a comment -

          Oh thanks for the feedback; it's helpful!

          The branch you're testing against is one way to fix this, and there's also an alternative; we're still discussing which would be better. I think this will get merged, possibly after Spark Summit Europe. Thanks!

          dondrake Don Drake added a comment -

          Update:

          It turns out that I am still getting this exception. I'll try to create a test case to duplicate it. Basically, I'm exploding a nested data structure, then doing a union, and then saving to Parquet. The resulting table has over 400 columns.

          I verified in spark-shell the exceptions do not occur with the test cases provided.

          Can you point me to your other solution? I can see if that works.

          lwlin Liwei Lin added a comment - - edited

          Hi Don Drake the other solution is still under discussion & in progress. It'd be super helpful if you could create and provide the "explode-union-parquet" reproducer which causes the fix to fail. Thank you!

          dondrake Don Drake added a comment -

          I'm struggling to get a simple case created.

          I'm curious though: if I compile my .jar file using sbt with Spark 2.0.1 but use your compiled branch of Spark 2.1.0-SNAPSHOT as the runtime (spark-submit), would you expect it to work?

          When using your compiled branch of Spark 2.1.0-SNAPSHOT and executing a spark-shell, the test cases provided in this JIRA pass. But my code fails.

          Also, the error message says "grows beyond 64 KB" as the compiler error, but the output generates over 400 KB of source code. I'll try to attach the exact error message and Java code.

          dondrake Don Drake added a comment -

          Does this generated code help in resolving this?

          lwlin Liwei Lin added a comment -

          Don Drake yea I would expect it to work as long as your .jar file does not contain any org.apache.spark binaries, i.e. you're not building a fat jar which includes the spark dependencies.

          The error message in your attached file indicates it's `GeneratedClass$SpecificUnsafeProjection` growing beyond 64 KB - I believe it's somewhat related to `explode` and `project` code generation, which is a different issue from the `order` code generation covered here. :-D

          Would you mind opening a new JIRA for that issue?

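          (For completeness: the usual way to keep org.apache.spark classes out of an application jar in sbt is to mark the Spark modules as "provided". A minimal build.sbt fragment; the version shown is illustrative.)

          // Spark comes from the cluster at runtime rather than from the fat jar.
          libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0" % "provided"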
          dondrake Don Drake added a comment - - edited

          I've been struggling to duplicate this and finally came up with a strategy that reproduces it in a spark-shell. It's the combination of a wide dataset with nested (array) structures and performing a union that seems to trigger it.

          I opened SPARK-18207.

          barrybecker4 Barry Becker added a comment -

          I am encountering a similar exception in Spark 1.6.3 when applying ML to a DataFrame with 204 columns.
          The code used is from the spark-MDLP library, but I don't yet have a reproducible case for you.

                  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:602)
                  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:622)
                  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:619)
                  at org.spark-project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
                  at org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
                  at org.spark-project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
                  ... 39 more
          Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
                  at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
                  at org.codehaus.janino.CodeContext.write(CodeContext.java:836)
                  at org.codehaus.janino.UnitCompiler.writeByte(UnitCompiler.java:10235)
                  at org.codehaus.janino.UnitCompiler.invoke(UnitCompiler.java:10048)
                  at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3986)
                  at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.java:185)
                  at org.codehaus.janino.UnitCompiler$10.visitMethodInvocation(UnitCompiler.java:3263)
                  at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)
                  at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
                  at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3868)
                  at org.codehaus.janino.UnitCompiler.access$8600(UnitCompiler.java:185)
                  at org.codehaus.janino.UnitCompiler$10.visitParenthesizedExpression(UnitCompiler.java:3286)
                  at org.codehaus.janino.Java$ParenthesizedExpression.accept(Java.java:3830)
                  at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
                  at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
                  at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3571)
                  at org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:185)
                  at org.codehaus.janino.UnitCompiler$10.visitConditionalExpression(UnitCompiler.java:3260)
          
          harishk15 Harish added a comment -

          This PR is still open - https://github.com/apache/spark/pull/15480

          dongjoon Dongjoon Hyun added a comment -

          Hi, All.
          I removed the target version since 2.1.0 has passed the vote.

          barrybecker4 Barry Becker added a comment - - edited

          I found a workaround that lets me avoid the 64 KB error, but it still runs much slower than I expected. I switched to a single batch select statement instead of calling withColumn in a loop.
          Here is an example of what I did.
          Old way:

              stringCols.foreach(column => {  
                val qCol = col(column)
                datasetDf = datasetDf
                  .withColumn(column + CLEAN_SUFFIX, when(qCol.isNull, lit(MISSING)).otherwise(qCol))
              })
          

          New way:

          val replaceStringNull = udf((s: String) => if (s == null) MISSING else s)
          var newCols = datasetDf.columns.map(column =>
                if (stringCols.contains(column))
                  replaceStringNull(col(column)).as(column + CLEAN_SUFFIX)
                else col(column))
          datasetDf = datasetDf.select(newCols:_*)
          

          This workaround only works on spark 2.0.2. I still get the 64 KB limit error when running the same thing with 1.6.3.

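          (Editor's note on why the batch select helps: each withColumn call layers another projection onto the logical plan, so a loop over hundreds of columns produces a deeply nested plan and correspondingly larger generated code, while a single select with all replacement expressions keeps it to one projection.)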
          cloud_fan Wenchen Fan added a comment -

          Issue resolved by pull request 15480
          https://github.com/apache/spark/pull/15480

          apachespark Apache Spark added a comment -

          User 'ueshin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/17158

          apachespark Apache Spark added a comment -

          User 'ueshin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/17157

          ethanyxu Ethan Xu added a comment -

          Liwei Lin I encountered the same error when handling a DataFrame with 3000+ columns. After pulling the master branch with the fix, building it, and rerunning the code, I got another exception (see below). Sorry, I don't have simple code to reproduce the error; I just want to see if anyone has seen this before.

          ...
          /* 308609 */     apply_1561(i);
          /* 308610 */     result.setTotalSize(holder.totalSize());
          /* 308611 */     return result;
          /* 308612 */   }
          /* 308613 */ }
          
            at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:941)
            at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:998)
            at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:995)
            at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
            at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
            at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
            at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
            ... 29 more
          Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection has grown past JVM limit of 0xFFFF
            at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
            at org.codehaus.janino.util.ClassFile.addConstantUtf8Info(ClassFile.java:454)
            at org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:440)
            at org.codehaus.janino.util.ClassFile.addConstantFieldrefInfo(ClassFile.java:344)
            at org.codehaus.janino.UnitCompiler.writeConstantFieldrefInfo(UnitCompiler.java:11109)
            at org.codehaus.janino.UnitCompiler.putfield(UnitCompiler.java:10788)
            at org.codehaus.janino.UnitCompiler.compileSet2(UnitCompiler.java:5634)
            at org.codehaus.janino.UnitCompiler.access$12400(UnitCompiler.java:206)
            at org.codehaus.janino.UnitCompiler$17.visitFieldAccess(UnitCompiler.java:5616)
            at org.codehaus.janino.UnitCompiler$17.visitFieldAccess(UnitCompiler.java:5611)
            at org.codehaus.janino.Java$FieldAccess.accept(Java.java:3709)
            at org.codehaus.janino.UnitCompiler.compileSet(UnitCompiler.java:5611)
            at org.codehaus.janino.UnitCompiler.compileSet2(UnitCompiler.java:5625)
            at org.codehaus.janino.UnitCompiler.access$12200(UnitCompiler.java:206)
            at org.codehaus.janino.UnitCompiler$17.visitAmbiguousName(UnitCompiler.java:5614)
            at org.codehaus.janino.UnitCompiler$17.visitAmbiguousName(UnitCompiler.java:5611)
            at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633)
            at org.codehaus.janino.UnitCompiler.compileSet(UnitCompiler.java:5611)
            at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3193)
            at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206)
            at org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143)
            at org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139)
            at org.codehaus.janino.Java$Assignment.accept(Java.java:3847)
            at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
            at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
            at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
            at org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
            at org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
            at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
            at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
            at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
            at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
            at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
            at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
            at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
            at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
            at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
            at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
            at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
            at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
            at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
            at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
            at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
            at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
            at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
            at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
            at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
            at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
            at org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
            at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
            at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
            at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
            at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
            at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229)
            at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:196)
            at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:91)
            at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:935)
            ... 35 more
          
          kiszk Kazuaki Ishizaki added a comment - - edited

          You are seeing another exception. This PR will avoid it, but the PR is still under review.

          ethanyxu Ethan Xu added a comment -

          Thanks Kazuaki Ishizaki! I'm following that PR. I'm surprised that 3000 columns broke the code, since it's not particularly wide data.

          barrybecker4 Barry Becker added a comment -

          I checked out the v2.1.1 tag of Spark from GitHub, but when I build and try to run all the unit tests, it fails on this test:

          • GenerateOrdering with FloatType
          • GenerateOrdering with ShortType
          • SPARK-16845: GeneratedClass$SpecificOrdering grows beyond 64 KB *** FAILED ***
            com.google.common.util.concurrent.ExecutionError: java.lang.StackOverflowError
            at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
            at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
            at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
            at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
            at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:905)
            at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:188)
            at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:43)
            at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:889)
            at org.apache.spark.sql.catalyst.expressions.OrderingSuite$$anonfun$1.apply$mcV$sp(OrderingSuite.scala:138)
            at org.apache.spark.sql.catalyst.expressions.OrderingSuite$$anonfun$1.apply(OrderingSuite.scala:131)
            ...
            Cause: java.lang.StackOverflowError:
            at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370)
            at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
            ...

          The command I used is
          ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.5 package
          Did I do something wrong?

          otissmart Otis Smart added a comment - - edited

          Hello!

          1. I encounter a similar issue (see text below) on PySpark 2.2 (e.g., a dataframe with ~50000 rows x 1100+ columns as input to the .fit() method of a CrossValidator() that includes a Pipeline() with StringIndexer(), VectorAssembler() and DecisionTreeClassifier()).

          2. Was the aforementioned patch (https://github.com/apache/spark/pull/15480) not included in the latest release? What is the reason for, and the solution to, this persistent issue?

          py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
          : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
          /* 001 */ public SpecificOrdering generate(Object[] references) {
          /* 002 */   return new SpecificOrdering(references);
          /* 003 */ }
          /* 004 */
          /* 005 */ class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
          /* 006 */
          /* 007 */   private Object[] references;
          /* 008 */
          /* 009 */
          /* 010 */   public SpecificOrdering(Object[] references) {
          /* 011 */     this.references = references;
          /* 012 */
          /* 013 */   }
          /* 014 */
          /* 015 */
          /* 016 */
          /* 017 */   public int compare(InternalRow a, InternalRow b) {
          /* 018 */     InternalRow i = null; // Holds current row being evaluated.
          /* 019 */
          /* 020 */     i = a;
          /* 021 */     boolean isNullA;
          /* 022 */     double primitiveA;
          /* 023 */     {
          /* 024 */
          /* 025 */       double value = i.getDouble(0);
          /* 026 */       isNullA = false;
          /* 027 */       primitiveA = value;
          /* 028 */     }
          /* 029 */     i = b;
          /* 030 */     boolean isNullB;
          /* 031 */     double primitiveB;
          /* 032 */     {
          /* 033 */
          /* 034 */       double value = i.getDouble(0);
          /* 035 */       isNullB = false;
          /* 036 */       primitiveB = value;
          /* 037 */     }
          /* 038 */     if (isNullA && isNullB) {
          /* 039 */       // Nothing
          /* 040 */     } else if (isNullA) {
          /* 041 */       return -1;
          /* 042 */     } else if (isNullB) {
          /* 043 */       return 1;
          /* 044 */     } else {
          /* 045 */       int comp = org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB);
          /* 046 */       if (comp != 0) {
          /* 047 */         return comp;
          /* 048 */       }
          /* 049 */     }
          /* 050 */
          /* 051 */

          mvelusce Marco Veluscek added a comment -

          Hello,
          I have just encountered a similar issue when doing except on two large dataframes.
          My code fails with an exception under Spark 2.1.0. The same code works with Spark 2.2.0, but logs several exceptions.
          Since I have to work with 2.1.0 because of company policies, I would like to know whether there is a way to fix or work around this issue in 2.1.0.

          Here are more details about the problem.
          On my company cluster, I am working with Spark version 2.1.0.cloudera1 using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112).

          The two dataframes have about 1 million rows and 467 columns.
          When I do dataframe1.except(dataframe2), I get the following exception:

          Exception_with_2.1.0
          scheduler.TaskSetManager: Lost task 10.0 in stage 80.0 (TID 4146, cdhworker05.itec.lab, executor 4): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "eval(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" grows beyond 64 KB
          

          Then the logs show the generated code for the class SpecificPredicate, which is more than 5000 lines long.

          I wrote a small script to reproduce the error:

          testExcept.scala
          import org.apache.spark.sql.functions._
          
          import spark.implicits._
          
          import org.apache.spark.sql.{Row, SparkSession}
          import org.apache.spark.sql.types.{DoubleType, StructField, StructType, IntegerType}
          
          import scala.util.Random
          
          def start(rows: Int, cols: Int, col: String, spark: SparkSession) = {
          
               val data = (1 to rows).map(_ => Seq.fill(cols)(1))
          
               val colNames = (1 to cols).mkString(",")
               val sch = StructType(colNames.split(",").map(fieldName => StructField(fieldName, IntegerType, true)))
          
               val rdd = spark.sparkContext.parallelize(data.map(x => Row(x:_*)))
               spark.sqlContext.createDataFrame(rdd, sch)
          }
          
          val dataframe1 = start(1000, 500, "column", spark)
          val dataframe2 = start(1000, 500, "column", spark)
          
          val res = dataframe1.except(dataframe2)
          
          res.count()
          

          I have also tried with a local Spark installation, version 2.2.0, using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131).
          With this Spark version the code does not fail, but it logs several exceptions, all saying the following:

          Exception_with_2.2.0
          17/09/21 12:42:26 ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "eval(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" grows beyond 64 KB
          

          Then the same generated code is logged.

          In addition, this line is also logged several times:

          17/09/21 12:46:20 WARN SortMergeJoinExec: Codegen disabled for this expression: (((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((...
          

          Since I have to work with Spark 2.1.0, is there a way to work around this problem? Maybe by disabling codegen?

          Thank you for your help.

          kiszk Kazuaki Ishizaki added a comment -

          Marco Veluscek Thank you for reporting the issue with a repro. I can reproduce this.

          If I am correct, Spark 2.2 can fall back to a path that disables codegen thanks to this PR. When we tried to backport this to Spark 2.1, it was rejected.

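          (The knobs involved, as a hedged sketch; both keys exist in Spark 2.x SQLConf. These govern whole-stage codegen only, which is why the missing expression-codegen fallback discussed above still bites on 2.1.0.)

          // Set at runtime in spark-shell:
          spark.conf.set("spark.sql.codegen.wholeStage", "false") // skip whole-stage codegen entirely
          spark.conf.set("spark.sql.codegen.fallback", "true")    // fall back to interpreted execution when compilation fails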
          kiszk Kazuaki Ishizaki added a comment -

          This PR allows us to execute the above program testExcept.scala without throwing an exception.


  People

  • Assignee: lwlin Liwei Lin
  • Reporter: jack15710 hejie
  • Votes: 12
  • Watchers: 31
