Bigtop / BIGTOP-1944 Upgrade Spark version to 1.5.1 / BIGTOP-2154

spark-shell doesn't start anymore without Hive libs in the classpath

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.1.0
    • Component/s: spark
    • Labels:
      None

      Description

      Since BIGTOP-2104, spark-shell is completely broken because it now requires the datanucleus jars to be available for JDO. This is wrong: Spark is an execution engine and should not require a query planner like Hive to be present.
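
      For context, a minimal reproduction on a freshly deployed RPM-based node would look roughly like the sketch below (the package names come from this issue; the exact yum invocations are assumptions):

          # install Spark without the split-off datanucleus package
          sudo yum install spark-core
          spark-shell     # fails: JDO cannot load the datanucleus classes

          # workaround discussed below: install the split-off package as well
          sudo yum install spark-datanucleus
          spark-shell     # starts normally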


          Activity

          cos Konstantin Boudnik added a comment -

          Committed and pushed to the master.

          rnowling RJ Nowling added a comment -

          "never agreed on having Hive be on by default" – yes, this should be decided by the community, with all implications understood. Getting us unstuck for now sounds good, then we can discuss this all at length.

          cos Konstantin Boudnik added a comment -

          My long-term issue is that we never agreed on having Hive on by default. If this is ok with the community - fine, let's keep it, and have these libs installed as well - by default, not as an aux Puppet recipe. Also, we would need to work out funny stuff like:

          • if I run spark-shell as a user who doesn't have write permissions to the current directory, it fails with a "Can not create metadata_db" message or something similar (see the sketch below).
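
          A rough illustration of that failure mode, assuming Hive support creates the Derby metastore directory (typically ./metastore_db) in the current working directory:

              # run spark-shell from a directory the user cannot write to
              cd /
              spark-shell     # Hive support tries to create ./metastore_db here and fails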

          Ok, I will commit this for now.

          rnowling RJ Nowling added a comment -

          That dependency seems reasonable – can the datanucleus RPM be used more broadly than Spark to prevent duplication?

          rnowling RJ Nowling added a comment - edited

          Installing the datanucleus JAR is good for now.

          What's your issue with the requirement on datanucleus long term? (I'm not even really sure what datanucleus does...)

          cos Konstantin Boudnik added a comment -

          Here's an ugly fix, just for the sake of getting 1.1 out. This issue has been hanging over us for far too long. I wish I had caught it during the review and kept it from creeping in ;(

          Feedback is needed!

          cos Konstantin Boudnik added a comment -

          I looked at it again, and I really don't see a good solution ;( Declaring a dependency from spark-core on some other obscure package sounds like a really bad idea. I am almost tempted to just move the datanucleus libs to be a part of spark-core, but that is also quite architecturally unsound. Argh...

          cos Konstantin Boudnik added a comment -

          Ok, I will be moving forward on this by declaring a package dependency between spark-core and spark-datanucleus. This isn't ideal and is conceptually wrong, but at least it will be a clear declaration implying that the core doesn't work without datanucleus. And I want to unblock the release.
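
          Once that dependency is in, it can be sanity-checked on an installed node roughly like this (a sketch; the exact dependency string emitted by the spec is an assumption):

              # verify that spark-core now pulls in the datanucleus package
              rpm -q --requires spark-core | grep -i datanucleus

              # and that the jars actually land on disk
              rpm -ql spark-datanucleus | grep '\.jar$'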

          cos Konstantin Boudnik added a comment -

          Which one do you prefer, the package change or the Puppet one? I favor the former, honestly, as it is the least ugly of the two.

          jayunit100 jay vyas added a comment -

          Yup, sounds good as a workaround.

          cos Konstantin Boudnik added a comment - edited

          In the interest of time and release readiness, I am moving to simply add a package dependency between the spark-core and spark-datanucleus packages as an ugly workaround. Any objections?

          Alternatively, we can fix it in the Puppet layer.

          cos Konstantin Boudnik added a comment -

          And I can confirm that with the spark-datanucleus package manually installed, spark-shell indeed works.

          cos Konstantin Boudnik added a comment -

          Basically, my main concern is this: everything that is installed as a part of a Bigtop deployment should work. We shouldn't bring up the thriftserver by default if it isn't explicitly asked for, and the absence of the datanucleus package (if it wasn't explicitly installed) should not break any other stuff. The easiest way for me to fix it is to get rid of datanucleus entirely, but that doesn't look like a very desirable solution.

          So, could you please take a look and fix it, so we can move forward with the release? Thanks in advance!

          cos Konstantin Boudnik added a comment -

          > but I split the datanucleus jars off to a separate spark-datanucleus RPM
          Oh, I missed that somehow. Joke's on me, I guess.

          Well, what I am saying is not that the Spark assembly expects the datanucleus jars; it is that spark-shell fails without them. Maybe this is a configuration issue? I don't know enough about Spark to tell, but I am trying to figure out what's going on because I am trying to get 1.1 out soon. If you have any idea on this - I'd really appreciate it.

          jonathak Jonathan Kelly added a comment -

          Konstantin Boudnik, sorry for the confusion, but I split the datanucleus jars off to a separate spark-datanucleus RPM. The reason I did this is so that the enormous spark-core RPM can be installed only on the master node (when using Spark on YARN--I'm pretty sure Spark Standalone would require the spark-core on every node), while spark-datanucleus can be installed on all nodes (which is necessary in order to be able to use Hive support with an application run in yarn-cluster mode).

          Also, I did not realize that building Spark with -Phive causes the Spark assembly to expect that the datanucleus jars are present. That's what you are saying, right? This actually seems really strange to me, since I thought that Spark would only use Hive integration if the datanucleus jars are present but be fine otherwise.
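
          In other words, the intended install layout would be roughly as follows (a sketch based on the comment above; the node roles and yum commands are assumptions):

              # master / client node: full Spark plus the datanucleus jars
              sudo yum install spark-core spark-datanucleus

              # every worker node (needed for Hive support in yarn-cluster mode)
              sudo yum install spark-datanucleus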

          cos Konstantin Boudnik added a comment -

          Well, that's weird... the datanucleus jars are a part of the Spark dist assembly, but for whatever reason they don't get extracted during the package build. As a result, they are missing from the installation, which leads to the spark-shell failure. Let me dig more... Perhaps keeping them might be a smaller change after all.

          cos Konstantin Boudnik added a comment -

          Ok, I will try to make the change and validate the patch for this over the weekend.

          jayunit100 jay vyas added a comment -

          I agree: if the hive-thriftserver opts are only there for Hive integration, best to remove them.
          After all, Spark SQL is easy enough that you don't really even need the Hive bindings unless you're planning on supporting the full Hive use case (warehouse, hiveserver, metadata, ...).

          cos Konstantin Boudnik added a comment -

          We don't provide any special pass-through way of adding things to the component builds via the top-level build, and I don't see why Spark should be any different. If someone needs to build Spark with some Hive stuff in it, they can easily do it by modifying do-component-build, as you said.
          Apparently, the issue we see isn't a Bigtop one; it is a lame design on the Spark side. The best we can do is to stay away from using it until it is fixed upstream.

          evans_ye Evans Ye added a comment -

          Agreed. Not everything needs to be enabled by default.
          Perhaps we can add an option like -Psparkbuildopts?
          Or simply add a line next to spark-rpm/spark-deb in the Gradle description which tells users to modify do-component-build directly if needed.

          cos Konstantin Boudnik added a comment -

          Hmmm, looking into Spark's top-level pom.xml I don't see a hive profile at all.
          There's also no bigtop-dist anymore. It needs to be nuked, it seems.

          cos Konstantin Boudnik added a comment -

          I'd suggest we get rid of the -Phive and -Phive-thriftserver profiles at build time until the issue is fixed somehow. In its current shape it is pretty broken ;(
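
          For reference, the proposed change amounts to dropping the Hive profiles from the Maven line in do-component-build, roughly as sketched below (the other flags shown are assumptions, not the actual Bigtop build line):

              # current, Hive-enabled build (roughly):
              mvn -Pyarn -Phive -Phive-thriftserver -DskipTests package

              # proposed: drop the Hive profiles until the upstream issue is sorted out
              mvn -Pyarn -DskipTests package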

          cos Konstantin Boudnik added a comment -

          Unless the said jars are added via --jars, you'll be hitting something like:

          java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
                  at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
                  at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
                  at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
                  at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
                  at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
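
          For illustration, passing the jars explicitly would look roughly like this (the /usr/lib/spark/lib location is an assumption, not taken from this issue):

              # hedged sketch: hand the datanucleus jars to spark-shell explicitly
              DN_JARS=$(echo /usr/lib/spark/lib/datanucleus-*.jar | tr ' ' ',')
              spark-shell --jars "$DN_JARS"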
          

            People

            • Assignee: cos Konstantin Boudnik
            • Reporter: cos Konstantin Boudnik
            • Votes: 0
            • Watchers: 5
