SPARK-38027

Undefined link function causing error in GLM that uses Tweedie family

Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.2
    • Fix Version/s: None
    • Component/s: ML
    • Environment: Running on Mac OS X Monterey

    Description

      I am trying to fit a GLM (GeneralizedLinearRegression) with a Tweedie distribution so I can model insurance use cases. I have set up a very simple example adapted from the docs:

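          import logging

          from py4j.protocol import Py4JJavaError
          from pyspark.ml.linalg import Vectors
          from pyspark.ml.regression import GeneralizedLinearRegression

          # Method of a helper class; self._spark is assumed to be a SparkSession.
          def create_fake_losses_data(self):
              df = self._spark.createDataFrame(
                  [
                      ("a", 100.0, 12, 1, Vectors.dense(0.0, 0.0)),
                      ("b", 0.0, 12, 1, Vectors.dense(1.0, 2.0)),
                      ("c", 0.0, 12, 1, Vectors.dense(0.0, 0.0)),
                      ("d", 2000.0, 12, 1, Vectors.dense(1.0, 1.0)),
                  ],
                  ["user", "label", "offset", "weight", "features"],
              )
              logging.info(df.collect())
              self.fake_data = df
              try:
                  # Tweedie family: the link is given through linkPower; link is left unset.
                  glr = GeneralizedLinearRegression(
                      family="tweedie", variancePower=1.5, linkPower=-1, offsetCol='offset')
                  glr.setRegParam(0.3)
                  model = glr.fit(df)
                  # Logging the model converts it to a string, which calls toString on the JVM side.
                  logging.info(model)
              except Py4JJavaError as e:
                  print(e)
              return self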
      

      This causes the following error:

      py4j.protocol.Py4JJavaError: An error occurred while calling o99.toString.
      : java.util.NoSuchElementException: Failed to find a default value for link
      at org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756)
      at scala.Option.getOrElse(Option.scala:189)
      at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756)
      at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753)
      at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41)
      at org.apache.spark.ml.param.Params.$(params.scala:762)
      at org.apache.spark.ml.param.Params.$$(params.scala:762)
      at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41)
      at org.apache.spark.ml.regression.GeneralizedLinearRegressionModel.toString(GeneralizedLinearRegression.scala:1117)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
      at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
      at py4j.Gateway.invoke(Gateway.java:282)
      at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
      at py4j.commands.CallCommand.execute(CallCommand.java:79)
      at py4j.GatewayConnection.run(GatewayConnection.java:238)
      at java.lang.Thread.run(Thread.java:748)

      I assumed that the default value for link would be None when it is not set explicitly.
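
      For context, as far as I understand the Spark ML docs, the Tweedie family takes its link from linkPower rather than from link: link powers 0, 1, -1 and 0.5 correspond to the log, identity, inverse and sqrt links, and an unset linkPower defaults to 1 - variancePower. A small pure-Python sketch of that mapping follows; the function names `power_link` and `default_link_power` are illustrative only and are not Spark APIs.

```python
import math


def power_link(mu, link_power):
    """Power link g(mu): log(mu) when link_power is 0, otherwise mu ** link_power."""
    if link_power == 0:
        return math.log(mu)
    return mu ** link_power


def default_link_power(variance_power):
    """Documented default used when linkPower is not set: 1 - variancePower."""
    return 1.0 - variance_power
```

      With the settings in this report, linkPower=-1 selects the inverse link, while leaving it unset with variancePower=1.5 would give a link power of -0.5.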
       
       

          People

            Assignee: Unassigned
            Reporter: Evan Zamir (zamir.evan@gmail.com)
            Votes: 0
            Watchers: 2
