  Spark / SPARK-34803

Util methods requiring certain versions of Pandas & PyArrow don't pass through the raised ImportError

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.1
    • Fix Version/s: 3.1.2, 3.2.0
    • Component/s: PySpark
    • Labels: None

    Description

      When checking whether we can import pandas or pyarrow, we catch any ImportError and raise a new error stating the minimum version of the respective package that must be installed in the Python environment.

      However, we don't pass through the ImportError that the package itself might have raised. Take pandas as an example: when we call import pandas, pandas itself might be in the environment but still raise an ImportError (https://github.com/pandas-dev/pandas/blob/0.24.x/pandas/compat/__init__.py#L438) if another package it requires isn't there. That error is not passed through, so we end up with a misleading message stating that pandas isn't in the environment, when in fact it is installed but something else prevents it from being imported.
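
      To illustrate the problem, here is a minimal sketch of a check that swallows the underlying error. It is not Spark's actual code: broken_pandas_import is a hypothetical stand-in for an import pandas that fails because a dependency is missing, and the version number and message wording are assumptions made for the example.

```python
MINIMUM_PANDAS_VERSION = "0.23.2"  # assumed value, for illustration only


def broken_pandas_import():
    """Hypothetical stand-in for `import pandas` when pandas is installed
    but one of its own dependencies is missing."""
    raise ImportError("Missing required dependency 'numpy'")


def require_pandas_unchained(do_import=broken_pandas_import):
    """Version check that swallows the original ImportError."""
    try:
        do_import()
        have_pandas = True
    except ImportError:
        # The error raised by the package itself is dropped at this point.
        have_pandas = False
    if not have_pandas:
        # Illustrative message wording; it blames a missing pandas install.
        raise ImportError(
            "Pandas >= %s must be installed; however, it was not found."
            % MINIMUM_PANDAS_VERSION
        )


try:
    require_pandas_unchained()
except ImportError as err:
    reported = err

print(reported)            # only the "not found" message is visible
print(reported.__cause__)  # no link back to the real failure
```

      The caller sees only the "not found" message, with nothing pointing at the missing dependency that actually caused the failure.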

      I believe this can be improved by chaining the exceptions and am happy to provide said contribution.
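
      A minimal sketch of what chaining could look like, using Python's raise ... from syntax. This is an illustration under assumptions, not necessarily the fix that was merged: broken_pandas_import, the version number, and the message wording are all hypothetical.

```python
MINIMUM_PANDAS_VERSION = "0.23.2"  # assumed value, for illustration only


def broken_pandas_import():
    """Hypothetical stand-in for `import pandas` when pandas is installed
    but one of its own dependencies is missing."""
    raise ImportError("Missing required dependency 'numpy'")


def require_pandas_chained(do_import=broken_pandas_import):
    """Version check that keeps the original ImportError as the cause."""
    try:
        do_import()
    except ImportError as error:
        # `from error` stores the original exception on __cause__, so the
        # traceback shows both the real failure and the version hint.
        raise ImportError(
            "Pandas >= %s must be installed; however, it was not found."
            % MINIMUM_PANDAS_VERSION
        ) from error


try:
    require_pandas_chained()
except ImportError as err:
    chained = err

print(chained)            # the version requirement message
print(chained.__cause__)  # the error raised by the failing import
```

      With the cause attached, an uncaught error prints both tracebacks, joined by "The above exception was the direct cause of the following exception", so the real reason for the failed import is no longer hidden.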

          People

            Assignee: John Hany (johnhany97)
            Reporter: John Hany (johnhany97)
            Votes: 0
            Watchers: 3
