Spark / SPARK-5722

Infer_schema_type incorrect for Integers in pyspark

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.2
    • Component/s: PySpark
    • Labels: None

    Description

      The integer datatype in Python does not match what a Scala/Java Integer is defined as. A Java Integer is a 32-bit signed value with a maximum of 2^31 - 1, while a Python int can hold larger values. Schema inference therefore goes wrong when a value is larger than 2^31 - 1: it is still inferred as an Integer.

      Because the range of valid Python integers is wider than that of Java Integers, inference cannot reliably tell Integer from Long. The wrongly inferred type then causes problems when attempting to save a SchemaRDD as Parquet or JSON.
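      The range mismatch can be reproduced without Spark. The sketch below is only an illustration: it uses Python's standard struct module to pack a value as a 4-byte signed integer, the same width as a Java int. The helper name fits_in_java_int is hypothetical and is not part of PySpark.

```python
import struct


def fits_in_java_int(value):
    """Hypothetical helper: True if value packs into 4 signed bytes."""
    try:
        # ">i" is a big-endian 32-bit signed integer, the width of a Java int.
        struct.pack(">i", value)
        return True
    except struct.error:
        # Raised when the value is outside -2**31 .. 2**31 - 1.
        return False


print(fits_in_java_int(2**31 - 1))  # largest value a Java int can hold
print(fits_in_java_int(2**31))      # one past the limit
# The same value still fits in 8 signed bytes, the width of a Java long.
print(len(struct.pack(">q", 2**31)))
```

      Any value for which this check fails needs a 64-bit column type on the JVM side, whatever Python calls it.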

      Here's an example:

      >>> from pyspark.sql import SQLContext, Row
      >>> sqlCtx = SQLContext(sc)
      >>> rdd = sc.parallelize([Row(f1='a', f2=100000000000000)])
      >>> srdd = sqlCtx.inferSchema(rdd)
      >>> srdd.schema()
      StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
      

      That number needs a LongType in Java, but it is a plain int in Python. When an IntegerType is initially inferred, we need to check the value to see whether it should really be a LongType.
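      A minimal sketch of such a value check follows. It is not the fix that was merged into PySpark: infer_integral_type is a hypothetical name, it returns type names as plain strings so that it runs without Spark, and it is written for Python 3, where int is unbounded (on Python 2 the isinstance check would need to cover both int and long).

```python
JAVA_INT_MIN, JAVA_INT_MAX = -2**31, 2**31 - 1
JAVA_LONG_MIN, JAVA_LONG_MAX = -2**63, 2**63 - 1


def infer_integral_type(value):
    """Hypothetical helper: choose a type name from the value's magnitude."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("expected an integer, got %s" % type(value).__name__)
    if JAVA_INT_MIN <= value <= JAVA_INT_MAX:
        return "IntegerType"
    if JAVA_LONG_MIN <= value <= JAVA_LONG_MAX:
        return "LongType"
    # Assumption of this sketch: values too large for a Java long are an error.
    raise ValueError("%d does not fit in a 64-bit Java long" % value)


print(infer_integral_type(2**31 - 1))
print(infer_integral_type(100000000000000))
```

      Raising for values beyond 64 bits is a design choice of this sketch; mapping them to a decimal type or rejecting them elsewhere would be equally reasonable.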

      More tests:

      >>> from pyspark.sql import _infer_type
      # OK
      >>> print _infer_type(1)
      IntegerType
      # OK
      >>> print _infer_type(2**31-1)
      IntegerType
      # WRONG
      >>> print _infer_type(2**31)
      IntegerType
      # WRONG
      >>> print _infer_type(2**61)
      IntegerType
      # OK
      >>> print _infer_type(2**71)
      LongType
      

      Java Primitive Types defined:
      http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html

      Python Built-in Types:
      https://docs.python.org/2/library/stdtypes.html#typesnumeric

          People

            Assignee: Don Drake (dondrake)
            Reporter: Don Drake (dondrake)