Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-6815

Timestamps saved via Pandas and PyArrow unreadable in Hive and Presto

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.13.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels:
      None

      Description

      I'm unable to read timestamps saved as Parquet data via Pandas with Hive or Presto. These are the versions of the various pieces of software I'm using:

      • Pandas 0.24.1
      • PyArrow 0.13.0
      • Hadoop 3.0.3
      • Hive 2.3.3
      • Presto 0.215

      The Hadoop setup steps I took can be found here.

      This is what I did to generate the Parquet file.

      from   datetime import datetime
      from   StringIO import StringIO
      
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      
      
      df = pd.DataFrame([{'a': 'Test', 'b': datetime.utcnow()}])
      
      parquet_buffer = StringIO()
      pa_table = pa.Table.from_pandas(df, preserve_index=False)
      writer = pq.ParquetWriter(parquet_buffer, pa_table.schema)
      writer.write_table(pa_table)
      writer.close()
      
      with open('example.pq', 'w+b') as f:
          f.write(parquet_buffer.getvalue())
      

      This is its raw contents:

      $ hexdump -C example.pq
      
      00000000  50 41 52 31 15 04 15 10  15 14 4c 15 02 15 04 12  |PAR1......L.....|
      00000010  00 00 08 1c 04 00 00 00  54 65 73 74 15 00 15 12  |........Test....|
      00000020  15 16 2c 15 02 15 04 15  06 15 06 1c 18 04 54 65  |..,...........Te|
      00000030  73 74 18 04 54 65 73 74  16 00 00 00 00 09 20 02  |st..Test...... .|
      00000040  00 00 00 02 01 01 02 00  26 90 01 1c 15 0c 19 35  |........&......5|
      00000050  04 00 06 19 18 01 61 15  02 16 02 16 80 01 16 88  |......a.........|
      00000060  01 26 38 26 08 1c 36 00  28 04 54 65 73 74 18 04  |.&8&..6.(.Test..|
      00000070  54 65 73 74 00 00 00 15  04 15 10 15 14 4c 15 02  |Test.........L..|
      00000080  15 04 12 00 00 08 1c 10  4f 48 96 63 94 05 00 15  |........OH.c....|
      00000090  00 15 12 15 16 2c 15 02  15 04 15 06 15 06 1c 18  |.....,..........|
      000000a0  08 10 4f 48 96 63 94 05  00 18 08 10 4f 48 96 63  |..OH.c......OH.c|
      000000b0  94 05 00 16 00 00 00 00  09 20 02 00 00 00 02 01  |......... ......|
      000000c0  01 02 00 26 86 03 1c 15  04 19 35 04 00 06 19 18  |...&......5.....|
      000000d0  01 62 15 02 16 02 16 90  01 16 98 01 26 9e 02 26  |.b..........&..&|
      000000e0  ee 01 1c 18 08 10 4f 48  96 63 94 05 00 18 08 10  |......OH.c......|
      000000f0  4f 48 96 63 94 05 00 16  00 28 08 10 4f 48 96 63  |OH.c.....(..OH.c|
      00000100  94 05 00 18 08 10 4f 48  96 63 94 05 00 00 00 00  |......OH.c......|
      00000110  15 02 19 3c 35 00 18 06  73 63 68 65 6d 61 15 04  |...<5...schema..|
      00000120  00 15 0c 25 02 18 01 61  00 15 04 25 02 18 01 62  |...%...a...%...b|
      00000130  25 14 00 16 02 19 1c 19  2c 26 90 01 1c 15 0c 19  |%.......,&......|
      00000140  35 04 00 06 19 18 01 61  15 02 16 02 16 80 01 16  |5......a........|
      00000150  88 01 26 38 26 08 1c 36  00 28 04 54 65 73 74 18  |..&8&..6.(.Test.|
      00000160  04 54 65 73 74 00 00 00  26 86 03 1c 15 04 19 35  |.Test...&......5|
      00000170  04 00 06 19 18 01 62 15  02 16 02 16 90 01 16 98  |......b.........|
      00000180  01 26 9e 02 26 ee 01 1c  18 08 10 4f 48 96 63 94  |.&..&......OH.c.|
      00000190  05 00 18 08 10 4f 48 96  63 94 05 00 16 00 28 08  |.....OH.c.....(.|
      000001a0  10 4f 48 96 63 94 05 00  18 08 10 4f 48 96 63 94  |.OH.c......OH.c.|
      000001b0  05 00 00 00 00 16 a0 02  16 02 00 19 1c 18 06 70  |...............p|
      000001c0  61 6e 64 61 73 18 db 02  7b 22 63 72 65 61 74 6f  |andas...{"creato|
      000001d0  72 22 3a 20 7b 22 76 65  72 73 69 6f 6e 22 3a 20  |r": {"version": |
      000001e0  22 30 2e 31 33 2e 30 22  2c 20 22 6c 69 62 72 61  |"0.13.0", "libra|
      000001f0  72 79 22 3a 20 22 70 79  61 72 72 6f 77 22 7d 2c  |ry": "pyarrow"},|
      00000200  20 22 70 61 6e 64 61 73  5f 76 65 72 73 69 6f 6e  | "pandas_version|
      00000210  22 3a 20 6e 75 6c 6c 2c  20 22 69 6e 64 65 78 5f  |": null, "index_|
      00000220  63 6f 6c 75 6d 6e 73 22  3a 20 5b 5d 2c 20 22 63  |columns": [], "c|
      00000230  6f 6c 75 6d 6e 73 22 3a  20 5b 7b 22 6d 65 74 61  |olumns": [{"meta|
      00000240  64 61 74 61 22 3a 20 6e  75 6c 6c 2c 20 22 66 69  |data": null, "fi|
      00000250  65 6c 64 5f 6e 61 6d 65  22 3a 20 22 61 22 2c 20  |eld_name": "a", |
      00000260  22 6e 61 6d 65 22 3a 20  22 61 22 2c 20 22 6e 75  |"name": "a", "nu|
      00000270  6d 70 79 5f 74 79 70 65  22 3a 20 22 6f 62 6a 65  |mpy_type": "obje|
      00000280  63 74 22 2c 20 22 70 61  6e 64 61 73 5f 74 79 70  |ct", "pandas_typ|
      00000290  65 22 3a 20 22 62 79 74  65 73 22 7d 2c 20 7b 22  |e": "bytes"}, {"|
      000002a0  6d 65 74 61 64 61 74 61  22 3a 20 6e 75 6c 6c 2c  |metadata": null,|
      000002b0  20 22 66 69 65 6c 64 5f  6e 61 6d 65 22 3a 20 22  | "field_name": "|
      000002c0  62 22 2c 20 22 6e 61 6d  65 22 3a 20 22 62 22 2c  |b", "name": "b",|
      000002d0  20 22 6e 75 6d 70 79 5f  74 79 70 65 22 3a 20 22  | "numpy_type": "|
      000002e0  64 61 74 65 74 69 6d 65  36 34 5b 6e 73 5d 22 2c  |datetime64[ns]",|
      000002f0  20 22 70 61 6e 64 61 73  5f 74 79 70 65 22 3a 20  | "pandas_type": |
      00000300  22 64 61 74 65 74 69 6d  65 22 7d 5d 2c 20 22 63  |"datetime"}], "c|
      00000310  6f 6c 75 6d 6e 5f 69 6e  64 65 78 65 73 22 3a 20  |olumn_indexes": |
      00000320  5b 5d 7d 00 18 22 70 61  72 71 75 65 74 2d 63 70  |[]}.."parquet-cp|
      00000330  70 20 76 65 72 73 69 6f  6e 20 31 2e 35 2e 31 2d  |p version 1.5.1-|
      00000340  53 4e 41 50 53 48 4f 54  19 2c 1c 00 00 1c 00 00  |SNAPSHOT.,......|
      00000350  00 41 02 00 00 50 41 52  31                       |.A...PAR1|
      00000359
      

      These are the strings extracted from that file.

      $ strings example.pq
      
      PAR1
      Test
      Test
      Test
      Test
      Test
      schema
      Test
      Test
      pandas
      {"creator": {"version": "0.13.0", "library": "pyarrow"}, "pandas_version": null, "index_columns": [], "columns": [{"metadata": null, "field_name": "a", "name": "a", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "b", "name": "b", "numpy_type": "datetime64[ns]", "pandas_type": "datetime"}], "column_indexes": []}
      "parquet-cpp version 1.5.1-SNAPSHOT
      PAR1
      

      If I copy it onto HDFS I'm unable to read the timestamp. I've tried a variety of data types.

      $ hdfs dfs -mkdir /testing/
      $ hdfs dfs -copyFromLocal example.pq /testing/
      $ hdfs dfs -ls /testing/example.pq
      
      -rw-r--r--   1 ubuntu supergroup        857 2019-10-08 10:26 /testing/example.pq
      
      $ hive
      

      It can't be read as a DATETIME field:

      CREATE EXTERNAL TABLE testing (
          a STRING,
          b DATETIME
      ) STORED AS parquet
      LOCATION '/testing/';
      
      FAILED: SemanticException [Error 10099]: DATETIME type isn't supported yet. Please use DATE or TIMESTAMP instead
      

      It can't be read as a TIMESTAMP:

      CREATE EXTERNAL TABLE testing (
          a STRING,
          b TIMESTAMP
      ) STORED AS parquet
      LOCATION '/testing/';
      
      SELECT * FROM testing;
      
      Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
      

      I can extract it as a BIGINT but I'm unable to find a math formula to convert it to the correct time.

      DROP TABLE `testing`;
      
      CREATE EXTERNAL TABLE `testing` (
          a STRING,
          b BIGINT
      ) STORED AS parquet
      LOCATION '/testing/';
      
      Test    1570530327547664
      
      SELECT CAST(b AS TIMESTAMP) FROM testing;
      
      51738-02-15 08:19:077.664
      
      SELECT CAST(b/1000 AS TIMESTAMP) FROM testing;
      
      51738-02-15 08:19:077.664
      

      I'll generate the same content using Hive and show the contents of the Parquet file it produces:

      CREATE TABLE `testing2` (
          a STRING,
          b TIMESTAMP
      ) STORED AS parquet
      LOCATION '/testing2/';
      
      INSERT INTO testing2 (a, b) VALUES ('Test', "2019-10-08 10:34:30.061");
      
      SELECT * FROM testing2;
      
      Test    2019-10-08 10:34:30.061
      

      The timestamp is usable.

      SELECT YEAR(B) FROM testing2;
      
      2019
      
      $ hdfs dfs -copyToLocal /testing2/000000_0 ./
      $ hexdump -C 000000_0
      
      00000000  50 41 52 31 15 00 15 1c  15 1c 2c 15 02 15 00 15  |PAR1......,.....|
      00000010  06 15 08 1c 18 04 54 65  73 74 18 04 54 65 73 74  |......Test..Test|
      00000020  16 00 00 00 00 02 00 00  00 03 01 04 00 00 00 54  |...............T|
      00000030  65 73 74 15 04 15 18 15  18 4c 15 02 15 04 00 00  |est......L......|
      00000040  40 65 76 e0 9f 22 00 00  8d 84 25 00 15 00 15 10  |@ev.."....%.....|
      00000050  15 10 2c 15 02 15 04 15  06 15 08 1c 18 0c 40 65  |..,...........@e|
      00000060  76 e0 9f 22 00 00 8d 84  25 00 18 0c 40 65 76 e0  |v.."....%...@ev.|
      00000070  9f 22 00 00 8d 84 25 00  16 00 00 00 00 02 00 00  |."....%.........|
      00000080  00 03 01 00 03 15 02 19  3c 48 0b 68 69 76 65 5f  |........<H.hive_|
      00000090  73 63 68 65 6d 61 15 04  00 15 0c 25 02 18 01 61  |schema.....%...a|
      000000a0  25 00 00 15 06 25 02 18  01 62 00 16 02 19 1c 19  |%....%...b......|
      000000b0  2c 26 08 1c 15 0c 19 35  08 06 00 19 18 01 61 15  |,&.....5......a.|
      000000c0  00 16 02 16 5e 16 5e 26  08 3c 18 04 54 65 73 74  |....^.^&.<..Test|
      000000d0  18 04 54 65 73 74 16 00  00 00 00 26 66 1c 15 06  |..Test.....&f...|
      000000e0  19 35 08 04 06 19 18 01  62 15 00 16 02 16 a4 01  |.5......b.......|
      000000f0  16 a4 01 26 66 3c 18 0c  40 65 76 e0 9f 22 00 00  |...&f<..@ev.."..|
      00000100  8d 84 25 00 18 0c 40 65  76 e0 9f 22 00 00 8d 84  |..%...@ev.."....|
      00000110  25 00 16 00 00 00 00 16  82 02 16 02 00 28 49 70  |%............(Ip|
      00000120  61 72 71 75 65 74 2d 6d  72 20 76 65 72 73 69 6f  |arquet-mr versio|
      00000130  6e 20 31 2e 38 2e 31 20  28 62 75 69 6c 64 20 34  |n 1.8.1 (build 4|
      00000140  61 62 61 34 64 61 65 37  62 62 30 64 34 65 64 62  |aba4dae7bb0d4edb|
      00000150  63 66 37 39 32 33 61 65  31 33 33 39 66 32 38 66  |cf7923ae1339f28f|
      00000160  64 33 66 37 66 63 66 29  00 e4 00 00 00 50 41 52  |d3f7fcf).....PAR|
      00000170  31                                                |1|
      00000171
      
      $ strings 000000_0
      
      PAR1
      Test
      Test
      Test
      hive_schema
      Test
      Test
      (Iparquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
      PAR1
      

      I'll do the same for Presto as it has its own writer as well.

      $ presto --server localhost:8085 --schema default --catalog hive
      
      CREATE TABLE testing3 WITH (FORMAT='PARQUET') AS
          SELECT 'Test' as a,
                 CAST('2019-10-08 10:34:30.061' AS TIMESTAMP) AS b;
      
      SELECT * FROM testing3;
      
        a   |            b
      ------+-------------------------
       Test | 2019-10-08 10:34:30.061
      
      SELECT YEAR(b) FROM testing3;
      
       _col0
      -------
        2019
      
      $ hdfs dfs -copyToLocal /user/hive/warehouse/testing3/20191008_105125_00077_trjf2_61453828-1d8f-45a7-88de-3876b84d0cb9 ./
      $ hexdump -C 20191008_105125_00077_trjf2_61453828-1d8f-45a7-88de-3876b84d0cb9
      
      00000000  50 41 52 31 15 00 15 1c  15 40 2c 15 02 15 00 15  |PAR1.....@,.....|
      00000010  06 15 08 1c 18 04 54 65  73 74 18 04 54 65 73 74  |......Test..Test|
      00000020  16 00 00 00 00 1f 8b 08  00 00 00 00 00 00 03 63  |...............c|
      00000030  62 60 60 60 66 64 01 92  21 a9 c5 25 00 fe 12 da  |b```fd..!..%....|
      00000040  11 0e 00 00 00 15 04 15  18 15 40 4c 15 02 15 04  |..........@L....|
      00000050  00 00 1f 8b 08 00 00 00  00 00 00 03 73 48 2d 7b  |............sH-{|
      00000060  30 5f 89 81 a1 b7 45 95  01 00 64 bc 2a 14 0c 00  |0_....E...d.*...|
      00000070  00 00 15 00 15 10 15 38  2c 15 02 15 04 15 06 15  |.......8,.......|
      00000080  08 1c 18 0c 40 65 76 e0  9f 22 00 00 8d 84 25 00  |....@ev.."....%.|
      00000090  18 0c 40 65 76 e0 9f 22  00 00 8d 84 25 00 16 00  |..@ev.."....%...|
      000000a0  00 00 00 1f 8b 08 00 00  00 00 00 00 03 63 62 60  |.............cb`|
      000000b0  60 60 66 64 60 06 00 77  4c 79 ad 08 00 00 00 15  |``fd`..wLy......|
      000000c0  02 19 3c 48 0b 68 69 76  65 5f 73 63 68 65 6d 61  |..<H.hive_schema|
      000000d0  15 04 00 15 0c 25 02 18  01 61 25 00 00 15 06 25  |.....%...a%....%|
      000000e0  02 18 01 62 00 16 02 19  1c 19 2c 26 08 1c 15 0c  |...b......,&....|
      000000f0  19 35 08 00 06 19 18 01  61 15 04 16 02 16 5e 16  |.5......a.....^.|
      00000100  82 01 26 08 3c 18 04 54  65 73 74 18 04 54 65 73  |..&.<..Test..Tes|
      00000110  74 16 00 00 00 00 26 8a  01 1c 15 06 19 35 04 08  |t.....&......5..|
      00000120  06 19 18 01 62 15 04 16  02 16 a4 01 16 f4 01 26  |....b..........&|
      00000130  8a 01 3c 18 0c 40 65 76  e0 9f 22 00 00 8d 84 25  |..<..@ev.."....%|
      00000140  00 18 0c 40 65 76 e0 9f  22 00 00 8d 84 25 00 16  |...@ev.."....%..|
      00000150  00 00 00 00 16 82 02 16  02 00 28 0a 70 61 72 71  |..........(.parq|
      00000160  75 65 74 2d 6d 72 00 a8  00 00 00 50 41 52 31     |uet-mr.....PAR1|
      0000016f
      
      $ strings 20191008_105125_00077_trjf2_61453828-1d8f-45a7-88de-3876b84d0cb9
      
      PAR1
      Test
      Test
      cb```fd
      sH-{0_
      cb```fd`
      hive_schema
      Test
      Test
      parquet-mr
      PAR1
      

      Any idea how I can save timestamps with Pandas and have them readable by Hive and Presto?

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              marklit Mark Litwintschik
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: