[ARROW-7727] [Python] Unable to read a ParquetDataset when schema validation is on. - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 0.15.1
Fix Version/s: 0.16.0
Component/s: Python
Labels: None
Environment:

_libgcc_mutex 0.1 main
arrow-cpp 0.15.1 py37h982ac2c_6 conda-forge
attrs 19.3.0 py_0 conda-forge
backcall 0.1.0 py_0 conda-forge
bleach 3.1.0 py_0 conda-forge
boost-cpp 1.70.0 h8e57a91_2 conda-forge
brotli 1.0.7 he1b5a44_1000 conda-forge
bzip2 1.0.8 h516909a_2 conda-forge
c-ares 1.15.0 h516909a_1001 conda-forge
ca-certificates 2019.11.28 hecc5488_0 conda-forge
certifi 2019.11.28 py37_0 conda-forge
decorator 4.4.1 py_0 conda-forge
defusedxml 0.6.0 py_0 conda-forge
double-conversion 3.1.5 he1b5a44_2 conda-forge
entrypoints 0.3 py37_1000 conda-forge
gflags 2.2.2 he1b5a44_1002 conda-forge
glog 0.4.0 he1b5a44_1 conda-forge
grpc-cpp 1.25.0 h213be95_2 conda-forge
icu 64.2 he1b5a44_1 conda-forge
importlib_metadata 1.4.0 py37_0 conda-forge
inflect 4.0.0 py37_1 conda-forge
ipykernel 5.1.4 py37h5ca1d4c_0 conda-forge
ipython 7.11.1 py37h5ca1d4c_0 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
jaraco.itertools 5.0.0 py_0 conda-forge
jedi 0.16.0 py37_0 conda-forge
jinja2 2.10.3 py_0 conda-forge
jsonschema 3.2.0 py37_0 conda-forge
jupyter_client 5.3.4 py37_1 conda-forge
jupyter_core 4.6.1 py37_0 conda-forge
ld_impl_linux-64 2.33.1 h53a641e_7
libblas 3.8.0 14_openblas conda-forge
libcblas 3.8.0 14_openblas conda-forge
libedit 3.1.20181209 hc058e9b_0
libevent 2.1.10 h72c5cf5_0 conda-forge
libffi 3.2.1 hd88cf55_4
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_4 conda-forge
liblapack 3.8.0 14_openblas conda-forge
libopenblas 0.3.7 h5ec1e0e_6 conda-forge
libprotobuf 3.11.0 h8b12597_0 conda-forge
libsodium 1.0.17 h516909a_0 conda-forge
libstdcxx-ng 9.1.0 hdf63c60_0
lz4-c 1.8.3 he1b5a44_1001 conda-forge
markupsafe 1.1.1 py37h516909a_0 conda-forge
mistune 0.8.4 py37h516909a_1000 conda-forge
more-itertools 8.1.0 py_0 conda-forge
nbconvert 5.6.1 py37_0 conda-forge
nbformat 5.0.4 py_0 conda-forge
ncurses 6.1 he6710b0_1
notebook 6.0.3 py37_0 conda-forge
numpy 1.17.5 py37h95a1406_0 conda-forge
openssl 1.1.1d h516909a_0 conda-forge
pandas 0.25.3 py37hb3f55d8_0 conda-forge
pandoc 2.9.1.1 0 conda-forge
pandocfilters 1.4.2 py_1 conda-forge
parquet-cpp 1.5.1 2 conda-forge
parso 0.6.0 py_0 conda-forge
pexpect 4.8.0 py37_0 conda-forge
pickleshare 0.7.5 py37_1000 conda-forge
pip 20.0.2 py37_0
prometheus_client 0.7.1 py_0 conda-forge
prompt_toolkit 3.0.2 py_0 conda-forge
ptyprocess 0.6.0 py_1001 conda-forge
pyarrow 0.15.1 py37h8b68381_1 conda-forge
pygments 2.5.2 py_0 conda-forge
pyrsistent 0.15.7 py37h516909a_0 conda-forge
python 3.7.6 h0371630_2
python-dateutil 2.8.1 py_0 conda-forge
pytz 2019.3 py_0 conda-forge
pyzmq 18.1.1 py37h1768529_0 conda-forge
re2 2020.01.01 he1b5a44_0 conda-forge
readline 7.0 h7b6447c_5
send2trash 1.5.0 py_0 conda-forge
setuptools 45.1.0 py37_0
six 1.14.0 py37_0 conda-forge
snappy 1.1.7 he1b5a44_1003 conda-forge
sqlite 3.30.1 h7b6447c_0
terminado 0.8.3 py37_0 conda-forge
testpath 0.4.4 py_0 conda-forge
thrift-cpp 0.12.0 hf3afdfd_1004 conda-forge
tk 8.6.8 hbc83047_0
tornado 6.0.3 py37h516909a_0 conda-forge
traitlets 4.3.3 py37_0 conda-forge
uriparser 0.9.3 he1b5a44_1 conda-forge
wcwidth 0.1.8 py_0 conda-forge
webencodings 0.5.1 py_1 conda-forge
wheel 0.33.6 py37_0
xz 5.2.4 h14c3975_4
zeromq 4.3.2 he1b5a44_2 conda-forge
zipp 2.1.0 py_0 conda-forge
zlib 1.2.11 h7b6447c_3
zstd 1.4.4 h3b9ef0a_1 conda-forge


External issue URL:
https://github.com/apache/arrow/issues/23967

Description

I was trying to read a subset of my Parquet files using the ParquetDataset object with a predefined schema. When it tries to validate the schema, `to_arrow_schema` is called, and the schema object does not support this method. I don't know what is happening; here is a sample.

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

schema = pa.schema([
    pa.field("field1", pa.string()),
    pa.field("field2", pa.string()),
    pa.field("field3", pa.string()),
])

 ...

pq_dataset = pq.ParquetDataset(file_groups[0], schema=schema)

AttributeError: 'pyarrow.lib.Schema' object has no attribute 'to_arrow_schema'

If we check the type of the schema as defined above we get:

type(schema)
pyarrow.lib.Schema

But according to the docs, the required type is `pyarrow.parquet.Schema`. I don't know how to produce an object of this type, since we are forbidden from using the Schema constructor directly.

If we check the implementation on GitHub, we find this line:

dataset_schema = self.schema.to_arrow_schema()

Is this a problem in the schema builder or the parquet dataset object?

People

Assignee: Unassigned

Reporter: Otávio Vasques

Votes: 0

Watchers: 5

Dates

Created: 30/Jan/20 14:48

Updated: 11/Jan/23 07:55

Resolved: 31/Jan/20 00:16