Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.0
    • Fix Version/s: 0.11.0
    • Component/s: Storage
    • Labels:
      None
    • Attachment/s: TAJO-1464.patch (453 kB, Jongyoung Park)

      Issue Links

        Activity

        ASF GitHub Bot added a comment -

        Github user asfgit closed the pull request at:

        https://github.com/apache/tajo/pull/579

        Hudson added a comment -

        FAILURE: Integrated in Tajo-master-CODEGEN-build #405 (See https://builds.apache.org/job/Tajo-master-CODEGEN-build/405/)
        TAJO-1464: Add ORCFileScanner to read ORCFile table. (jihoonson: rev fa063f0e84d4ce9cb7e690a50a6a269289052779)

        • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/orc/HdfsOrcDataSource.java
        • tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/u_data_20.orc
        • tajo-common/src/main/java/org/apache/tajo/util/datetime/DateTimeUtil.java
        • tajo-catalog/tajo-catalog-common/src/main/proto/CatalogProtos.proto
        • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/orc/ORCScanner.java
        • CHANGES
        • tajo-catalog/tajo-catalog-common/src/main/java/org/apache/tajo/catalog/CatalogUtil.java
        • tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/orc/TestORCScanner.java
        • tajo-storage/tajo-storage-hdfs/pom.xml
        • tajo-storage/tajo-storage-common/src/main/resources/storage-default.xml
        • tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java
        • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/orc/FileOrcDataSource.java
        • tajo-storage/tajo-storage-common/src/test/resources/storage-default.xml
        • tajo-common/src/main/java/org/apache/tajo/datum/TimestampDatum.java
        Hudson added a comment -

        FAILURE: Integrated in Tajo-master-build #766 (See https://builds.apache.org/job/Tajo-master-build/766/)
        TAJO-1464: Add ORCFileScanner to read ORCFile table. (jihoonson: rev fa063f0e84d4ce9cb7e690a50a6a269289052779)

        • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/orc/HdfsOrcDataSource.java
        • tajo-storage/tajo-storage-common/src/test/resources/storage-default.xml
        • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/orc/ORCScanner.java
        • tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java
        • tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/orc/TestORCScanner.java
        • tajo-storage/tajo-storage-hdfs/pom.xml
        • tajo-catalog/tajo-catalog-common/src/main/proto/CatalogProtos.proto
        • tajo-common/src/main/java/org/apache/tajo/util/datetime/DateTimeUtil.java
        • tajo-common/src/main/java/org/apache/tajo/datum/TimestampDatum.java
        • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/orc/FileOrcDataSource.java
        • tajo-catalog/tajo-catalog-common/src/main/java/org/apache/tajo/catalog/CatalogUtil.java
        • CHANGES
        • tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/u_data_20.orc
        • tajo-storage/tajo-storage-common/src/main/resources/storage-default.xml
        Jihoon Son added a comment -

        Committed to master.
        Thanks for your work!

        ASF GitHub Bot added a comment -

        Github user eminency commented on the pull request:

        https://github.com/apache/tajo/pull/579#issuecomment-124004717

        @jihoonson Thanks a lot

        ASF GitHub Bot added a comment -

        Github user jihoonson commented on the pull request:

        https://github.com/apache/tajo/pull/579#issuecomment-124004616

        Great! I'll commit now.

        ASF GitHub Bot added a comment -

        Github user jihoonson commented on the pull request:

        https://github.com/apache/tajo/pull/579#issuecomment-123988901

        I missed +1.

        ASF GitHub Bot added a comment -

        Github user jihoonson commented on the pull request:

        https://github.com/apache/tajo/pull/579#issuecomment-123988846

        @eminency, thanks for your work. Your patch looks good. I left some trivial comments.

        I've also tested your patch on a real cluster. Here is an interesting result.

        ```
        default> \d lineitem_orc_snappy

        table name: default.lineitem_orc_snappy
        store type: ORC
        number of rows: unknown
        volume: 23.6 GB
        Options:

        schema:
        l_orderkey INT8
        l_partkey INT8
        l_suppkey INT8
        l_linenumber INT8
        l_quantity FLOAT8
        l_extendedprice FLOAT8
        l_discount FLOAT8
        l_tax FLOAT8
        l_returnflag TEXT
        l_linestatus TEXT
        l_shipdate DATE
        l_commitdate DATE
        l_receiptdate DATE
        l_shipinstruct TEXT
        l_shipmode TEXT
        l_comment TEXT

        default> select count(*) from lineitem_orc_snappy;
        Progress: 100%, response time: 1.601 sec
        ?count
        -------------------------------
        600037902
        (1 rows, 1.601 sec, 10 B selected)
        ```

        ASF GitHub Bot added a comment -

        Github user jihoonson commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/579#discussion_r35288396

        — Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/orc/ORCScanner.java —
        @@ -0,0 +1,323 @@
        +/**
        + * Licensed to the Apache Software Foundation (ASF) under one
        + * or more contributor license agreements. See the NOTICE file
        + * distributed with this work for additional information
        + * regarding copyright ownership. The ASF licenses this file
        + * to you under the Apache License, Version 2.0 (the
        + * "License"); you may not use this file except in compliance
        + * with the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.tajo.storage.orc;
        +
        +import com.google.protobuf.InvalidProtocolBufferException;
        +import org.apache.commons.logging.Log;
        +import org.apache.commons.logging.LogFactory;
        +import org.apache.hadoop.conf.Configuration;
        +import org.apache.hadoop.fs.FSDataInputStream;
        +import org.apache.hadoop.fs.FileSystem;
        +import org.apache.hadoop.fs.Path;
        +import org.apache.tajo.catalog.Schema;
        +import org.apache.tajo.catalog.TableMeta;
        +import org.apache.tajo.common.TajoDataTypes;
        +import org.apache.tajo.conf.TajoConf;
        +import org.apache.tajo.datum.*;
        +import org.apache.tajo.exception.UnsupportedException;
        +import org.apache.tajo.plan.expr.EvalNode;
        +import org.apache.tajo.storage.FileScanner;
        +import org.apache.tajo.storage.StorageConstants;
        +import org.apache.tajo.storage.Tuple;
        +import org.apache.tajo.storage.VTuple;
        +import org.apache.tajo.storage.fragment.Fragment;
        +import com.facebook.presto.orc.*;
        +import com.facebook.presto.orc.metadata.OrcMetadataReader;
        +import org.apache.tajo.storage.thirdparty.orc.HdfsOrcDataSource;
        +import org.apache.tajo.util.datetime.DateTimeUtil;
        +import org.joda.time.DateTimeZone;
        +
        +import java.io.IOException;
        +import java.util.HashSet;
        +import java.util.Set;
        +
        +/**
        + * OrcScanner for reading ORC files
        + */
        +public class ORCScanner extends FileScanner {
        + private static final Log LOG = LogFactory.getLog(ORCScanner.class);
        + private OrcRecordReader recordReader;
        + private Vector [] vectors;
        + private int currentPosInBatch = 0;
        + private int batchSize = 0;
        +
        + public ORCScanner(Configuration conf, final Schema schema, final TableMeta meta, final Fragment fragment) {
        + super(conf, schema, meta, fragment);
        + }

        +
        + private Vector createOrcVector(TajoDataTypes.DataType type) {
        + switch (type.getType()) {
        + case INT1: case INT2: case INT4: case INT8:
        + case INET4:
        + case TIMESTAMP:
        + case DATE:
        + return new LongVector();
        +
        + case FLOAT4:
        + case FLOAT8:
        + return new DoubleVector();
        +
        + case BOOLEAN:
        + case NULL_TYPE:
        + return new BooleanVector();
        +
        + case BLOB:
        + case TEXT:
        + case CHAR:
        + case PROTOBUF:
        + return new SliceVector();
        +
        + default:
        + throw new UnsupportedException("This data type is not supported currently: " + type.toString());
        + }
        + }
        +
        + private FileSystem fs;
        + private FSDataInputStream fis;
        +
        + private static class ColumnInfo {
        + TajoDataTypes.DataType type;
        + int id;
        + }

        +
        + /**
        + * Temporary array for caching column info
        + */
        + private ColumnInfo [] targetColInfo;
        +
        + @Override
        + public void init() throws IOException {
        + OrcReader orcReader;
        +
        + if (targets == null) {
        + targets = schema.toArray();
        + }

        +
        + super.init();
        +
        + Path path = fragment.getPath();
        +
        + if (fs == null) {
        + fs = FileScanner.getFileSystem((TajoConf) conf, path);
        + }

        +
        + if (fis == null) {
        + fis = fs.open(path);
        + }

        +
        + OrcDataSource orcDataSource = new HdfsOrcDataSource(
        + this.fragment.getPath().toString(),
        + fis,
        + fs.getFileStatus(path).getLen(),
        + Integer.parseInt(meta.getOption(StorageConstants.ORC_MAX_MERGE_DISTANCE,
        + StorageConstants.DEFAULT_ORC_MAX_MERGE_DISTANCE)));
        +
        + targetColInfo = new ColumnInfo[targets.length];
        + for (int i = 0; i < targets.length; i++) {
        + ColumnInfo cinfo = new ColumnInfo();
        + cinfo.type = targets[i].getDataType();
        + cinfo.id = schema.getColumnId(targets[i].getQualifiedName());
        + targetColInfo[i] = cinfo;
        + }

        +
        + // creating vectors for buffering
        + vectors = new Vector[targetColInfo.length];
        + for (int i = 0; i < targetColInfo.length; i++) {
        + vectors[i] = createOrcVector(targetColInfo[i].type);
        + }

        +
        + Set<Integer> columnSet = new HashSet<Integer>();
        + for (ColumnInfo colInfo : targetColInfo) {
        + columnSet.add(colInfo.id);
        + }

        +
        + orcReader = new OrcReader(orcDataSource, new OrcMetadataReader());
        +
        + // TODO: make OrcPredicate useful
        + // TODO: TimeZone should be from conf
        + // TODO: it might be splittable
        + recordReader = orcReader.createRecordReader(columnSet, OrcPredicate.TRUE,
        + fragment.getStartKey(), fragment.getLength(), DateTimeZone.getDefault());
        +
        + LOG.debug("file fragment { path: " + fragment.getPath()
        + + ", start offset: " + fragment.getStartKey()
        + + ", length: " + fragment.getLength() + "}");
        +
        + getNextBatch();
        + }
        +
        + @Override
        + public Tuple next() throws IOException {
        + if (currentPosInBatch == batchSize) {
        + getNextBatch();
        +
        + // EOF
        + if (batchSize == -1) {
        + return null;
        + }

        + }
        +
        + Tuple tuple = new VTuple(targets.length);
        — End diff –

        After https://issues.apache.org/jira/browse/TAJO-1343, ```tuple``` should be a singleton.

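The reviewer's suggestion above (reuse one tuple across `next()` calls instead of allocating a fresh `VTuple` per row) can be sketched generically. This is an illustrative standalone example, not Tajo API; `RowScanner` and `outTuple` are hypothetical names:

```java
import java.util.Arrays;

// Illustrates the "singleton tuple" pattern from the review comment:
// allocate the output tuple once, then overwrite its fields on every
// next() call instead of constructing a new tuple per row.
class RowScanner {
    private final int[][] batch;     // stand-in for a decoded column batch
    private int pos = 0;
    private final Object[] outTuple; // allocated once, reused per next()

    RowScanner(int[][] batch, int width) {
        this.batch = batch;
        this.outTuple = new Object[width];
    }

    /** Returns the shared tuple refilled with the next row, or null at EOF. */
    Object[] next() {
        if (pos == batch.length) {
            return null; // EOF
        }
        for (int i = 0; i < outTuple.length; i++) {
            outTuple[i] = batch[pos][i];
        }
        pos++;
        return outTuple;
    }
}

public class Demo {
    public static void main(String[] args) {
        RowScanner s = new RowScanner(new int[][]{{1, 2}, {3, 4}}, 2);
        Object[] t1 = s.next();
        System.out.println(Arrays.toString(t1)); // [1, 2]
        Object[] t2 = s.next();
        System.out.println(t1 == t2);            // true: same object, new contents
        System.out.println(Arrays.toString(t2)); // [3, 4]
    }
}
```

The trade-off is that callers must copy the tuple if they need to retain a row beyond the next `next()` call, since its contents are overwritten in place.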
        ASF GitHub Bot added a comment -

        Github user jihoonson commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/579#discussion_r35288012

        — Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/orc/ORCScanner.java —
        @@ -0,0 +1,323 @@
        +/**
        + * Licensed to the Apache Software Foundation (ASF) under one
        + * or more contributor license agreements. See the NOTICE file
        + * distributed with this work for additional information
        + * regarding copyright ownership. The ASF licenses this file
        + * to you under the Apache License, Version 2.0 (the
        + * "License"); you may not use this file except in compliance
        + * with the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.tajo.storage.orc;
        +
        +import com.google.protobuf.InvalidProtocolBufferException;
        +import org.apache.commons.logging.Log;
        +import org.apache.commons.logging.LogFactory;
        +import org.apache.hadoop.conf.Configuration;
        +import org.apache.hadoop.fs.FSDataInputStream;
        +import org.apache.hadoop.fs.FileSystem;
        +import org.apache.hadoop.fs.Path;
        +import org.apache.tajo.catalog.Schema;
        +import org.apache.tajo.catalog.TableMeta;
        +import org.apache.tajo.common.TajoDataTypes;
        +import org.apache.tajo.conf.TajoConf;
        +import org.apache.tajo.datum.*;
        +import org.apache.tajo.exception.UnsupportedException;
        +import org.apache.tajo.plan.expr.EvalNode;
        +import org.apache.tajo.storage.FileScanner;
        +import org.apache.tajo.storage.StorageConstants;
        +import org.apache.tajo.storage.Tuple;
        +import org.apache.tajo.storage.VTuple;
        +import org.apache.tajo.storage.fragment.Fragment;
        +import com.facebook.presto.orc.*;
        +import com.facebook.presto.orc.metadata.OrcMetadataReader;
        +import org.apache.tajo.storage.thirdparty.orc.HdfsOrcDataSource;
        +import org.apache.tajo.util.datetime.DateTimeUtil;
        +import org.joda.time.DateTimeZone;
        +
        +import java.io.IOException;
        +import java.util.HashSet;
        +import java.util.Set;
        +
        +/**
        + * OrcScanner for reading ORC files
        + */
        +public class ORCScanner extends FileScanner {
        + private static final Log LOG = LogFactory.getLog(ORCScanner.class);
        + private OrcRecordReader recordReader;
        + private Vector [] vectors;
        + private int currentPosInBatch = 0;
        + private int batchSize = 0;
        +
        + public ORCScanner(Configuration conf, final Schema schema, final TableMeta meta, final Fragment fragment) {
        + super(conf, schema, meta, fragment);
        + }

        +
        + private Vector createOrcVector(TajoDataTypes.DataType type) {
        + switch (type.getType()) {
        + case INT1: case INT2: case INT4: case INT8:
        + case INET4:
        + case TIMESTAMP:
        + case DATE:
        + return new LongVector();
        +
        + case FLOAT4:
        + case FLOAT8:
        + return new DoubleVector();
        +
        + case BOOLEAN:
        + case NULL_TYPE:
        + return new BooleanVector();
        +
        + case BLOB:
        + case TEXT:
        + case CHAR:
        + case PROTOBUF:
        + return new SliceVector();
        +
        + default:
        + throw new UnsupportedException("This data type is not supported currently: "+type.toString());
        — End diff –

        According to the recent changes in https://issues.apache.org/jira/browse/TAJO-1625, ```UnimplementedException``` looks more appropriate.

        ASF GitHub Bot added a comment -

        Github user eminency commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/579#discussion_r35062642

        — Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/orc/ORCScanner.java —
        @@ -0,0 +1,328 @@
        +/**
        + * Licensed to the Apache Software Foundation (ASF) under one
        + * or more contributor license agreements. See the NOTICE file
        + * distributed with this work for additional information
        + * regarding copyright ownership. The ASF licenses this file
        + * to you under the Apache License, Version 2.0 (the
        + * "License"); you may not use this file except in compliance
        + * with the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.tajo.storage.orc;
        +
        +import com.google.protobuf.InvalidProtocolBufferException;
        +import org.apache.commons.logging.Log;
        +import org.apache.commons.logging.LogFactory;
        +import org.apache.hadoop.conf.Configuration;
        +import org.apache.hadoop.fs.FSDataInputStream;
        +import org.apache.hadoop.fs.FileSystem;
        +import org.apache.hadoop.fs.Path;
        +import org.apache.tajo.catalog.Schema;
        +import org.apache.tajo.catalog.TableMeta;
        +import org.apache.tajo.common.TajoDataTypes;
        +import org.apache.tajo.conf.TajoConf;
        +import org.apache.tajo.datum.*;
        +import org.apache.tajo.exception.UnsupportedException;
        +import org.apache.tajo.plan.expr.EvalNode;
        +import org.apache.tajo.storage.FileScanner;
        +import org.apache.tajo.storage.StorageConstants;
        +import org.apache.tajo.storage.Tuple;
        +import org.apache.tajo.storage.VTuple;
        +import org.apache.tajo.storage.fragment.Fragment;
        +import com.facebook.presto.orc.*;
        +import com.facebook.presto.orc.metadata.OrcMetadataReader;
        +import org.apache.tajo.storage.thirdparty.orc.HdfsOrcDataSource;
        +import org.apache.tajo.util.datetime.DateTimeUtil;
        +import org.joda.time.DateTimeZone;
        +
        +import java.io.IOException;
        +import java.util.HashSet;
        +import java.util.Set;
        +
        +/**
        + * OrcScanner for reading ORC files
        + */
        +public class ORCScanner extends FileScanner {
        + private static final Log LOG = LogFactory.getLog(ORCScanner.class);
        + private OrcRecordReader recordReader;
        + private Vector [] vectors;
        + private int currentPosInBatch = 0;
        + private int batchSize = 0;
        +
        + public ORCScanner(Configuration conf, final Schema schema, final TableMeta meta, final Fragment fragment) {
        +   super(conf, schema, meta, fragment);
        + }

        +
        + private Vector createOrcVector(TajoDataTypes.DataType type) {
        + switch (type.getType()) {
        + case INT1: case INT2: case INT4: case INT8:
        + case UINT1: case UINT2: case UINT4: case UINT8:
        — End diff –

        I see, I will.

        githubbot ASF GitHub Bot added a comment -

        Github user eminency commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/579#discussion_r35062424

        — Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/orc/ORCScanner.java —
        @@ -0,0 +1,328 @@
        +/**
        + * Licensed to the Apache Software Foundation (ASF) under one
        + * or more contributor license agreements. See the NOTICE file
        + * distributed with this work for additional information
        + * regarding copyright ownership. The ASF licenses this file
        + * to you under the Apache License, Version 2.0 (the
        + * "License"); you may not use this file except in compliance
        + * with the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.tajo.storage.orc;
        +
        +import com.google.protobuf.InvalidProtocolBufferException;
        +import org.apache.commons.logging.Log;
        +import org.apache.commons.logging.LogFactory;
        +import org.apache.hadoop.conf.Configuration;
        +import org.apache.hadoop.fs.FSDataInputStream;
        +import org.apache.hadoop.fs.FileSystem;
        +import org.apache.hadoop.fs.Path;
        +import org.apache.tajo.catalog.Schema;
        +import org.apache.tajo.catalog.TableMeta;
        +import org.apache.tajo.common.TajoDataTypes;
        +import org.apache.tajo.conf.TajoConf;
        +import org.apache.tajo.datum.*;
        +import org.apache.tajo.exception.UnsupportedException;
        +import org.apache.tajo.plan.expr.EvalNode;
        +import org.apache.tajo.storage.FileScanner;
        +import org.apache.tajo.storage.StorageConstants;
        +import org.apache.tajo.storage.Tuple;
        +import org.apache.tajo.storage.VTuple;
        +import org.apache.tajo.storage.fragment.Fragment;
        +import com.facebook.presto.orc.*;
        +import com.facebook.presto.orc.metadata.OrcMetadataReader;
        +import org.apache.tajo.storage.thirdparty.orc.HdfsOrcDataSource;
        +import org.apache.tajo.util.datetime.DateTimeUtil;
        +import org.joda.time.DateTimeZone;
        +
        +import java.io.IOException;
        +import java.util.HashSet;
        +import java.util.Set;
        +
        +/**
        + * OrcScanner for reading ORC files
        + */
        +public class ORCScanner extends FileScanner {
        + private static final Log LOG = LogFactory.getLog(ORCScanner.class);
        + private OrcRecordReader recordReader;
        + private Vector [] vectors;
        + private int currentPosInBatch = 0;
        + private int batchSize = 0;
        +
        + public ORCScanner(Configuration conf, final Schema schema, final TableMeta meta, final Fragment fragment) {
        +   super(conf, schema, meta, fragment);
        + }

        +
        + private Vector createOrcVector(TajoDataTypes.DataType type) {
        + switch (type.getType()) {
        + case INT1: case INT2: case INT4: case INT8:
        + case UINT1: case UINT2: case UINT4: case UINT8:
        + case INET4:
        + case TIMESTAMP:
        — End diff –

        It is a Tajo type. It is stored as an integer type in the ORC file.
        The code just performs the conversion.
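        The conversion described here can be sketched in isolation (hypothetical helper names; this is not Tajo's actual DateTimeUtil code): the ORC LongVector hands the scanner a raw long per row, and the reader wraps it into the engine's own timestamp representation.

```java
import java.time.Instant;

// Hypothetical illustration, NOT Tajo's DateTimeUtil: an ORC LongVector exposes
// raw epoch-millisecond longs, and the scanner converts each one into the
// engine's timestamp value before emitting a tuple.
public class TimestampConversionSketch {
    // Stand-in for DatumFactory.createTimestamp(javaTimeToJulianTime(millis)).
    static Instant toEngineTimestamp(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis);
    }

    public static void main(String[] args) {
        long fromOrcVector = 0L; // value as read from LongVector.vector[pos]
        System.out.println(toEngineTimestamp(fromOrcVector)); // 1970-01-01T00:00:00Z
    }
}
```

        The point of the indirection is that the on-disk integer is format-specific, while the datum the scanner returns is engine-specific; only this one conversion site needs to know both.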

        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik commented on the pull request:

        https://github.com/apache/tajo/pull/579#issuecomment-122869416

        I've left some trivial comments.

        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/579#discussion_r34989058

        — Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/orc/ORCScanner.java —
        @@ -0,0 +1,328 @@
        +/**
        + * Licensed to the Apache Software Foundation (ASF) under one
        + * or more contributor license agreements. See the NOTICE file
        + * distributed with this work for additional information
        + * regarding copyright ownership. The ASF licenses this file
        + * to you under the Apache License, Version 2.0 (the
        + * "License"); you may not use this file except in compliance
        + * with the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.tajo.storage.orc;
        +
        +import com.google.protobuf.InvalidProtocolBufferException;
        +import org.apache.commons.logging.Log;
        +import org.apache.commons.logging.LogFactory;
        +import org.apache.hadoop.conf.Configuration;
        +import org.apache.hadoop.fs.FSDataInputStream;
        +import org.apache.hadoop.fs.FileSystem;
        +import org.apache.hadoop.fs.Path;
        +import org.apache.tajo.catalog.Schema;
        +import org.apache.tajo.catalog.TableMeta;
        +import org.apache.tajo.common.TajoDataTypes;
        +import org.apache.tajo.conf.TajoConf;
        +import org.apache.tajo.datum.*;
        +import org.apache.tajo.exception.UnsupportedException;
        +import org.apache.tajo.plan.expr.EvalNode;
        +import org.apache.tajo.storage.FileScanner;
        +import org.apache.tajo.storage.StorageConstants;
        +import org.apache.tajo.storage.Tuple;
        +import org.apache.tajo.storage.VTuple;
        +import org.apache.tajo.storage.fragment.Fragment;
        +import com.facebook.presto.orc.*;
        +import com.facebook.presto.orc.metadata.OrcMetadataReader;
        +import org.apache.tajo.storage.thirdparty.orc.HdfsOrcDataSource;
        +import org.apache.tajo.util.datetime.DateTimeUtil;
        +import org.joda.time.DateTimeZone;
        +
        +import java.io.IOException;
        +import java.util.HashSet;
        +import java.util.Set;
        +
        +/**
        + * OrcScanner for reading ORC files
        + */
        +public class ORCScanner extends FileScanner {
        + private static final Log LOG = LogFactory.getLog(ORCScanner.class);
        + private OrcRecordReader recordReader;
        + private Vector [] vectors;
        + private int currentPosInBatch = 0;
        + private int batchSize = 0;
        +
        + public ORCScanner(Configuration conf, final Schema schema, final TableMeta meta, final Fragment fragment) {
        +   super(conf, schema, meta, fragment);
        + }

        +
        + private Vector createOrcVector(TajoDataTypes.DataType type) {
        + switch (type.getType()) {
        + case INT1: case INT2: case INT4: case INT8:
        + case UINT1: case UINT2: case UINT4: case UINT8:
        + case INET4:
        + case TIMESTAMP:
        + case DATE:
        +   return new LongVector();
        +
        + case FLOAT4:
        + case FLOAT8:
        +   return new DoubleVector();
        +
        + case BOOLEAN:
        + case NULL_TYPE:
        +   return new BooleanVector();
        +
        + case BLOB:
        + case TEXT:
        + case CHAR:
        + case PROTOBUF:
        +   return new SliceVector();
        +
        + default:
        +   throw new UnsupportedException("This data type is not supported currently: "+type.toString());
        + }

        + }
        +
        + private FileSystem fs;
        + private FSDataInputStream fis;
        +
        + private static class ColumnInfo {
        +   TajoDataTypes.DataType type;
        +   int id;
        + }

        +
        + /**
        + * Temporary array for caching column info
        + */
        + private ColumnInfo [] targetColInfo;
        +
        + @Override
        + public void init() throws IOException {
        + OrcReader orcReader;
        +
        + if (targets == null) {
        +   targets = schema.toArray();
        + }

        +
        + super.init();
        +
        + Path path = fragment.getPath();
        +
        + if(fs == null) {
        +   fs = FileScanner.getFileSystem((TajoConf)conf, path);
        + }

        +
        + if(fis == null) {
        +   fis = fs.open(path);
        + }

        +
        + OrcDataSource orcDataSource = new HdfsOrcDataSource(
        + this.fragment.getPath().toString(),
        + fis,
        + fs.getFileStatus(path).getLen(),
        + Integer.parseInt(meta.getOption(StorageConstants.ORC_MAX_MERGE_DISTANCE,
        + StorageConstants.DEFAULT_ORC_MAX_MERGE_DISTANCE)));
        +
        + targetColInfo = new ColumnInfo[targets.length];
        + for (int i=0; i<targets.length; i++) {
        +   ColumnInfo cinfo = new ColumnInfo();
        +   cinfo.type = targets[i].getDataType();
        +   cinfo.id = schema.getColumnId(targets[i].getQualifiedName());
        +   targetColInfo[i] = cinfo;
        + }

        +
        + // creating vectors for buffering
        + vectors = new Vector[targetColInfo.length];
        + for (int i=0; i<targetColInfo.length; i++) {
        +   vectors[i] = createOrcVector(targetColInfo[i].type);
        + }

        +
        + Set<Integer> columnSet = new HashSet<Integer>();
        + for (ColumnInfo colInfo: targetColInfo) {
        +   columnSet.add(colInfo.id);
        + }

        +
        + orcReader = new OrcReader(orcDataSource, new OrcMetadataReader());
        +
        + // TODO: make OrcPredicate useful
        + // TODO: TimeZone should be from conf
        + // TODO: it might be splittable
        + recordReader = orcReader.createRecordReader(columnSet, OrcPredicate.TRUE,
        + fragment.getStartKey(), fragment.getLength(), DateTimeZone.getDefault());
        +
        + LOG.debug("file fragment { path: " + fragment.getPath() +
        +   ", start offset: " + fragment.getStartKey() +
        +   ", length: " + fragment.getLength() + "}");
        +
        + getNextBatch();
        + }
        +
        + @Override
        + public Tuple next() throws IOException {
        + if (currentPosInBatch == batchSize) {
        + getNextBatch();
        +
        + // EOF
        + if (batchSize == -1) {
        +   return null;
        + }

        + }
        +
        + Tuple tuple = new VTuple(targets.length);
        +
        + for (int i=0; i<targetColInfo.length; i++) {
        +   tuple.put(i, createValueDatum(vectors[i], targetColInfo[i].type));
        + }

        +
        + currentPosInBatch++;
        +
        + return tuple;
        + }
        +
        + // TODO: support more types
        + private Datum createValueDatum(Vector vector, TajoDataTypes.DataType type) {
        + switch (type.getType()) {
        + case INT1:
        + case UINT1:
        + case INT2:
        + case UINT2:
        + if (((LongVector) vector).isNull[currentPosInBatch])
        + return NullDatum.get();
        +
        + return DatumFactory.createInt2((short) ((LongVector) vector).vector[currentPosInBatch]);
        +
        + case INT4:
        + case UINT4:
        + if (((LongVector) vector).isNull[currentPosInBatch])
        + return NullDatum.get();
        +
        + return DatumFactory.createInt4((int) ((LongVector) vector).vector[currentPosInBatch]);
        +
        + case INT8:
        + case UINT8:
        + if (((LongVector) vector).isNull[currentPosInBatch])
        + return NullDatum.get();
        +
        + return DatumFactory.createInt8(((LongVector) vector).vector[currentPosInBatch]);
        +
        + case FLOAT4:
        + if (((DoubleVector) vector).isNull[currentPosInBatch])
        + return NullDatum.get();
        +
        + return DatumFactory.createFloat4((float) ((DoubleVector) vector).vector[currentPosInBatch]);
        +
        + case FLOAT8:
        + if (((DoubleVector) vector).isNull[currentPosInBatch])
        + return NullDatum.get();
        +
        + return DatumFactory.createFloat8(((DoubleVector) vector).vector[currentPosInBatch]);
        +
        + case BOOLEAN:
        + if (((BooleanVector) vector).isNull[currentPosInBatch])
        + return NullDatum.get();
        +
        + return ((BooleanVector) vector).vector[currentPosInBatch] ? BooleanDatum.TRUE : BooleanDatum.FALSE;
        +
        + case CHAR:
        + if (((SliceVector) vector).vector[currentPosInBatch] == null)
        + return NullDatum.get();
        +
        + return DatumFactory.createChar(((SliceVector) vector).vector[currentPosInBatch].toStringUtf8());
        +
        + case TEXT:
        + if (((SliceVector) vector).vector[currentPosInBatch] == null)
        + return NullDatum.get();
        +
        + return DatumFactory.createText(((SliceVector) vector).vector[currentPosInBatch].getBytes());
        +
        + case BLOB:
        + if (((SliceVector) vector).vector[currentPosInBatch] == null)
        + return NullDatum.get();
        +
        + return DatumFactory.createBlob(((SliceVector) vector).vector[currentPosInBatch].getBytes());
        +
        + case TIMESTAMP:
        + if (((LongVector) vector).isNull[currentPosInBatch])
        + return NullDatum.get();
        +
        + return DatumFactory.createTimestamp(
        + DateTimeUtil.javaTimeToJulianTime(((LongVector) vector).vector[currentPosInBatch]));
        +
        + case DATE:
        + if (((LongVector) vector).isNull[currentPosInBatch])
        + return NullDatum.get();
        +
        + return DatumFactory.createDate(
        + (int) ((LongVector) vector).vector[currentPosInBatch] + DateTimeUtil.DAYS_FROM_JULIAN_TO_EPOCH);
        +
        + case INET4:
        + if (((LongVector) vector).isNull[currentPosInBatch])
        — End diff –

        ORC may not support the INET4 type.
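        As background for this point: ORC's type system has no inet type, so an INET4 value can only be carried through an integer column. A hedged sketch of how the four octets could round-trip through a long (an illustration of the idea, not Tajo's actual INET4 encoding):

```java
// Hypothetical sketch: packing an IPv4 address into a long so it fits an ORC
// integer column, and unpacking it on read. Tajo's real encoding may differ.
public class Inet4PackSketch {
    // Shift each octet into its byte position within the low 32 bits.
    static long pack(int a, int b, int c, int d) {
        return ((long) a << 24) | ((long) b << 16) | ((long) c << 8) | (long) d;
    }

    // Reverse: mask each byte back out and join with dots.
    static String unpack(long packed) {
        return ((packed >> 24) & 0xFF) + "." + ((packed >> 16) & 0xFF) + "."
             + ((packed >> 8) & 0xFF) + "." + (packed & 0xFF);
    }

    public static void main(String[] args) {
        long v = pack(192, 168, 0, 1);
        System.out.println(unpack(v)); // 192.168.0.1
    }
}
```

        Whatever the concrete encoding, the scanner's LongVector branch for INET4 only makes sense if the writer agreed on the same integer packing.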

        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/579#discussion_r34989027

        — Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/orc/ORCScanner.java —
        @@ -0,0 +1,328 @@
        +/**
        + * Licensed to the Apache Software Foundation (ASF) under one
        + * or more contributor license agreements. See the NOTICE file
        + * distributed with this work for additional information
        + * regarding copyright ownership. The ASF licenses this file
        + * to you under the Apache License, Version 2.0 (the
        + * "License"); you may not use this file except in compliance
        + * with the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.tajo.storage.orc;
        +
        +import com.google.protobuf.InvalidProtocolBufferException;
        +import org.apache.commons.logging.Log;
        +import org.apache.commons.logging.LogFactory;
        +import org.apache.hadoop.conf.Configuration;
        +import org.apache.hadoop.fs.FSDataInputStream;
        +import org.apache.hadoop.fs.FileSystem;
        +import org.apache.hadoop.fs.Path;
        +import org.apache.tajo.catalog.Schema;
        +import org.apache.tajo.catalog.TableMeta;
        +import org.apache.tajo.common.TajoDataTypes;
        +import org.apache.tajo.conf.TajoConf;
        +import org.apache.tajo.datum.*;
        +import org.apache.tajo.exception.UnsupportedException;
        +import org.apache.tajo.plan.expr.EvalNode;
        +import org.apache.tajo.storage.FileScanner;
        +import org.apache.tajo.storage.StorageConstants;
        +import org.apache.tajo.storage.Tuple;
        +import org.apache.tajo.storage.VTuple;
        +import org.apache.tajo.storage.fragment.Fragment;
        +import com.facebook.presto.orc.*;
        +import com.facebook.presto.orc.metadata.OrcMetadataReader;
        +import org.apache.tajo.storage.thirdparty.orc.HdfsOrcDataSource;
        +import org.apache.tajo.util.datetime.DateTimeUtil;
        +import org.joda.time.DateTimeZone;
        +
        +import java.io.IOException;
        +import java.util.HashSet;
        +import java.util.Set;
        +
        +/**
        + * OrcScanner for reading ORC files
        + */
        +public class ORCScanner extends FileScanner {
        + private static final Log LOG = LogFactory.getLog(ORCScanner.class);
        + private OrcRecordReader recordReader;
        + private Vector [] vectors;
        + private int currentPosInBatch = 0;
        + private int batchSize = 0;
        +
        + public ORCScanner(Configuration conf, final Schema schema, final TableMeta meta, final Fragment fragment) {
        + super(conf, schema, meta, fragment);
        + }
        +
        + private Vector createOrcVector(TajoDataTypes.DataType type) {
        + switch (type.getType()) {
        + case INT1: case INT2: case INT4: case INT8:
        + case UINT1: case UINT2: case UINT4: case UINT8:
        + case INET4:
        + case TIMESTAMP:
        + case DATE:
        + return new LongVector();
        +
        + case FLOAT4:
        + case FLOAT8:
        + return new DoubleVector();
        +
        + case BOOLEAN:
        + case NULL_TYPE:
        + return new BooleanVector();
        +
        + case BLOB:
        + case TEXT:
        + case CHAR:
        + case PROTOBUF:
        + return new SliceVector();
        +
        + default:
        + throw new UnsupportedException("This data type is not supported currently: "+type.toString());
        + }
        + }
        +
        + private FileSystem fs;
        + private FSDataInputStream fis;
        +
        + private static class ColumnInfo {
        + TajoDataTypes.DataType type;
        + int id;
        + }
        +
        + /**
        + * Temporary array for caching column info
        + */
        + private ColumnInfo [] targetColInfo;
        +
        + @Override
        + public void init() throws IOException {
        + OrcReader orcReader;
        +
        + if (targets == null) {
        + targets = schema.toArray();
        + }
        +
        + super.init();
        +
        + Path path = fragment.getPath();
        +
        + if(fs == null) {
        + fs = FileScanner.getFileSystem((TajoConf)conf, path);
        + }
        +
        + if(fis == null) {
        + fis = fs.open(path);
        + }
        +
        + OrcDataSource orcDataSource = new HdfsOrcDataSource(
        + this.fragment.getPath().toString(),
        + fis,
        + fs.getFileStatus(path).getLen(),
        + Integer.parseInt(meta.getOption(StorageConstants.ORC_MAX_MERGE_DISTANCE,
        + StorageConstants.DEFAULT_ORC_MAX_MERGE_DISTANCE)));
        +
        + targetColInfo = new ColumnInfo[targets.length];
        + for (int i=0; i<targets.length; i++) {
        + ColumnInfo cinfo = new ColumnInfo();
        + cinfo.type = targets[i].getDataType();
        + cinfo.id = schema.getColumnId(targets[i].getQualifiedName());
        + targetColInfo[i] = cinfo;
        + }
        +
        + // creating vectors for buffering
        + vectors = new Vector[targetColInfo.length];
        + for (int i=0; i<targetColInfo.length; i++) {
        + vectors[i] = createOrcVector(targetColInfo[i].type);
        + }
        +
        + Set<Integer> columnSet = new HashSet<Integer>();
        + for (ColumnInfo colInfo: targetColInfo) {
        + columnSet.add(colInfo.id);
        + }
        +
        + orcReader = new OrcReader(orcDataSource, new OrcMetadataReader());
        +
        + // TODO: make OrcPredicate useful
        + // TODO: TimeZone should be from conf
        + // TODO: it might be splittable
        + recordReader = orcReader.createRecordReader(columnSet, OrcPredicate.TRUE,
        + fragment.getStartKey(), fragment.getLength(), DateTimeZone.getDefault());
        +
        + LOG.debug("file fragment { path: " + fragment.getPath() +
        + ", start offset: " + fragment.getStartKey() +
        + ", length: " + fragment.getLength() + "}");
        +
        + getNextBatch();
        + }
        +
        + @Override
        + public Tuple next() throws IOException {
        + if (currentPosInBatch == batchSize) {
        + getNextBatch();
        +
        + // EOF
        + if (batchSize == -1) {
        + return null;
        + }
        + }
        +
        + Tuple tuple = new VTuple(targets.length);
        +
        + for (int i=0; i<targetColInfo.length; i++) {
        + tuple.put(i, createValueDatum(vectors[i], targetColInfo[i].type));
        + }
        +
        + currentPosInBatch++;
        +
        + return tuple;
        + }
        +
        + // TODO: support more types
        + private Datum createValueDatum(Vector vector, TajoDataTypes.DataType type) {
        + switch (type.getType()) {
        + case INT1:
        + case UINT1:
        + case INT2:
        + case UINT2:
        + if (((LongVector) vector).isNull[currentPosInBatch])
        + return NullDatum.get();
        +
        + return DatumFactory.createInt2((short) ((LongVector) vector).vector[currentPosInBatch]);
        +
        + case INT4:
        + case UINT4:
        + if (((LongVector) vector).isNull[currentPosInBatch])
        + return NullDatum.get();
        +
        + return DatumFactory.createInt4((int) ((LongVector) vector).vector[currentPosInBatch]);
        +
        + case INT8:
        + case UINT8:
        + if (((LongVector) vector).isNull[currentPosInBatch])
        + return NullDatum.get();
        +
        + return DatumFactory.createInt8(((LongVector) vector).vector[currentPosInBatch]);
        +
        + case FLOAT4:
        + if (((DoubleVector) vector).isNull[currentPosInBatch])
        + return NullDatum.get();
        +
        + return DatumFactory.createFloat4((float) ((DoubleVector) vector).vector[currentPosInBatch]);
        +
        + case FLOAT8:
        + if (((DoubleVector) vector).isNull[currentPosInBatch])
        + return NullDatum.get();
        +
        + return DatumFactory.createFloat8(((DoubleVector) vector).vector[currentPosInBatch]);
        +
        + case BOOLEAN:
        + if (((BooleanVector) vector).isNull[currentPosInBatch])
        + return NullDatum.get();
        +
        + return ((BooleanVector) vector).vector[currentPosInBatch] ? BooleanDatum.TRUE : BooleanDatum.FALSE;
        +
        + case CHAR:
        + if (((SliceVector) vector).vector[currentPosInBatch] == null)
        + return NullDatum.get();
        +
        + return DatumFactory.createChar(((SliceVector) vector).vector[currentPosInBatch].toStringUtf8());
        +
        + case TEXT:
        + if (((SliceVector) vector).vector[currentPosInBatch] == null)
        + return NullDatum.get();
        +
        + return DatumFactory.createText(((SliceVector) vector).vector[currentPosInBatch].getBytes());
        +
        + case BLOB:
        + if (((SliceVector) vector).vector[currentPosInBatch] == null)
        + return NullDatum.get();
        +
        + return DatumFactory.createBlob(((SliceVector) vector).vector[currentPosInBatch].getBytes());
        +
        + case TIMESTAMP:
        + if (((LongVector) vector).isNull[currentPosInBatch])
        — End diff –

        You need to check how ORC stores timestamp values.
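        The reviewer's concern can be illustrated with a small, self-contained sketch. This is not code from the patch: it assumes the ORC reader hands back timestamps as milliseconds since the Unix epoch and that Tajo's internal representation (cf. `DateTimeUtil.javaTimeToJulianTime`) is microseconds since 2000-01-01; both assumptions must be verified against the actual ORC storage layout before relying on the conversion.

        ```java
        // Hedged sketch, NOT the patch code: converts a Unix-epoch millisecond
        // timestamp (what the ORC LongVector is assumed to contain) into a
        // microseconds-since-2000-01-01 value (what Tajo is assumed to store).
        public class TimestampConversionSketch {
            // 10957 days between 1970-01-01 and 2000-01-01 (assumed epoch shift).
            static final long SECONDS_BETWEEN_EPOCHS = 10957L * 24 * 60 * 60;

            static long javaMillisToJulianMicros(long javaMillis) {
                return (javaMillis / 1000 - SECONDS_BETWEEN_EPOCHS) * 1_000_000L
                        + (javaMillis % 1000) * 1000L;
            }

            public static void main(String[] args) {
                // 2000-01-01T00:00:00Z is 946684800000 ms after the Unix epoch,
                // so it should map to 0 in the shifted representation.
                System.out.println(javaMillisToJulianMicros(946684800000L));
            }
        }
        ```

        If ORC instead stores seconds plus a separate nanosecond field, the arithmetic changes accordingly; that is exactly the check the reviewer is asking for.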

        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/579#discussion_r34988978

        — Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/orc/ORCScanner.java —
        @@ -0,0 +1,328 @@
        +/**
        + * Licensed to the Apache Software Foundation (ASF) under one
        + * or more contributor license agreements. See the NOTICE file
        + * distributed with this work for additional information
        + * regarding copyright ownership. The ASF licenses this file
        + * to you under the Apache License, Version 2.0 (the
        + * "License"); you may not use this file except in compliance
        + * with the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.tajo.storage.orc;
        +
        +import com.google.protobuf.InvalidProtocolBufferException;
        +import org.apache.commons.logging.Log;
        +import org.apache.commons.logging.LogFactory;
        +import org.apache.hadoop.conf.Configuration;
        +import org.apache.hadoop.fs.FSDataInputStream;
        +import org.apache.hadoop.fs.FileSystem;
        +import org.apache.hadoop.fs.Path;
        +import org.apache.tajo.catalog.Schema;
        +import org.apache.tajo.catalog.TableMeta;
        +import org.apache.tajo.common.TajoDataTypes;
        +import org.apache.tajo.conf.TajoConf;
        +import org.apache.tajo.datum.*;
        +import org.apache.tajo.exception.UnsupportedException;
        +import org.apache.tajo.plan.expr.EvalNode;
        +import org.apache.tajo.storage.FileScanner;
        +import org.apache.tajo.storage.StorageConstants;
        +import org.apache.tajo.storage.Tuple;
        +import org.apache.tajo.storage.VTuple;
        +import org.apache.tajo.storage.fragment.Fragment;
        +import com.facebook.presto.orc.*;
        +import com.facebook.presto.orc.metadata.OrcMetadataReader;
        +import org.apache.tajo.storage.thirdparty.orc.HdfsOrcDataSource;
        +import org.apache.tajo.util.datetime.DateTimeUtil;
        +import org.joda.time.DateTimeZone;
        +
        +import java.io.IOException;
        +import java.util.HashSet;
        +import java.util.Set;
        +
        +/**
        + * OrcScanner for reading ORC files
        + */
        +public class ORCScanner extends FileScanner {
        + private static final Log LOG = LogFactory.getLog(ORCScanner.class);
        + private OrcRecordReader recordReader;
        + private Vector [] vectors;
        + private int currentPosInBatch = 0;
        + private int batchSize = 0;
        +
        + public ORCScanner(Configuration conf, final Schema schema, final TableMeta meta, final Fragment fragment) {
        + super(conf, schema, meta, fragment);
        + }
        +
        + private Vector createOrcVector(TajoDataTypes.DataType type) {
        + switch (type.getType()) {
        + case INT1: case INT2: case INT4: case INT8:
        + case UINT1: case UINT2: case UINT4: case UINT8:
        + case INET4:
        + case TIMESTAMP:
        + case DATE:
        + return new LongVector();
        +
        + case FLOAT4:
        + case FLOAT8:
        + return new DoubleVector();
        +
        + case BOOLEAN:
        + case NULL_TYPE:
        + return new BooleanVector();
        +
        + case BLOB:
        + case TEXT:
        + case CHAR:
        + case PROTOBUF:
        + return new SliceVector();
        +
        + default:
        + throw new UnsupportedException("This data type is not supported currently: "+type.toString());
        + }
        + }
        +
        + private FileSystem fs;
        + private FSDataInputStream fis;
        +
        + private static class ColumnInfo {
        + TajoDataTypes.DataType type;
        + int id;
        + }
        +
        + /**
        + * Temporary array for caching column info
        + */
        + private ColumnInfo [] targetColInfo;
        +
        + @Override
        + public void init() throws IOException {
        + OrcReader orcReader;
        +
        + if (targets == null) {
        + targets = schema.toArray();
        + }
        +
        + super.init();
        +
        + Path path = fragment.getPath();
        +
        + if(fs == null) {
        + fs = FileScanner.getFileSystem((TajoConf)conf, path);
        + }
        +
        + if(fis == null) {
        + fis = fs.open(path);
        + }
        +
        + OrcDataSource orcDataSource = new HdfsOrcDataSource(
        + this.fragment.getPath().toString(),
        + fis,
        + fs.getFileStatus(path).getLen(),
        + Integer.parseInt(meta.getOption(StorageConstants.ORC_MAX_MERGE_DISTANCE,
        + StorageConstants.DEFAULT_ORC_MAX_MERGE_DISTANCE)));
        +
        + targetColInfo = new ColumnInfo[targets.length];
        + for (int i=0; i<targets.length; i++) {
        + ColumnInfo cinfo = new ColumnInfo();
        + cinfo.type = targets[i].getDataType();
        + cinfo.id = schema.getColumnId(targets[i].getQualifiedName());
        + targetColInfo[i] = cinfo;
        + }
        +
        + // creating vectors for buffering
        + vectors = new Vector[targetColInfo.length];
        + for (int i=0; i<targetColInfo.length; i++) {
        + vectors[i] = createOrcVector(targetColInfo[i].type);
        + }
        +
        + Set<Integer> columnSet = new HashSet<Integer>();
        + for (ColumnInfo colInfo: targetColInfo) {
        + columnSet.add(colInfo.id);
        + }
        +
        + orcReader = new OrcReader(orcDataSource, new OrcMetadataReader());
        +
        + // TODO: make OrcPredicate useful
        + // TODO: TimeZone should be from conf
        + // TODO: it might be splittable
        + recordReader = orcReader.createRecordReader(columnSet, OrcPredicate.TRUE,
        + fragment.getStartKey(), fragment.getLength(), DateTimeZone.getDefault());
        +
        + LOG.debug("file fragment { path: " + fragment.getPath() +
        + ", start offset: " + fragment.getStartKey() +
        + ", length: " + fragment.getLength() + "}");
        +
        + getNextBatch();
        + }
        +
        + @Override
        + public Tuple next() throws IOException {
        + if (currentPosInBatch == batchSize) {
        + getNextBatch();
        +
        + // EOF
        + if (batchSize == -1) {
        + return null;
        + }
        + }
        +
        + Tuple tuple = new VTuple(targets.length);
        +
        + for (int i=0; i<targetColInfo.length; i++) {
        + tuple.put(i, createValueDatum(vectors[i], targetColInfo[i].type));
        + }
        +
        + currentPosInBatch++;
        +
        + return tuple;
        + }
        +
        + // TODO: support more types
        + private Datum createValueDatum(Vector vector, TajoDataTypes.DataType type) {
        + switch (type.getType()) {
        + case INT1:
        + case UINT1:
        — End diff –

        As I mentioned above, unsigned types can be omitted because they are not supported.
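        The suggested cleanup can be sketched as follows. The enum and the string return values are stand-ins of our own for Tajo's `TajoDataTypes.Type` and the Presto vector classes, which aren't reproduced here; the point is simply the type-to-vector mapping with the unsupported `UINT*` branches dropped.

        ```java
        // Hypothetical stand-in for TajoDataTypes.Type (UINT* omitted, as suggested).
        enum TajoType { INT1, INT2, INT4, INT8, INET4, TIMESTAMP, DATE,
                        FLOAT4, FLOAT8, BOOLEAN, NULL_TYPE, BLOB, TEXT, CHAR, PROTOBUF }

        public class VectorMappingSketch {
            // Mirrors the shape of createOrcVector, returning vector class names
            // instead of instantiating the Presto ORC vector types.
            static String vectorFor(TajoType t) {
                switch (t) {
                    case INT1: case INT2: case INT4: case INT8:
                    case INET4: case TIMESTAMP: case DATE:
                        return "LongVector";
                    case FLOAT4: case FLOAT8:
                        return "DoubleVector";
                    case BOOLEAN: case NULL_TYPE:
                        return "BooleanVector";
                    case BLOB: case TEXT: case CHAR: case PROTOBUF:
                        return "SliceVector";
                    default:
                        throw new UnsupportedOperationException("Unsupported type: " + t);
                }
            }

            public static void main(String[] args) {
                System.out.println(vectorFor(TajoType.INT4));
            }
        }
        ```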

        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/579#discussion_r34988886

        — Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/orc/ORCScanner.java —
        @@ -0,0 +1,328 @@
        +/**
        + * Licensed to the Apache Software Foundation (ASF) under one
        + * or more contributor license agreements. See the NOTICE file
        + * distributed with this work for additional information
        + * regarding copyright ownership. The ASF licenses this file
        + * to you under the Apache License, Version 2.0 (the
        + * "License"); you may not use this file except in compliance
        + * with the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.tajo.storage.orc;
        +
        +import com.google.protobuf.InvalidProtocolBufferException;
        +import org.apache.commons.logging.Log;
        +import org.apache.commons.logging.LogFactory;
        +import org.apache.hadoop.conf.Configuration;
        +import org.apache.hadoop.fs.FSDataInputStream;
        +import org.apache.hadoop.fs.FileSystem;
        +import org.apache.hadoop.fs.Path;
        +import org.apache.tajo.catalog.Schema;
        +import org.apache.tajo.catalog.TableMeta;
        +import org.apache.tajo.common.TajoDataTypes;
        +import org.apache.tajo.conf.TajoConf;
        +import org.apache.tajo.datum.*;
        +import org.apache.tajo.exception.UnsupportedException;
        +import org.apache.tajo.plan.expr.EvalNode;
        +import org.apache.tajo.storage.FileScanner;
        +import org.apache.tajo.storage.StorageConstants;
        +import org.apache.tajo.storage.Tuple;
        +import org.apache.tajo.storage.VTuple;
        +import org.apache.tajo.storage.fragment.Fragment;
        +import com.facebook.presto.orc.*;
        +import com.facebook.presto.orc.metadata.OrcMetadataReader;
        +import org.apache.tajo.storage.thirdparty.orc.HdfsOrcDataSource;
        +import org.apache.tajo.util.datetime.DateTimeUtil;
        +import org.joda.time.DateTimeZone;
        +
        +import java.io.IOException;
        +import java.util.HashSet;
        +import java.util.Set;
        +
        +/**
        + * OrcScanner for reading ORC files
        + */
        +public class ORCScanner extends FileScanner {
        + private static final Log LOG = LogFactory.getLog(ORCScanner.class);
        + private OrcRecordReader recordReader;
        + private Vector [] vectors;
        + private int currentPosInBatch = 0;
        + private int batchSize = 0;
        +
        + public ORCScanner(Configuration conf, final Schema schema, final TableMeta meta, final Fragment fragment)

        { + super(conf, schema, meta, fragment); + }

        +
        + private Vector createOrcVector(TajoDataTypes.DataType type) {
        + switch (type.getType()) {
        + case INT1: case INT2: case INT4: case INT8:
        + case UINT1: case UINT2: case UINT4: case UINT8:
        + case INET4:
        + case TIMESTAMP:
        — End diff –

        Does ORC use long type for timestamp, date, and INET4?

As far as I know, Tajo represents INET4 or date values as an integer value and timestamp as a long value. So, if you use LongVector for those types, it won't be compatible.
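The int/long distinction the reviewer raises can be sketched as follows. This is a minimal illustration only, not Tajo's or Presto's actual code: `TajoType` and `VectorKind` here are hypothetical stand-ins for Tajo's data types and Presto's vector classes.

```java
// Illustrative sketch (hypothetical types, not the real Tajo/Presto API):
// int-backed Tajo types (INET4, DATE) and long-backed types (TIMESTAMP, INT8)
// should map to different vector kinds, otherwise the widths won't line up.
public class VectorMapping {
    enum TajoType { INT4, INT8, INET4, DATE, TIMESTAMP }

    // Stand-ins for the reader's vector classes.
    enum VectorKind { INT_VECTOR, LONG_VECTOR }

    static VectorKind vectorFor(TajoType type) {
        switch (type) {
            case INT4:
            case INET4:     // Tajo stores INET4 as a 32-bit int
            case DATE:      // Tajo stores DATE as a 32-bit int
                return VectorKind.INT_VECTOR;
            case INT8:
            case TIMESTAMP: // Tajo stores TIMESTAMP as a 64-bit long
                return VectorKind.LONG_VECTOR;
            default:
                throw new IllegalArgumentException("unsupported type: " + type);
        }
    }

    public static void main(String[] args) {
        System.out.println(vectorFor(TajoType.INET4));     // INT_VECTOR
        System.out.println(vectorFor(TajoType.TIMESTAMP)); // LONG_VECTOR
    }
}
```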

        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/579#discussion_r34988715

        — Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/orc/ORCScanner.java —
        @@ -0,0 +1,328 @@
        +/**
        + * Licensed to the Apache Software Foundation (ASF) under one
        + * or more contributor license agreements. See the NOTICE file
        + * distributed with this work for additional information
        + * regarding copyright ownership. The ASF licenses this file
        + * to you under the Apache License, Version 2.0 (the
        + * "License"); you may not use this file except in compliance
        + * with the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.tajo.storage.orc;
        +
        +import com.google.protobuf.InvalidProtocolBufferException;
        +import org.apache.commons.logging.Log;
        +import org.apache.commons.logging.LogFactory;
        +import org.apache.hadoop.conf.Configuration;
        +import org.apache.hadoop.fs.FSDataInputStream;
        +import org.apache.hadoop.fs.FileSystem;
        +import org.apache.hadoop.fs.Path;
        +import org.apache.tajo.catalog.Schema;
        +import org.apache.tajo.catalog.TableMeta;
        +import org.apache.tajo.common.TajoDataTypes;
        +import org.apache.tajo.conf.TajoConf;
        +import org.apache.tajo.datum.*;
        +import org.apache.tajo.exception.UnsupportedException;
        +import org.apache.tajo.plan.expr.EvalNode;
        +import org.apache.tajo.storage.FileScanner;
        +import org.apache.tajo.storage.StorageConstants;
        +import org.apache.tajo.storage.Tuple;
        +import org.apache.tajo.storage.VTuple;
        +import org.apache.tajo.storage.fragment.Fragment;
        +import com.facebook.presto.orc.*;
        +import com.facebook.presto.orc.metadata.OrcMetadataReader;
        +import org.apache.tajo.storage.thirdparty.orc.HdfsOrcDataSource;
        +import org.apache.tajo.util.datetime.DateTimeUtil;
        +import org.joda.time.DateTimeZone;
        +
        +import java.io.IOException;
        +import java.util.HashSet;
        +import java.util.Set;
        +
        +/**
        + * OrcScanner for reading ORC files
        + */
        +public class ORCScanner extends FileScanner {
        + private static final Log LOG = LogFactory.getLog(ORCScanner.class);
        + private OrcRecordReader recordReader;
        + private Vector [] vectors;
        + private int currentPosInBatch = 0;
        + private int batchSize = 0;
        +
        + public ORCScanner(Configuration conf, final Schema schema, final TableMeta meta, final Fragment fragment)

        { + super(conf, schema, meta, fragment); + }

        +
        + private Vector createOrcVector(TajoDataTypes.DataType type) {
        + switch (type.getType()) {
        + case INT1: case INT2: case INT4: case INT8:
        + case UINT1: case UINT2: case UINT4: case UINT8:
        — End diff –

        Unsigned types are not used in the current implementation. So, you can omit them.

        githubbot ASF GitHub Bot added a comment -

        Github user eminency commented on the pull request:

        https://github.com/apache/tajo/pull/579#issuecomment-120848120

        Yes, please.

        githubbot ASF GitHub Bot added a comment -

        Github user blrunner commented on the pull request:

        https://github.com/apache/tajo/pull/579#issuecomment-120846082

        Hi @eminency

        Thank you for your response.
Could I start reviewing the current patch?

        githubbot ASF GitHub Bot added a comment -

        Github user eminency commented on the pull request:

        https://github.com/apache/tajo/pull/579#issuecomment-120845348

        Hi, @blrunner.

Regarding TestStorages, it requires an Appender class too, so that may be added in the ORCAppender patch later.

Please review the current state.

        githubbot ASF GitHub Bot added a comment -

        Github user eminency commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/579#discussion_r34428789

        — Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/orc/OrcScanner.java —
        @@ -0,0 +1,257 @@
        +/**
        + * Licensed to the Apache Software Foundation (ASF) under one
        + * or more contributor license agreements. See the NOTICE file
        + * distributed with this work for additional information
        + * regarding copyright ownership. The ASF licenses this file
        + * to you under the Apache License, Version 2.0 (the
        + * "License"); you may not use this file except in compliance
        + * with the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.tajo.storage.orc;
        +
        +import org.apache.hadoop.conf.Configuration;
        +import org.apache.hadoop.fs.FSDataInputStream;
        +import org.apache.hadoop.fs.FileSystem;
        +import org.apache.hadoop.fs.Path;
        +import org.apache.tajo.catalog.Schema;
        +import org.apache.tajo.catalog.TableMeta;
        +import org.apache.tajo.common.TajoDataTypes;
        +import org.apache.tajo.conf.TajoConf;
        +import org.apache.tajo.datum.*;
        +import org.apache.tajo.exception.UnsupportedException;
        +import org.apache.tajo.plan.expr.EvalNode;
        +import org.apache.tajo.storage.FileScanner;
        +import org.apache.tajo.storage.Tuple;
        +import org.apache.tajo.storage.VTuple;
        +import org.apache.tajo.storage.fragment.Fragment;
        +import com.facebook.presto.orc.*;
        +import com.facebook.presto.orc.metadata.ColumnStatistics;
        +import com.facebook.presto.orc.metadata.OrcMetadataReader;
        +import org.apache.tajo.storage.thirdparty.orc.HdfsOrcDataSource;
        +import org.apache.tajo.util.datetime.DateTimeUtil;
        +import org.joda.time.DateTimeZone;
        +
        +import java.io.IOException;
        +import java.util.HashSet;
        +import java.util.Map;
        +import java.util.Set;
        +
        +/**
        + * OrcScanner for reading ORC files
        + */
        +public class OrcScanner extends FileScanner {
        — End diff –

        Ok, I will do it.

        githubbot ASF GitHub Bot added a comment -

        Github user eminency commented on the pull request:

        https://github.com/apache/tajo/pull/579#issuecomment-120780389

        Hi, @blrunner

1. Presto-orc is well tested and stable, and it is a bit faster than Hive's reader. Secondly, Hive processes ORC in a row-oriented manner, while Presto processes it column-oriented. Because ORC is a columnar format, Presto's approach may be a better fit. Although Tajo currently processes tuples row by row, this could also help when Tajo moves to off-heap memory in the future.

2. Sure.
Anyway, *please hold the review*: I'm still modifying the code for the new storage manager.
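The row-versus-column distinction described in point 1 can be sketched as follows. This is an illustrative example only, not Tajo's actual implementation: the class, the `nextTuple` method, and the column arrays are hypothetical, showing how a row-at-a-time `next()` contract can be served from columnar batches in the way the scanner's `currentPosInBatch`/`batchSize` fields suggest.

```java
// Illustrative sketch: a columnar reader hands back whole column vectors per
// batch; a row-oriented scanner then assembles one tuple per call by walking
// a cursor across the vectors.
import java.util.Arrays;

public class ColumnarToRows {
    // One batch of two columns, as a column-oriented reader would return it.
    static final long[] idColumn = {1, 2, 3};
    static final String[] nameColumn = {"a", "b", "c"};

    static int currentPosInBatch = 0;
    static final int batchSize = idColumn.length;

    // Row-at-a-time view over the column vectors.
    static Object[] nextTuple() {
        if (currentPosInBatch >= batchSize) {
            return null; // batch exhausted; a real scanner would fetch the next batch
        }
        Object[] tuple = {idColumn[currentPosInBatch], nameColumn[currentPosInBatch]};
        currentPosInBatch++;
        return tuple;
    }

    public static void main(String[] args) {
        Object[] t;
        while ((t = nextTuple()) != null) {
            System.out.println(Arrays.toString(t));
        }
    }
}
```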

        githubbot ASF GitHub Bot added a comment -

        Github user blrunner commented on the pull request:

        https://github.com/apache/tajo/pull/579#issuecomment-120705385

1. Could you explain why you use presto-orc?
        2. You need to add unit test cases to TestStorages in tajo-storage-hdfs module.

        githubbot ASF GitHub Bot added a comment -

        Github user blrunner commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/579#discussion_r34420046

        — Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/orc/OrcScanner.java —
        @@ -0,0 +1,257 @@
        +/**
        + * Licensed to the Apache Software Foundation (ASF) under one
        + * or more contributor license agreements. See the NOTICE file
        + * distributed with this work for additional information
        + * regarding copyright ownership. The ASF licenses this file
        + * to you under the Apache License, Version 2.0 (the
        + * "License"); you may not use this file except in compliance
        + * with the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.tajo.storage.orc;
        +
        +import org.apache.hadoop.conf.Configuration;
        +import org.apache.hadoop.fs.FSDataInputStream;
        +import org.apache.hadoop.fs.FileSystem;
        +import org.apache.hadoop.fs.Path;
        +import org.apache.tajo.catalog.Schema;
        +import org.apache.tajo.catalog.TableMeta;
        +import org.apache.tajo.common.TajoDataTypes;
        +import org.apache.tajo.conf.TajoConf;
        +import org.apache.tajo.datum.*;
        +import org.apache.tajo.exception.UnsupportedException;
        +import org.apache.tajo.plan.expr.EvalNode;
        +import org.apache.tajo.storage.FileScanner;
        +import org.apache.tajo.storage.Tuple;
        +import org.apache.tajo.storage.VTuple;
        +import org.apache.tajo.storage.fragment.Fragment;
        +import com.facebook.presto.orc.*;
        +import com.facebook.presto.orc.metadata.ColumnStatistics;
        +import com.facebook.presto.orc.metadata.OrcMetadataReader;
        +import org.apache.tajo.storage.thirdparty.orc.HdfsOrcDataSource;
        +import org.apache.tajo.util.datetime.DateTimeUtil;
        +import org.joda.time.DateTimeZone;
        +
        +import java.io.IOException;
        +import java.util.HashSet;
        +import java.util.Map;
        +import java.util.Set;
        +
        +/**
        + * OrcScanner for reading ORC files
        + */
        +public class OrcScanner extends FileScanner {
        — End diff –

        How about renaming it ORCScanner? Currently, the scanner for RCFile is named RCScanner.

        githubbot ASF GitHub Bot added a comment -

        Github user blrunner commented on the pull request:

        https://github.com/apache/tajo/pull/579#issuecomment-114064384

        Hi @eminency

        Thanks for your contribution.
If you add the ORC type to HiveCatalogStore, the patch would be even better.

        tajoqa Tajo QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12734369/TAJO-1464.patch
        against master revision release-0.9.0-rc0-304-g5264156.

        -1 patch. The patch command could not apply the patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-TAJO-Build/796//console

        This message is automatically generated.

        githubbot ASF GitHub Bot added a comment -

        GitHub user eminency opened a pull request:

        https://github.com/apache/tajo/pull/579

        TAJO-1464: Add ORCFileScanner to read ORCFile table

This patch adds support for reading ORC files in Tajo.

The code under 'thirdparty' is taken from Presto, so you don't have to look into it deeply.

Adding support for more data types is left as future work.

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/eminency/tajo TAJO-1464

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/tajo/pull/579.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #579


        commit aa6651b74541e424eb6fc895f92c14838d8ca232
        Author: Jongyoung Park <eminency@gmail.com>
        Date: 2015-04-12T23:14:15Z

        initial ORC scanner

        commit 7b057d27b0b92a8282a6833e667cca8848f8691f
        Author: Jongyoung Park <eminency@gmail.com>
        Date: 2015-05-14T06:49:18Z

        ORC fundamental code importing from Presto

        commit fac7029139f0f72a6b3e8d040e71e4afc4027fbf
        Author: Jongyoung Park <eminency@gmail.com>
        Date: 2015-05-18T11:34:45Z

        Sources based JDK 1.7 are applied from Presto

        commit 015e40d10d86c5d3b2c37ac3531e32945aa55f14
        Author: Jongyoung Park <eminency@gmail.com>
        Date: 2015-05-20T03:04:11Z

        HdfsOrcDataSource constructor is changed to receive double instead of DataSize

        commit 4952195d4ea5dfcd2fa7c08c143477c3b446c13b
        Author: Jongyoung Park <eminency@gmail.com>
        Date: 2015-05-20T05:34:34Z

        Initial OrcScanner

        commit e0ee926644cb938f20332f03abedb5552af79dfe
        Author: Jongyoung Park <eminency@gmail.com>
        Date: 2015-05-20T06:05:52Z

        Close code error fixed

        commit 5706b108d76fb7b055e6fe601eb9f85e56da75fe
        Author: Jongyoung Park <eminency@gmail.com>
        Date: 2015-05-20T06:57:53Z

        Creating vectors missed

        commit 75b8a83ccbb639396e8df0da0ba7c06f160c3dc3
        Author: Jongyoung Park <eminency@gmail.com>
        Date: 2015-05-20T08:37:51Z

        Add comment

        commit 45641e04f708588fbc45bffd40b90c4d1b5266a8
        Author: Jongyoung Park <eminency@gmail.com>
        Date: 2015-05-21T07:21:02Z

        FileOrcDataSource constructor modified

        commit e7cd698f0cc9094c1ec2d1f3d7b7fb34102bf6b6
        Author: Jongyoung Park <eminency@gmail.com>
        Date: 2015-05-21T07:21:36Z

        Supporting timestamp

        commit e839228bfb49454d3a6d5b3b22550c14b96e42b9
        Author: Jongyoung Park <eminency@gmail.com>
        Date: 2015-05-21T07:36:53Z

        OrcScaner test added

        commit 28d7f1f54724d8f517b660e9833b91b241fca460
        Author: Jongyoung Park <eminency@gmail.com>
        Date: 2015-05-20T05:52:22Z

        Added orc row in storage-default.xml

        commit 59970024de3ad02ef2ca24945f48bb39d89872ae
        Author: Dongjoon Hyun <dongjoon@apache.org>
        Date: 2015-03-29T07:14:11Z

        TAJO-1463: Add ORCFile store type to create ORCFile table

        commit e01d45e6ac6b71dc9b11eaaaa02126f5f52ebd38
        Author: Jongyoung Park <eminency@gmail.com>
        Date: 2015-05-21T07:46:20Z

        compile error fixed

        commit ae613a704f6a216899144a73a4e09bdab9b3de77
        Author: Jongyoung Park <eminency@gmail.com>
        Date: 2015-05-21T08:10:12Z

        TimestampDatum comment fixed



          People

          • Assignee:
            eminency Jongyoung Park
            Reporter:
            dongjoon Dongjoon Hyun
          • Votes:
            0
            Watchers:
            4
