Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: encoding, Java, MapReduce
    • Labels:
      None
    1. ORC-1.diff
      14 kB
      Owen O'Malley
    2. ORC-1.patch
      1.68 MB
      Owen O'Malley

        Activity

        owen.omalley Owen O'Malley added a comment -

        We need to have Hive refactored correctly so that this can happen smoothly.

        githubbot ASF GitHub Bot added a comment -

        GitHub user omalley opened a pull request:

        https://github.com/apache/orc/pull/23

        ORC-1 Import of ORC code from Hive.

        This patch pulls the current Java code for the ORC reader and writer out of Hive. Under the java directory there are three modules:

        • storage-api - a temporary copy of hive's storage api until we release hive with the changes we need
        • core - the core vectorized reader and writer
        • mapreduce - an implementation of InputFormat and OutputFormat that uses core to read and write row by row

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/omalley/orc orc-1

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/orc/pull/23.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #23



        githubbot ASF GitHub Bot added a comment -

        GitHub user nahguam commented on the pull request:

        https://github.com/apache/orc/pull/23#issuecomment-214777817

        Hi, what's the best place for comments?

        The diff is too big to comment directly so I'll list a couple here:

        1. OrcRecordReader.next L94 - seems to be a dead-end?

        2. How is one to use OrcStruct as an end user? getFieldValue and setFieldValue are package private. Previously you'd access via the StructObjectInspector, but we seem to have no ObjectInspectors or StructFields.

        FileInputFormat.setInputPaths(conf, path);
        OrcInputFormat<OrcStruct> inputFormat = new OrcInputFormat<>();
        InputSplit[] splits = inputFormat.getSplits(conf, 1);
        RecordReader<NullWritable, OrcStruct> recordReader = inputFormat.getRecordReader(splits[0], conf, null);

        NullWritable key = recordReader.createKey();
        OrcStruct value = recordReader.createValue();

        while (recordReader.next(key, value)) {
          // How do I interrogate value for its fields' values?
        }

        recordReader.close();

        3. Perhaps for another ticket, but it would be nice to have a mechanism to access a struct's fields by name as well as by index.

        4. Is there any particular reason that the value generic type is V extends Writable instead of OrcStruct?

        githubbot ASF GitHub Bot added a comment -

        GitHub user omalley commented on the pull request:

        https://github.com/apache/orc/pull/23#issuecomment-214806455

        1. You're right that you can't actually pass null in to value. :)
        2. I made the accessors public.
        3. I added getters and setters that take the field name.
        4. The input format doesn't need to have a struct as the root type. Look at TestOrcOutputFormat.testLongRoot for an example.
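        To illustrate points 2 and 3, here is a minimal sketch of a struct value that exposes public accessors by index and by field name. `NamedStruct` and `NamedStructDemo` are hypothetical stand-ins written for this example; they are not the OrcStruct source from the pull request, and the real class may differ in signatures and value types.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a struct value; NOT the actual OrcStruct class.
class NamedStruct {
  private final List<String> fieldNames;
  private final Object[] values;

  public NamedStruct(String... fieldNames) {
    // Copy the names so later changes to the caller's array have no effect.
    this.fieldNames = Arrays.asList(fieldNames.clone());
    this.values = new Object[fieldNames.length];
  }

  public int getNumFields() {
    return values.length;
  }

  // Access by position.
  public Object getFieldValue(int index) {
    return values[index];
  }

  public void setFieldValue(int index, Object value) {
    values[index] = value;
  }

  // Access by name: resolve the name to a position, then reuse the array.
  public Object getFieldValue(String name) {
    return values[indexOf(name)];
  }

  public void setFieldValue(String name, Object value) {
    values[indexOf(name)] = value;
  }

  private int indexOf(String name) {
    int index = fieldNames.indexOf(name);
    if (index < 0) {
      throw new IllegalArgumentException(
          "Field " + name + " not found in " + fieldNames);
    }
    return index;
  }
}

class NamedStructDemo {
  public static void main(String[] args) {
    NamedStruct row = new NamedStruct("id", "name");
    row.setFieldValue(0, 42L);        // set by index
    row.setFieldValue("name", "orc"); // set by name
    for (int i = 0; i < row.getNumFields(); i++) {
      System.out.println(row.getFieldValue(i));
    }
  }
}
```

        In a read loop like the one in the earlier comment, the body would call one of these accessors on `value` for each row. Resolving names through a lookup keeps a single backing array, so both access styles always see the same data.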

        githubbot ASF GitHub Bot added a comment -

        GitHub user nahguam commented on the pull request:

        https://github.com/apache/orc/pull/23#issuecomment-215073080

        Excellent, Thanks!

        I've just been going over all the types and have a few more:

        1. `VectorizedRowBatch.toUTF8` appears to be unused
        2. `OrcMap` & `OrcList` - please could the constructors be public?
        3. `OrcList` - please could we have another constructor so we can pass in the initialCapacity as per `ArrayList`?
        4. `OrcUnion` - please could the class itself and the `set` method be public?

        I'm just looking at the Formats/RecordReader/RecordWriter now.
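        As a sketch of request 3 above, a list type that extends `ArrayList` can offer a second public constructor that forwards an initial capacity to the superclass. `SchemaList` and its string schema field are hypothetical names invented for this example; they are not the OrcList code under review.

```java
import java.util.ArrayList;

// Hypothetical stand-in for a schema-carrying list; NOT the actual OrcList class.
class SchemaList<E> extends ArrayList<E> {
  private static final long serialVersionUID = 1L;

  private final String schema;

  // Default constructor: lets ArrayList choose its own starting capacity.
  public SchemaList(String schema) {
    super();
    this.schema = schema;
  }

  // Extra constructor: forwards the caller's capacity hint, as ArrayList(int) does.
  public SchemaList(String schema, int initialCapacity) {
    super(initialCapacity);
    this.schema = schema;
  }

  public String getSchema() {
    return schema;
  }
}

class SchemaListDemo {
  public static void main(String[] args) {
    // Pre-size the list when the element count is known up front.
    SchemaList<Long> values = new SchemaList<>("array<bigint>", 1024);
    for (long i = 0; i < 3; i++) {
      values.add(i);
    }
    System.out.println(values.getSchema() + " " + values);
  }
}
```

        The capacity is only a hint for the backing array; the list still starts empty and grows as needed.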

        githubbot ASF GitHub Bot added a comment -

        GitHub user omalley commented on the pull request:

        https://github.com/apache/orc/pull/23#issuecomment-215112415

        1. The storage-api is a clone of Hive's until they release a version that has the bits that we need. So the method VectorizedRowBatch.toUTF8 is used by Hive.
        2. Done
        3. Done.
        4. Done.

        I also moved TestOrcOutputFormat to a different package so that it can only access the public API, which should prevent similar problems.

        Thanks for your reviews.

        owen.omalley Owen O'Malley added a comment -

        The patch file (ORC-1.patch) is the entire patch. The diff file (ORC-1.diff) compares the orc module of Hive's trunk + HIVE-11417 against this patch's java/core. The only difference is the addition of the TypeDescription parser.

        prasanth_j Prasanth Jayachandran added a comment -

        +1 for the patch.
        Assuming storage-api module will be removed from orc after next release of hive.

        owen.omalley Owen O'Malley added a comment -

        I just committed this. Thanks for the reviews, nahguam and prasanthj!


          People

          • Assignee:
            owen.omalley Owen O'Malley
            Reporter:
            owen.omalley Owen O'Malley
          • Votes:
            1
            Watchers:
            9
