Uploaded image for project: 'ORC'
  1. ORC
  2. ORC-58

Move code for reading rows from Reader to RowReader

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: C++
    • Labels:
      None

      Description

      Existing ReaderImpl constructor can throw an exception. This prohibits the creation of the reader instance and subsequent access to the schema information.
      For instance, an exception can be thrown if the selected column ids do not agree with the number of schema columns. The downstream application might still want the schema information for logging purposes.

      The scope of this Jira is to move the code to read rows into a new RowReader class.

        Issue Links

          Activity

          Hide
          mdeepak Deepak Majeti added a comment -

          Owen O'Malley Can you please let me know your thoughts on this ?
          Thanks!

          Show
          mdeepak Deepak Majeti added a comment - Owen O'Malley Can you please let me know your thoughts on this ? Thanks!
          Hide
          owen.omalley Owen O'Malley added a comment -

          The exceptions are going to be from cases that are hard to recover from. If we can't read the file or it is corrupted, it is better to have the reader fail with an error than to silently try to recover.

          Moving the relevant code to an open method that has to be called before any data is available doesn't help and just makes the API more error-prone.

          Show
          owen.omalley Owen O'Malley added a comment - The exceptions are going to be from cases that are hard to recover from. If we can't read the file or it is corrupted, it is better to have the reader fail with an error than to silently try to recover. Moving the relevant code to an open method that has to be called before any data is available doesn't help and just makes the API more error-prone.
          Hide
          owen.omalley Owen O'Malley added a comment -

          Talking to Stephen off line, I understand your use case better. I'd suggest that we separate the Reader into two classes:

          • Reader - that reads the file footer
          • RowReader - that reads the rows

          That will allow the user to read the file footer to determine the schema before they have to pick which columns to include.

          Thoughts?

          Show
          owen.omalley Owen O'Malley added a comment - Talking to Stephen off line, I understand your use case better. I'd suggest that we separate the Reader into two classes: Reader - that reads the file footer RowReader - that reads the rows That will allow the user to read the file footer to determine the schema before they have to pick which columns to include. Thoughts?
          Hide
          mdeepak Deepak Majeti added a comment -

          Creating a separate API for reading the file footer makes sense and provides a cleaner interface. I will modify this JIRA to this approach.

          Show
          mdeepak Deepak Majeti added a comment - Creating a separate API for reading the file footer makes sense and provides a cleaner interface. I will modify this JIRA to this approach.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user majetideepak opened a pull request:

          https://github.com/apache/orc/pull/41

          ORC-58: Move code for reading rows from Reader to RowReader

          This PR divides the current Reader class into Reader class and RowReader class
          Reader class reads the footer and metadata from an ORC file.
          RowReader class reads the rows from the file.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/majetideepak/orc RowReader

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/orc/pull/41.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #41


          commit ed02d7ab668c1a61f8260e0efddf06c617658753
          Author: Deepak Majeti <majeti.deepak@gmail.com>
          Date: 2016-06-05T21:51:45Z

          Split curent Reader into RowReader and Reader


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user majetideepak opened a pull request: https://github.com/apache/orc/pull/41 ORC-58 : Move code for reading rows from Reader to RowReader This PR divides the current Reader class into Reader class and RowReader class Reader class reads the footer and metadata from an ORC file. RowReader class reads the rows from the file. You can merge this pull request into a Git repository by running: $ git pull https://github.com/majetideepak/orc RowReader Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/41.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #41 commit ed02d7ab668c1a61f8260e0efddf06c617658753 Author: Deepak Majeti <majeti.deepak@gmail.com> Date: 2016-06-05T21:51:45Z Split curent Reader into RowReader and Reader
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user majetideepak commented on the issue:

          https://github.com/apache/orc/pull/41

          @omalley, please let me know your feedback. Thanks!

          Show
          githubbot ASF GitHub Bot added a comment - Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/41 @omalley, please let me know your feedback. Thanks!
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user swalkaus commented on a diff in the pull request:

          https://github.com/apache/orc/pull/41#discussion_r70967406

          — Diff: tools/src/FileContents.cc —
          @@ -27,8 +27,10 @@
          #include <string>

          void printContents(const char* filename, const orc::ReaderOptions opts) {

          • std::unique_ptr<orc::Reader> reader;
          • reader = orc::createReader(orc::readLocalFile(std::string(filename)), opts);
            + std::unique_ptr<orc::Reader> fileReader;
              • End diff –

          Pendantic: use reader and rowReader consistently (notice the camelCase).

          Show
          githubbot ASF GitHub Bot added a comment - Github user swalkaus commented on a diff in the pull request: https://github.com/apache/orc/pull/41#discussion_r70967406 — Diff: tools/src/FileContents.cc — @@ -27,8 +27,10 @@ #include <string> void printContents(const char* filename, const orc::ReaderOptions opts) { std::unique_ptr<orc::Reader> reader; reader = orc::createReader(orc::readLocalFile(std::string(filename)), opts); + std::unique_ptr<orc::Reader> fileReader; End diff – Pendantic: use reader and rowReader consistently (notice the camelCase).
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user swalkaus commented on a diff in the pull request:

          https://github.com/apache/orc/pull/41#discussion_r70968789

          — Diff: c++/include/orc/Reader.hh —
          @@ -645,7 +645,72 @@ namespace orc {
          };

          /**

          • * The interface for reading ORC files.
            + * The interface for reading rows in ORC files.
            + * This is an an abstract class that will subclassed as necessary.
            + */
            + class RowReader { + public: + virtual ~RowReader(); + /** + * Get the selected type of the rows in the file. The file's row type + * is projected down to just the selected columns. Thus, if the file's + * type is struct<col0:int,col1:double,col2:string> and the selected + * columns are "col0,col2" the selected type would be + * struct<col0:int,col2:string>. + * @return the root type + */ + virtual const Type& getSelectedType() const = 0; + + /** + * Get the selected columns of the file. + */ + virtual const std::vector<bool> getSelectedColumns() const = 0; + + /** + * Create a row batch for reading the selected columns of this file. + * @param size the number of rows to read + * @return a new ColumnVectorBatch to read into + */ + virtual ORC_UNIQUE_PTR<ColumnVectorBatch> createRowBatch(uint64_t size + ) const = 0; + + /** + * Read the next row batch from the current position. + * Caller must look at numElements in the row batch to determine how + * many rows were read. + * @param data the row batch to read into. + * @return true if a non-zero number of rows were read or false if the + * end of the file was reached. + */ + virtual bool next(ColumnVectorBatch& data) = 0; + + /** + * Get the row number of the first row in the previously read batch. + * @return the row number of the previous batch. + */ + virtual uint64_t getRowNumber() const = 0; + + /** + * Seek to a given row. + * @param rowNumber the next row the reader should return + */ + virtual void seekToRow(uint64_t rowNumber) = 0; + + /** + * Estimate an upper bound on heap memory allocation by the Reader + * based on the information in the file footer. + * The bound is less tight if only few columns are read or compression is + * used. + * @param stripeIx index of the stripe to be read (if not specified, + * all stripes are considered). + * @return upper bound on memory use + */ + virtual uint64_t getMemoryUse(int stripeIx=-1) = 0; + + }

            ;
            +
            + /**
            + * The interface for reading footer in ORC files.

              • End diff –

          "The interface for reading ORC file meta-data."

          Show
          githubbot ASF GitHub Bot added a comment - Github user swalkaus commented on a diff in the pull request: https://github.com/apache/orc/pull/41#discussion_r70968789 — Diff: c++/include/orc/Reader.hh — @@ -645,7 +645,72 @@ namespace orc { }; /** * The interface for reading ORC files. + * The interface for reading rows in ORC files. + * This is an an abstract class that will subclassed as necessary. + */ + class RowReader { + public: + virtual ~RowReader(); + /** + * Get the selected type of the rows in the file. The file's row type + * is projected down to just the selected columns. Thus, if the file's + * type is struct<col0:int,col1:double,col2:string> and the selected + * columns are "col0,col2" the selected type would be + * struct<col0:int,col2:string>. + * @return the root type + */ + virtual const Type& getSelectedType() const = 0; + + /** + * Get the selected columns of the file. + */ + virtual const std::vector<bool> getSelectedColumns() const = 0; + + /** + * Create a row batch for reading the selected columns of this file. + * @param size the number of rows to read + * @return a new ColumnVectorBatch to read into + */ + virtual ORC_UNIQUE_PTR<ColumnVectorBatch> createRowBatch(uint64_t size + ) const = 0; + + /** + * Read the next row batch from the current position. + * Caller must look at numElements in the row batch to determine how + * many rows were read. + * @param data the row batch to read into. + * @return true if a non-zero number of rows were read or false if the + * end of the file was reached. + */ + virtual bool next(ColumnVectorBatch& data) = 0; + + /** + * Get the row number of the first row in the previously read batch. + * @return the row number of the previous batch. + */ + virtual uint64_t getRowNumber() const = 0; + + /** + * Seek to a given row. + * @param rowNumber the next row the reader should return + */ + virtual void seekToRow(uint64_t rowNumber) = 0; + + /** + * Estimate an upper bound on heap memory allocation by the Reader + * based on the information in the file footer. + * The bound is less tight if only few columns are read or compression is + * used. + * @param stripeIx index of the stripe to be read (if not specified, + * all stripes are considered). + * @return upper bound on memory use + */ + virtual uint64_t getMemoryUse(int stripeIx=-1) = 0; + + } ; + + /** + * The interface for reading footer in ORC files. End diff – "The interface for reading ORC file meta-data."
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user swalkaus commented on the issue:

          https://github.com/apache/orc/pull/41

          I might rename "Reader" something like "FileTailReader" but, in general, the refactoring makes sense. The schema can be read and columns metadata known before column selection is decided.

          Show
          githubbot ASF GitHub Bot added a comment - Github user swalkaus commented on the issue: https://github.com/apache/orc/pull/41 I might rename "Reader" something like "FileTailReader" but, in general, the refactoring makes sense. The schema can be read and columns metadata known before column selection is decided.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user swalkaus commented on a diff in the pull request:

          https://github.com/apache/orc/pull/41#discussion_r71004296

          — Diff: c++/src/Reader.cc —
          @@ -1108,13 +1107,75 @@ namespace orc

          { // internal methods proto::StripeFooter getStripeFooter(const proto::StripeInformation& info); void startNextStripe(); - void checkOrcVersion(); void selectType(const Type& type); - void readMetadata() const; void updateSelected(const std::list<uint64_t>& fieldIds); void updateSelected(const std::list<std::string>& fieldNames); public: + /** + * Constructor that lets the user specify additional options. + * @param filereader the object to read from + * @param options options for reading + */ + RowReaderImpl(const ReaderImpl* filereader, + std::shared_ptr<ReaderOptions> options); + + uint64_t getMemoryUse(int stripeIx = -1) override; + + const std::vector<bool> getSelectedColumns() const override; + + const Type& getSelectedType() const override; + + std::unique_ptr<ColumnVectorBatch> createRowBatch(uint64_t size + ) const override; + + bool next(ColumnVectorBatch& data) override; + + const ReaderOptions& getReaderOptions() const; + + CompressionKind getCompression() const; + + uint64_t getCompressionSize() const; + + uint64_t getRowNumber() const override; + + void seekToRow(uint64_t rowNumber) override; + + MemoryPool* getMemoryPool() const ; + + }

          ;
          +
          + class ReaderImpl : public Reader {
          + friend class RowReaderImpl;
          — End diff –

          Avoid using friend.

          Show
          githubbot ASF GitHub Bot added a comment - Github user swalkaus commented on a diff in the pull request: https://github.com/apache/orc/pull/41#discussion_r71004296 — Diff: c++/src/Reader.cc — @@ -1108,13 +1107,75 @@ namespace orc { // internal methods proto::StripeFooter getStripeFooter(const proto::StripeInformation& info); void startNextStripe(); - void checkOrcVersion(); void selectType(const Type& type); - void readMetadata() const; void updateSelected(const std::list<uint64_t>& fieldIds); void updateSelected(const std::list<std::string>& fieldNames); public: + /** + * Constructor that lets the user specify additional options. + * @param filereader the object to read from + * @param options options for reading + */ + RowReaderImpl(const ReaderImpl* filereader, + std::shared_ptr<ReaderOptions> options); + + uint64_t getMemoryUse(int stripeIx = -1) override; + + const std::vector<bool> getSelectedColumns() const override; + + const Type& getSelectedType() const override; + + std::unique_ptr<ColumnVectorBatch> createRowBatch(uint64_t size + ) const override; + + bool next(ColumnVectorBatch& data) override; + + const ReaderOptions& getReaderOptions() const; + + CompressionKind getCompression() const; + + uint64_t getCompressionSize() const; + + uint64_t getRowNumber() const override; + + void seekToRow(uint64_t rowNumber) override; + + MemoryPool* getMemoryPool() const ; + + } ; + + class ReaderImpl : public Reader { + friend class RowReaderImpl; — End diff – Avoid using friend.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user swalkaus commented on a diff in the pull request:

          https://github.com/apache/orc/pull/41#discussion_r71004575

          — Diff: c++/src/Reader.cc —
          @@ -1108,13 +1107,75 @@ namespace orc {
          // internal methods
          proto::StripeFooter getStripeFooter(const proto::StripeInformation& info);
          void startNextStripe();

          • void checkOrcVersion();
            void selectType(const Type& type);
          • void readMetadata() const;
            void updateSelected(const std::list<uint64_t>& fieldIds);
            void updateSelected(const std::list<std::string>& fieldNames);

          public:
          + /**
          + * Constructor that lets the user specify additional options.
          + * @param filereader the object to read from
          + * @param options options for reading
          + */
          + RowReaderImpl(const ReaderImpl* filereader,
          — End diff –

          Should filereader be a "Reader" or "ReaderImpl"?

          Show
          githubbot ASF GitHub Bot added a comment - Github user swalkaus commented on a diff in the pull request: https://github.com/apache/orc/pull/41#discussion_r71004575 — Diff: c++/src/Reader.cc — @@ -1108,13 +1107,75 @@ namespace orc { // internal methods proto::StripeFooter getStripeFooter(const proto::StripeInformation& info); void startNextStripe(); void checkOrcVersion(); void selectType(const Type& type); void readMetadata() const; void updateSelected(const std::list<uint64_t>& fieldIds); void updateSelected(const std::list<std::string>& fieldNames); public: + /** + * Constructor that lets the user specify additional options. + * @param filereader the object to read from + * @param options options for reading + */ + RowReaderImpl(const ReaderImpl* filereader, — End diff – Should filereader be a "Reader" or "ReaderImpl"?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user majetideepak commented on a diff in the pull request:

          https://github.com/apache/orc/pull/41#discussion_r71029025

          — Diff: c++/src/Reader.cc —
          @@ -1108,13 +1107,75 @@ namespace orc {
          // internal methods
          proto::StripeFooter getStripeFooter(const proto::StripeInformation& info);
          void startNextStripe();

          • void checkOrcVersion();
            void selectType(const Type& type);
          • void readMetadata() const;
            void updateSelected(const std::list<uint64_t>& fieldIds);
            void updateSelected(const std::list<std::string>& fieldNames);

          public:
          + /**
          + * Constructor that lets the user specify additional options.
          + * @param filereader the object to read from
          + * @param options options for reading
          + */
          + RowReaderImpl(const ReaderImpl* filereader,
          — End diff –

          We will have to extend the `Reader` interface with a notion of `metadata`, `ReaderOptions` etc. to achieve this. I am not sure what the right design choice is here.

          Show
          githubbot ASF GitHub Bot added a comment - Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/41#discussion_r71029025 — Diff: c++/src/Reader.cc — @@ -1108,13 +1107,75 @@ namespace orc { // internal methods proto::StripeFooter getStripeFooter(const proto::StripeInformation& info); void startNextStripe(); void checkOrcVersion(); void selectType(const Type& type); void readMetadata() const; void updateSelected(const std::list<uint64_t>& fieldIds); void updateSelected(const std::list<std::string>& fieldNames); public: + /** + * Constructor that lets the user specify additional options. + * @param filereader the object to read from + * @param options options for reading + */ + RowReaderImpl(const ReaderImpl* filereader, — End diff – We will have to extend the `Reader` interface with a notion of `metadata`, `ReaderOptions` etc. to achieve this. I am not sure what the right design choice is here.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/41#discussion_r71377966

          — Diff: c++/include/orc/Reader.hh —
          @@ -645,7 +645,72 @@ namespace orc {
          };

          /**

          • * The interface for reading ORC files.
            + * The interface for reading rows in ORC files.
            + * This is an an abstract class that will subclassed as necessary.
            + */
            + class RowReader {
            + public:
            + virtual ~RowReader();
            + /**
            + * Get the selected type of the rows in the file. The file's row type
            + * is projected down to just the selected columns. Thus, if the file's
            + * type is struct<col0:int,col1:double,col2:string> and the selected
            + * columns are "col0,col2" the selected type would be
            + * struct<col0:int,col2:string>.
            + * @return the root type
            + */
            + virtual const Type& getSelectedType() const = 0;
            +
            + /**
            + * Get the selected columns of the file.
            + */
            + virtual const std::vector<bool> getSelectedColumns() const = 0;
            +
              • End diff –

          You have some trailing spaces. You can probably configure your IDE to automatically trim them.

          Show
          githubbot ASF GitHub Bot added a comment - Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/41#discussion_r71377966 — Diff: c++/include/orc/Reader.hh — @@ -645,7 +645,72 @@ namespace orc { }; /** * The interface for reading ORC files. + * The interface for reading rows in ORC files. + * This is an an abstract class that will subclassed as necessary. + */ + class RowReader { + public: + virtual ~RowReader(); + /** + * Get the selected type of the rows in the file. The file's row type + * is projected down to just the selected columns. Thus, if the file's + * type is struct<col0:int,col1:double,col2:string> and the selected + * columns are "col0,col2" the selected type would be + * struct<col0:int,col2:string>. + * @return the root type + */ + virtual const Type& getSelectedType() const = 0; + + /** + * Get the selected columns of the file. + */ + virtual const std::vector<bool> getSelectedColumns() const = 0; + End diff – You have some trailing spaces. You can probably configure your IDE to automatically trim them.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/41#discussion_r71378147

          — Diff: c++/include/orc/Reader.hh —
          @@ -645,7 +645,72 @@ namespace orc {
          };

          /**

          • * The interface for reading ORC files.
            + * The interface for reading rows in ORC files.
            + * This is an an abstract class that will subclassed as necessary.
            + */
            + class RowReader {
            + public:
            + virtual ~RowReader();
            + /**
            + * Get the selected type of the rows in the file. The file's row type
            + * is projected down to just the selected columns. Thus, if the file's
            + * type is struct<col0:int,col1:double,col2:string> and the selected
            + * columns are "col0,col2" the selected type would be
            + * struct<col0:int,col2:string>.
            + * @return the root type
            + */
            + virtual const Type& getSelectedType() const = 0;
            +
            + /**
            + * Get the selected columns of the file.
            + */
            + virtual const std::vector<bool> getSelectedColumns() const = 0;
            +
            + /**
            + * Create a row batch for reading the selected columns of this file.
            + * @param size the number of rows to read
            + * @return a new ColumnVectorBatch to read into
            + */
            + virtual ORC_UNIQUE_PTR<ColumnVectorBatch> createRowBatch(uint64_t size
            + ) const = 0;
            +
            + /**
            + * Read the next row batch from the current position.
            + * Caller must look at numElements in the row batch to determine how
            + * many rows were read.
            + * @param data the row batch to read into.
            + * @return true if a non-zero number of rows were read or false if the
            + * end of the file was reached.
            + */
            + virtual bool next(ColumnVectorBatch& data) = 0;
            +
            + /**
            + * Get the row number of the first row in the previously read batch.
            + * @return the row number of the previous batch.
            + */
            + virtual uint64_t getRowNumber() const = 0;
            +
            + /**
            + * Seek to a given row.
            + * @param rowNumber the next row the reader should return
            + */
            + virtual void seekToRow(uint64_t rowNumber) = 0;
            +
            + /**
            + * Estimate an upper bound on heap memory allocation by the Reader
            + * based on the information in the file footer.
            + * The bound is less tight if only few columns are read or compression is
            + * used.
            + * @param stripeIx index of the stripe to be read (if not specified,
            + * all stripes are considered).
            + * @return upper bound on memory use
            + */
            + virtual uint64_t getMemoryUse(int stripeIx=-1) = 0;
              • End diff –

          This should stay in the reader.

          Show
          githubbot ASF GitHub Bot added a comment - Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/41#discussion_r71378147 — Diff: c++/include/orc/Reader.hh — @@ -645,7 +645,72 @@ namespace orc { }; /** * The interface for reading ORC files. + * The interface for reading rows in ORC files. + * This is an an abstract class that will subclassed as necessary. + */ + class RowReader { + public: + virtual ~RowReader(); + /** + * Get the selected type of the rows in the file. The file's row type + * is projected down to just the selected columns. Thus, if the file's + * type is struct<col0:int,col1:double,col2:string> and the selected + * columns are "col0,col2" the selected type would be + * struct<col0:int,col2:string>. + * @return the root type + */ + virtual const Type& getSelectedType() const = 0; + + /** + * Get the selected columns of the file. + */ + virtual const std::vector<bool> getSelectedColumns() const = 0; + + /** + * Create a row batch for reading the selected columns of this file. + * @param size the number of rows to read + * @return a new ColumnVectorBatch to read into + */ + virtual ORC_UNIQUE_PTR<ColumnVectorBatch> createRowBatch(uint64_t size + ) const = 0; + + /** + * Read the next row batch from the current position. + * Caller must look at numElements in the row batch to determine how + * many rows were read. + * @param data the row batch to read into. + * @return true if a non-zero number of rows were read or false if the + * end of the file was reached. + */ + virtual bool next(ColumnVectorBatch& data) = 0; + + /** + * Get the row number of the first row in the previously read batch. + * @return the row number of the previous batch. + */ + virtual uint64_t getRowNumber() const = 0; + + /** + * Seek to a given row. + * @param rowNumber the next row the reader should return + */ + virtual void seekToRow(uint64_t rowNumber) = 0; + + /** + * Estimate an upper bound on heap memory allocation by the Reader + * based on the information in the file footer. + * The bound is less tight if only few columns are read or compression is + * used. + * @param stripeIx index of the stripe to be read (if not specified, + * all stripes are considered). + * @return upper bound on memory use + */ + virtual uint64_t getMemoryUse(int stripeIx=-1) = 0; End diff – This should stay in the reader.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/41#discussion_r71379015

          — Diff: c++/include/orc/Reader.hh —
          @@ -782,85 +847,42 @@ namespace orc {
          getColumnStatistics(uint32_t columnId) const = 0;

          /**

          • * Get the type of the rows in the file. The top level is typically a
          • * struct.
          • * @return the root type
          • */
          • virtual const Type& getType() const = 0;
            -
          • /**
          • * Get the selected type of the rows in the file. The file's row type
          • * is projected down to just the selected columns. Thus, if the file's
          • * type is struct<col0:int,col1:double,col2:string> and the selected
          • * columns are "col0,col2" the selected type would be
          • * struct<col0:int,col2:string>.
          • * @return the root type
          • */
          • virtual const Type& getSelectedType() const = 0;
            -
          • /**
          • * Get the selected columns of the file.
            + * check file has correct column statistics
            */
          • virtual const std::vector<bool> getSelectedColumns() const = 0;
            + virtual bool hasCorrectStatistics() const = 0;

          /**

          • * Create a row batch for reading the selected columns of this file.
          • * @param size the number of rows to read
          • * @return a new ColumnVectorBatch to read into
            + * Get the serialized file tail.
            + * Usefull if another reader of the same file wants to avoid re-reading
            + * the file tail. See ReaderOptions.setSerializedFileTail().
            + * @return a string of bytes with the file tail
            */
          • virtual ORC_UNIQUE_PTR<ColumnVectorBatch> createRowBatch(uint64_t size
          • ) const = 0;
            + virtual std::string getSerializedFileTail() const = 0;

          /**

          • * Read the next row batch from the current position.
          • * Caller must look at numElements in the row batch to determine how
          • * many rows were read.
          • * @param data the row batch to read into.
          • * @return true if a non-zero number of rows were read or false if the
          • * end of the file was reached.
            + * Get the type of the rows in the file. The top level is typically a
            + * struct.
            + * @return the root type
            */
          • virtual bool next(ColumnVectorBatch& data) = 0;
            + virtual const Type& getType() const = 0;

          /**

          • * Get the row number of the first row in the previously read batch.
          • * @return the row number of the previous batch.
            + * @return a RowReader to read the rows
            */
          • virtual uint64_t getRowNumber() const = 0;
            + virtual ORC_UNIQUE_PTR<RowReader> getRowReader() const = 0;

          /**

          • * Seek to a given row.
          • * @param rowNumber the next row the reader should return
            + * @param include update with new columns
            + * @return a RowReader to read the rows
            */
          • virtual void seekToRow(uint64_t rowNumber) = 0;
            + virtual ORC_UNIQUE_PTR<RowReader>
            + getRowReader(const std::list<uint64_t>& include) const = 0;
              • End diff –

          Let's go ahead and make a RowReaderOptions class and pass that in here. It is almost guaranteed that the include vector will not be the only option that we want to pass in. One of the options will be to specify the include vector.

          Show
          githubbot ASF GitHub Bot added a comment - Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/41#discussion_r71379015 — Diff: c++/include/orc/Reader.hh — @@ -782,85 +847,42 @@ namespace orc { getColumnStatistics(uint32_t columnId) const = 0; /** * Get the type of the rows in the file. The top level is typically a * struct. * @return the root type */ virtual const Type& getType() const = 0; - /** * Get the selected type of the rows in the file. The file's row type * is projected down to just the selected columns. Thus, if the file's * type is struct<col0:int,col1:double,col2:string> and the selected * columns are "col0,col2" the selected type would be * struct<col0:int,col2:string>. * @return the root type */ virtual const Type& getSelectedType() const = 0; - /** * Get the selected columns of the file. + * check file has correct column statistics */ virtual const std::vector<bool> getSelectedColumns() const = 0; + virtual bool hasCorrectStatistics() const = 0; /** * Create a row batch for reading the selected columns of this file. * @param size the number of rows to read * @return a new ColumnVectorBatch to read into + * Get the serialized file tail. + * Usefull if another reader of the same file wants to avoid re-reading + * the file tail. See ReaderOptions.setSerializedFileTail(). + * @return a string of bytes with the file tail */ virtual ORC_UNIQUE_PTR<ColumnVectorBatch> createRowBatch(uint64_t size ) const = 0; + virtual std::string getSerializedFileTail() const = 0; /** * Read the next row batch from the current position. * Caller must look at numElements in the row batch to determine how * many rows were read. * @param data the row batch to read into. * @return true if a non-zero number of rows were read or false if the * end of the file was reached. + * Get the type of the rows in the file. The top level is typically a + * struct. + * @return the root type */ virtual bool next(ColumnVectorBatch& data) = 0; + virtual const Type& getType() const = 0; /** * Get the row number of the first row in the previously read batch. * @return the row number of the previous batch. + * @return a RowReader to read the rows */ virtual uint64_t getRowNumber() const = 0; + virtual ORC_UNIQUE_PTR<RowReader> getRowReader() const = 0; /** * Seek to a given row. * @param rowNumber the next row the reader should return + * @param include update with new columns + * @return a RowReader to read the rows */ virtual void seekToRow(uint64_t rowNumber) = 0; + virtual ORC_UNIQUE_PTR<RowReader> + getRowReader(const std::list<uint64_t>& include) const = 0; End diff – Let's go ahead and make a RowReaderOptions class and pass that in here. It is almost guaranteed that the include vector will not be the only option that we want to pass in. One of the options will be to specify the include vector.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on the issue:

          https://github.com/apache/orc/pull/41

          The ReaderOptions needs to split in half too:

          • ReaderOptions
          • setTailLocation
          • setErrorStream
          • setSerializedFooter
          • RowReaderOptions
          • include
          • range
          • throwOnHive11DecimalOverflow
          • forcedScaleOnHive11Decimal
          Show
          githubbot ASF GitHub Bot added a comment - Github user omalley commented on the issue: https://github.com/apache/orc/pull/41 The ReaderOptions needs to split in half too: ReaderOptions setTailLocation setErrorStream setSerializedFooter RowReaderOptions include range throwOnHive11DecimalOverflow forcedScaleOnHive11Decimal
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/41#discussion_r71443174

          — Diff: c++/src/Reader.cc —
          @@ -1062,37 +1062,36 @@ namespace orc

          { // PASS }

          + RowReader::~RowReader()

          { + // PASS + }

          +
          static const uint64_t DIRECTORY_SIZE_GUESS = 16 * 1024;

          • class ReaderImpl : public Reader {
            + class ReaderImpl;
            +
            + class RowReaderImpl : public RowReader {
            private:
            const Timezone& localTimezone;

          // inputs

          • std::unique_ptr<InputStream> stream;
          • ReaderOptions options;
          • const uint64_t fileLength;
          • const uint64_t postscriptLength;
            std::vector<bool> selectedColumns;
            -
            + std::shared_ptr<InputStream> stream;
              • End diff –

          I'd suggest creating a single class that has all of the state for the ReaderImpl. Then it can have the postscript, InputStream, etc. and be shared between the ReaderImpl and the set of RowReaderImpls that share the same ReaderImpl.

          Show
          githubbot ASF GitHub Bot added a comment - Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/41#discussion_r71443174 — Diff: c++/src/Reader.cc — @@ -1062,37 +1062,36 @@ namespace orc { // PASS } + RowReader::~RowReader() { + // PASS + } + static const uint64_t DIRECTORY_SIZE_GUESS = 16 * 1024; class ReaderImpl : public Reader { + class ReaderImpl; + + class RowReaderImpl : public RowReader { private: const Timezone& localTimezone; // inputs std::unique_ptr<InputStream> stream; ReaderOptions options; const uint64_t fileLength; const uint64_t postscriptLength; std::vector<bool> selectedColumns; - + std::shared_ptr<InputStream> stream; End diff – I'd suggest creating a single class that has all of the state for the ReaderImpl. Then it can have the postscript, InputStream, etc. and be shared between the ReaderImpl and the set of RowReaderImpls that share the same ReaderImpl.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user majetideepak commented on the issue:

          https://github.com/apache/orc/pull/41

          Included review comments.
          This patch is huge because I split the monolithic Reader.cc into various files

          • Reader.hh/Reader.cc: ReaderImpl and RowReaderImpl
          • Statistics.hh/Statistics.cc: ColumnStatisticsImpl, StatisticsImpl, Type*StatisticsImpl
          • StripeStream.hh/StripeStream.cc: StripeStreamsImpl, StreamInformationImpl, StripeInformationImpl
          • Options.hh: ReaderOptions, RowReaderOptions

          @omalley any feedback ?

          Show
          githubbot ASF GitHub Bot added a comment - Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/41 Included review comments. This patch is huge because I split the monolithic Reader.cc into various files Reader.hh/Reader.cc: ReaderImpl and RowReaderImpl Statistics.hh/Statistics.cc: ColumnStatisticsImpl, StatisticsImpl, Type*StatisticsImpl StripeStream.hh/StripeStream.cc: StripeStreamsImpl, StreamInformationImpl, StripeInformationImpl Options.hh: ReaderOptions, RowReaderOptions @omalley any feedback ?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on the issue:

          https://github.com/apache/orc/pull/41

          This is looking good. To compile on MacOS, I needed to replace some of the std::move with REDUNDANT_MOVE. See omalley:pr/41

          Show
          githubbot ASF GitHub Bot added a comment - Github user omalley commented on the issue: https://github.com/apache/orc/pull/41 This is looking good. To compile on MacOS, I needed to replace some of the std::move with REDUNDANT_MOVE. See omalley:pr/41
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user majetideepak commented on the issue:

          https://github.com/apache/orc/pull/41

          Replaced with REDUNDANT_MOVE. valgrind does not show any memory leaks due to this patch.

          Show
          githubbot ASF GitHub Bot added a comment - Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/41 Replaced with REDUNDANT_MOVE. valgrind does not show any memory leaks due to this patch.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on the issue:

          https://github.com/apache/orc/pull/41

          I think we should have FileContents be a shared ptr so that if the user deallocates the Reader, the RowReader won't reach into deallocated memory.

          Show
          githubbot ASF GitHub Bot added a comment - Github user omalley commented on the issue: https://github.com/apache/orc/pull/41 I think we should have FileContents be a shared ptr so that if the user deallocates the Reader, the RowReader won't reach into deallocated memory.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user majetideepak commented on the issue:

          https://github.com/apache/orc/pull/41

          I updated the patch to use a `shared_ptr` for `FileContents`.

          Show
          githubbot ASF GitHub Bot added a comment - Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/41 I updated the patch to use a `shared_ptr` for `FileContents`.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/orc/pull/41

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/orc/pull/41
          Hide
          owen.omalley Owen O'Malley added a comment -

          I just committed this. Thanks, Deepak!

          Show
          owen.omalley Owen O'Malley added a comment - I just committed this. Thanks, Deepak!
          Hide
          mdeepak Deepak Majeti added a comment -

          Thanks Owen O'Malley! This is very helpful.

          Show
          mdeepak Deepak Majeti added a comment - Thanks Owen O'Malley ! This is very helpful.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on the issue:

          https://github.com/apache/orc/pull/41

          Deepak,
          We got a notice of a new problem from Coverity. Can you look at it?

            • CID 173749: Uninitialized members (UNINIT_CTOR)
              /c++/src/Reader.cc: 259 in orc::RowReaderImpl::RowReaderImpl(std::shared_ptr<orc::FileContents>, const orc::RowReaderOptions &)()

          ________________________________________________________________________________________________________

              • CID 173749: Uninitialized members (UNINIT_CTOR)
                /c++/src/Reader.cc: 259 in orc::RowReaderImpl::RowReaderImpl(std::shared_ptr<orc::FileContents>, const orc::RowReaderOptions &)()
                253 } else { 254 previousRow = firstRowOfStripe[firstStripe]-1; 255 }

                256
                257 ColumnSelector column_selector(contents.get());
                258 column_selector.updateSelected(selectedColumns, options);
                >>> CID 173749: Uninitialized members (UNINIT_CTOR)
                >>> Non-static class member "rowsInCurrentStripe" is not initialized in this constructor nor in any functions that it calls.
                259 }
                260
                261 const RowReaderOptions& RowReaderImpl::getRowReaderOptions() const

                { 262 return options; 263 }

                264

          Show
          githubbot ASF GitHub Bot added a comment - Github user omalley commented on the issue: https://github.com/apache/orc/pull/41 Deepak, We got a notice of a new problem from Coverity. Can you look at it? CID 173749: Uninitialized members (UNINIT_CTOR) /c++/src/Reader.cc: 259 in orc::RowReaderImpl::RowReaderImpl(std::shared_ptr<orc::FileContents>, const orc::RowReaderOptions &)() ________________________________________________________________________________________________________ CID 173749: Uninitialized members (UNINIT_CTOR) /c++/src/Reader.cc: 259 in orc::RowReaderImpl::RowReaderImpl(std::shared_ptr<orc::FileContents>, const orc::RowReaderOptions &)() 253 } else { 254 previousRow = firstRowOfStripe[firstStripe]-1; 255 } 256 257 ColumnSelector column_selector(contents.get()); 258 column_selector.updateSelected(selectedColumns, options); >>> CID 173749: Uninitialized members (UNINIT_CTOR) >>> Non-static class member "rowsInCurrentStripe" is not initialized in this constructor nor in any functions that it calls. 259 } 260 261 const RowReaderOptions& RowReaderImpl::getRowReaderOptions() const { 262 return options; 263 } 264
          Hide
          githubbot ASF GitHub Bot added a comment -
          Show
          githubbot ASF GitHub Bot added a comment - Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/41 @omalley https://issues.apache.org/jira/browse/ORC-127 https://github.com/apache/orc/pull/77 Thanks.
          Hide
          owen.omalley Owen O'Malley added a comment -

          ORC 1.3.0 was released.

          Show
          owen.omalley Owen O'Malley added a comment - ORC 1.3.0 was released.

            People

            • Assignee:
              mdeepak Deepak Majeti
              Reporter:
              mdeepak Deepak Majeti
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development