Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-11782

[GLib][Ruby][Dataset] Remove bindings for internal classes

Details

    Description

      GLib and ruby include bindings for internal classes such as ScanOptions, ScanContext, InMemoryScanTask, ScanTask, ... These are probably unnecessary and should be removed to present a simpler interface less prone to breakage under refactoring of the wrapped classes https://github.com/apache/arrow/pull/9532/checks?check_run_id=1974229719#step:8:2071

      Attachments

        Issue Links

          Activity

            bkietz Ben Kietzman added a comment - mrkn
            westonpace Weston Pace added a comment -

            This may be an issue for ARROW-7001.  ScanTask is changing to use some of the internal async utilities which I do not believe we are not exposing yet (AsyncGenerator, Future, etc.)

            westonpace Weston Pace added a comment - This may be an issue for ARROW-7001 .  ScanTask is changing to use some of the internal async utilities which I do not believe we are not exposing yet (AsyncGenerator, Future, etc.)
            kou Kouhei Sutou added a comment - - edited

            It seems that pyarrow also has bindings of ScanTask, .... Should we also remove them from pyarrow?

            kou Kouhei Sutou added a comment - - edited It seems that pyarrow also has bindings of ScanTask , .... Should we also remove them from pyarrow?
            kou Kouhei Sutou added a comment -

            R also has ScanTask bindings.

            kou Kouhei Sutou added a comment - R also has ScanTask bindings.
            westonpace Weston Pace added a comment -

            Thanks for pointing this out.  I think we can't get rid of ScanTask.  I will create a ScanTaskInternal (or some similar name) for the work I'm doing and leave ScanTask's interface as is.

            In the future though it might be better to change Scanner::Scan to return a RecordBatchIterator instead of ScanTaskIterator.  Otherwise we're putting the burden on the user to manage how many files are loaded concurrently.  I think that should either be hidden completely or controlled by some option (MaxConcurrentFiles or, better yet, MaxAllocatedBytes)

            westonpace Weston Pace added a comment - Thanks for pointing this out.  I think we can't get rid of ScanTask.  I will create a ScanTaskInternal (or some similar name) for the work I'm doing and leave ScanTask's interface as is. In the future though it might be better to change Scanner::Scan to return a RecordBatchIterator instead of ScanTaskIterator.  Otherwise we're putting the burden on the user to manage how many files are loaded concurrently.  I think that should either be hidden completely or controlled by some option (MaxConcurrentFiles or, better yet, MaxAllocatedBytes)
            bkietz Ben Kietzman added a comment -

            IIUC, python and R only use ScanTasks when materializing a stream of RecordBatches instead of using Scanner::ToTable. That functionality could easily be moved upstream into C++ by providing Scanner::GetRecordBatches or similar, which would be easier to consume from bindings in any case. Therefore I'd say the wrappers for ScanTask could also be removed from Python and R.

            If that's not acceptable, I'd think that ARROW-7001 should prefer to add ScanTask::ExecuteAsync and refactor ScanTask::Execute to be a backwards compatible wrapper.

            jorisvandenbossche npr westonpace

            bkietz Ben Kietzman added a comment - IIUC, python and R only use ScanTasks when materializing a stream of RecordBatches instead of using Scanner::ToTable. That functionality could easily be moved upstream into C++ by providing Scanner::GetRecordBatches or similar, which would be easier to consume from bindings in any case. Therefore I'd say the wrappers for ScanTask could also be removed from Python and R. If that's not acceptable, I'd think that ARROW-7001 should prefer to add ScanTask::ExecuteAsync and refactor ScanTask::Execute to be a backwards compatible wrapper. jorisvandenbossche npr westonpace

            I would love to delete ScanTask from the R bindings. The reason they're exposed there is to support a (hacky, experimental) attempt to do computations on the stream of record batches so that it's possible to compute things that we couldn't do otherwise because we can't hold the whole Table in memory. So Scanner::ToBatches doesn't work in that case because everything would be materialized.

            What I really want is to be able to essentially pass a function/lambda to something like ToTable or ToBatches and have that function be applied to every record batch in the stream. I don't want to manage consuming the ScanTasks/RecordBatchIterators, I'd prefer to have the C++ library handle that. (In my current hacky use of ScanTasks, it's actually prohibitively slow because it has to consume the iterators single-threaded.)

            npr Neal Richardson added a comment - I would love to delete ScanTask from the R bindings. The reason they're exposed there is to support a (hacky, experimental) attempt to do computations on the stream of record batches so that it's possible to compute things that we couldn't do otherwise because we can't hold the whole Table in memory. So Scanner::ToBatches doesn't work in that case because everything would be materialized. What I really want is to be able to essentially pass a function/lambda to something like ToTable or ToBatches and have that function be applied to every record batch in the stream. I don't want to manage consuming the ScanTasks/RecordBatchIterators, I'd prefer to have the C++ library handle that. (In my current hacky use of ScanTasks, it's actually prohibitively slow because it has to consume the iterators single-threaded.)
            bkietz Ben Kietzman added a comment -

            We could certainly provide Scanner::VisitBatches instead of/in addition to Scanner::ToBatches. I'll open a JIRA for both methods and make a PR today, so that everyone here can see if it accomodates their use cases

            bkietz Ben Kietzman added a comment - We could certainly provide Scanner::VisitBatches instead of/in addition to Scanner::ToBatches. I'll open a JIRA for both methods and make a PR today, so that everyone here can see if it accomodates their use cases
            bkietz Ben Kietzman added a comment - https://issues.apache.org/jira/browse/ARROW-11797
            kou Kouhei Sutou added a comment -

            Thanks. I understand.

            I agree with removing GLib bindings of them.

            kou Kouhei Sutou added a comment - Thanks. I understand. I agree with removing GLib bindings of them.
            kou Kouhei Sutou added a comment -

            Issue resolved by pull request 10533
            https://github.com/apache/arrow/pull/10533

            kou Kouhei Sutou added a comment - Issue resolved by pull request 10533 https://github.com/apache/arrow/pull/10533
            rokm Rok Mihevc added a comment -

            This issue has been migrated to issue #27634 on GitHub. Please see the migration documentation for further details.

            rokm Rok Mihevc added a comment - This issue has been migrated to issue #27634 on GitHub. Please see the migration documentation for further details.

            People

              kou Kouhei Sutou
              bkietz Ben Kietzman
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 20m
                  1h 20m