Apache Arrow / ARROW-18063

[C++][Python] Custom streaming data providers in run_query


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++
    • Labels: None

    Description

      Mailing list thread

      The goal is to:

      • generate a Substrait plan in Python using Ibis
      • ... wherein tables are specified using custom URLs
      • use the Python API run_query to execute the plan
      • ... against source data which is streamed from those URLs rather than pulled fully into local memory

      The obstacles include:

      • The API for constructing a data stream from the custom URLs is only available in C++
      • The Python run_query function requires tables as input and cannot accept a RecordBatchReader, even if one could be constructed from a custom URL
      • Writing custom Cython is not preferred

      Some potential solutions:

      • Use ExecuteSerializedPlan(), which is directly usable from C++, so that construction of data sources need not be handled in Python. Passing a buffer from Python/Ibis down to C++ is much simpler and can be done without writing Cython
      • Refactor NamedTableProvider from a lambda mapping names -> data source into a registry, so that data source factories can be added from C++ and then referenced by name from Python
      • Extend run_query to support non-Table sources and require the user to write a Python mapping from URLs to pa.RecordBatchReader
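To make the registry idea concrete, here is a minimal pure-Python sketch of the shape it might take. Everything in it is hypothetical and not part of Arrow: SourceRegistry, counting_source, and the "demo" URL scheme are illustrative names only, and a plain generator of row lists stands in for a pa.RecordBatchReader. The point is just that a source is looked up by URL and then yields batches lazily instead of materializing a whole table.

```python
from typing import Callable, Dict, Iterator, List
from urllib.parse import urlparse

# A "batch" is a plain list of row dicts here; in Arrow it would be a RecordBatch.
Batch = List[dict]
SourceFactory = Callable[[str], Iterator[Batch]]


class SourceRegistry:
    """Hypothetical registry mapping URL schemes to streaming source factories."""

    def __init__(self) -> None:
        self._factories: Dict[str, SourceFactory] = {}

    def register(self, scheme: str, factory: SourceFactory) -> None:
        # Refuse silent overwrites so two providers cannot claim one scheme.
        if scheme in self._factories:
            raise ValueError(f"scheme already registered: {scheme!r}")
        self._factories[scheme] = factory

    def open(self, url: str) -> Iterator[Batch]:
        # Dispatch on the URL scheme, e.g. "demo" for "demo://table_a".
        scheme = urlparse(url).scheme
        try:
            factory = self._factories[scheme]
        except KeyError:
            raise KeyError(f"no source registered for scheme {scheme!r}") from None
        return factory(url)


def counting_source(url: str) -> Iterator[Batch]:
    """Toy source: lazily yields three batches of two rows each."""
    for start in range(0, 6, 2):
        yield [{"url": url, "x": start}, {"url": url, "x": start + 1}]


registry = SourceRegistry()
registry.register("demo", counting_source)

stream = registry.open("demo://table_a")
first = next(stream)  # only the first batch has been produced at this point
total = sum(row["x"] for row in first) + sum(
    row["x"] for batch in stream for row in batch
)
```

In the real proposal the factories would presumably be registered from C++ and only the name or URL would cross the Python boundary; this sketch only illustrates the lookup-then-stream pattern.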

          People

            Assignee: Unassigned
            Reporter: Ben Kietzman (bkietz)