Details

    Description

      A feature request. I've seen this pop up in a few places. Want to have a record of discussion on this topic. 

      I may be open to contributing this, but first need some general guidance on approach so I can understand effort level. 

      It looks like there is no good tool available for binding GObject Introspection to .NET, so the easy pathway via the Arrow GLib C API appears to be closed. 

      The only GObject integration for .NET appears to be Mono GAPI

      http://www.mono-project.com/docs/gui/gtksharp/gapi/

      From what I can see this produces a GIR or similar XML, then generates C# code directly from that. Likely involves many manual fix ups of the XML. Worth a try? 

       

      Alternatively I could look at generating some other direct binding from .NET to C/C++. Where I work we use Swig http://www.swig.org/. Good for vanilla cases, requires hand crafting of the .i files and specialized marshalling strategies for optimizing performance critical cases. 

      Haven't tried CppSharp but it looks more appealing than Swig in some ways https://github.com/mono/CppSharp/wiki/Users-Manual

      In either case, I'm not sure whether it is better to use the GLib C API or the C++ API directly. What would be the pros and cons? 

       

       

       

       

       

          Activity

            wesm Wes McKinney added a comment -

            My guess would be that building a smallish native implementation of the Arrow columnar data structures would be one good path to best support and performance for .NET users, as well as to make development more accessible for C# folks. I'm not an expert though. A binding layer could be developed to use the C++ libraries in C#.

            Jamie Elliott Jamie Elliott added a comment -

            Thanks Wes, can you please elaborate a little. What could be the performance advantages of a native implementation over a binding layer? In terms of a reference implementation for .NET, would the Java implementation be the closest model? 

            wesm Wes McKinney added a comment - - edited

            Again, I'm not an expert, but I think that going through C bindings to a C++ library would prevent the CLR runtime from generating possibly better code on hot paths.

            I'm not sure whether the Java or C++ implementation is going to be a better model for a .NET library. The Java codebase originally grew organically within Apache Drill until the formation of Apache Arrow. The C++ library was started from scratch at the beginning of 2016.

            kou Kouhei Sutou added a comment -

            Thanks for thinking about .NET support! From my point of view:

            GLib C API based bindings:

            Pros:

            • We can get a workable .NET library with the least work if Mono GAPI works well.
              • We can work together on the Ruby library and the .NET library.

            Cons:

            • There is more overhead than with C++ API based bindings.
            • We haven't tried building the GLib C API on Windows yet. (I need a Windows environment...)
              • I'll work on it eventually.

            C++ API based bindings:

            Pros:

            • We may get a workable .NET library with less work if Mono CppSharp works well. (It seems that it works well.)
            • We can use the optimizations in the C++ API. (An LLVM based JIT is planned.)

            Cons:

            • It may be slower than a native C# implementation.
            • It may be more difficult to use than a native C# implementation.
            • It may be more difficult to install than a native C# implementation.

            C# native implementation:

            Pros:

            • It may be faster than bindings.
            • It will be easy to use.
            • It will be easy to install.

            Cons:

            • It needs a lot of work.
              • We can implement it step by step.
            Jamie Elliott Jamie Elliott added a comment - - edited

            I've given a little more thought to the idea of a native C# implementation. I found the C++ implementation the easiest to understand. 

            I'm considering a narrow proof of concept that would replicate the classes in arrow/cpp/src/arrow/ but not its subfolders. 

            Hopefully that is enough to replicate the example at https://arrow.apache.org/docs/cpp/md_tutorials_row_wise_conversion.html

            It seems to me the scope of that is manageable and there are some more or less ready made components in corefx. 

            MemoryPool

            [C++ MemoryPool|https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.h] can be replicated via [C# MemoryPool|https://github.com/dotnet/corefx/blob/master/src/System.Memory/src/System/Buffers/MemoryPool.cs].

            Maybe start with a built-in memory pool that allocates a large block of managed memory and pins it: https://github.com/aspnet/Common/tree/dev/shared/Microsoft.Extensions.Buffers.MemoryPool.Sources

            Alternatively, we could P/Invoke the Arrow C++ allocator. 

            Another interesting point of reference is https://github.com/allisterb/jemalloc.NET

            Buffer

            [C++ Buffer|https://github.com/apache/arrow/blob/master/cpp/src/arrow/buffer.h] can likely be replicated by something built on top of Memory<T>. Span<T> and Memory<T> are used for zero-copy slicing:

            https://msdn.microsoft.com/en-us/magazine/mt814808.aspx
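To make the zero-copy slicing idea concrete, here is a minimal C++ sketch. It is an illustration only, not the Arrow API: the class name SketchBuffer and its members are hypothetical, and a real buffer would also need mutability flags, parent tracking, and padding rules. The point is that a slice shares the backing storage and only adjusts a pointer and a length, which is the same behaviour Memory<T>.Slice offers in .NET.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a buffer class: a read-only view
// over bytes that keeps its backing storage alive through a shared_ptr.
class SketchBuffer {
 public:
  // Take ownership of a byte vector and view all of it.
  explicit SketchBuffer(std::vector<std::uint8_t> bytes)
      : storage_(std::make_shared<std::vector<std::uint8_t>>(std::move(bytes))),
        data_(storage_->data()),
        size_(storage_->size()) {}

  // Zero-copy slice: the result shares storage_ and only adjusts the
  // pointer and the length. No bytes are copied.
  SketchBuffer Slice(std::size_t offset, std::size_t length) const {
    assert(offset <= size_ && length <= size_ - offset);
    return SketchBuffer(storage_, data_ + offset, length);
  }

  const std::uint8_t* data() const { return data_; }
  std::size_t size() const { return size_; }

 private:
  // Private constructor used by Slice to build a view into shared storage.
  SketchBuffer(std::shared_ptr<std::vector<std::uint8_t>> storage,
               const std::uint8_t* data, std::size_t size)
      : storage_(std::move(storage)), data_(data), size_(size) {}

  std::shared_ptr<std::vector<std::uint8_t>> storage_;  // keeps bytes alive
  const std::uint8_t* data_;                            // start of this view
  std::size_t size_;                                    // length of this view
};
```

Because the slice holds its own shared_ptr to the storage, it stays valid even after the original SketchBuffer goes out of scope. In managed C# the garbage collector plays that role, which is why a plain reference can stand in for std::shared_ptr there.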

            Array

            https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h

            Builds naturally from Buffer.

            Note that ArrayVector = std::vector<std::shared_ptr<Array>>;

            ChunkedArray 

            A data structure managing a list of primitive Arrow arrays logically as one large array

            https://github.com/apache/arrow/blob/master/cpp/src/arrow/table.h

            Compare to https://github.com/dotnet/corefx/blob/master/src/System.IO.Pipelines/src/System/IO/Pipelines/BufferSegment.cs
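As a rough illustration of "a list of arrays logically as one large array", the C++ sketch below resolves a logical index across several chunks. The names SketchChunkedArray and Int64Chunk are hypothetical, and a plain std::vector stands in for an Arrow array; the real ChunkedArray in table.h carries type information and null counts that are omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Arrow array of 64-bit integers.
using Int64Chunk = std::vector<std::int64_t>;

// Presents several independently allocated chunks as one logical array.
class SketchChunkedArray {
 public:
  explicit SketchChunkedArray(
      std::vector<std::shared_ptr<const Int64Chunk>> chunks)
      : chunks_(std::move(chunks)), length_(0) {
    // The logical length is the sum of the chunk lengths.
    for (const auto& chunk : chunks_) length_ += chunk->size();
  }

  std::size_t length() const { return length_; }
  std::size_t num_chunks() const { return chunks_.size(); }

  // Resolve a logical index by walking the chunks in order and
  // subtracting each chunk's length until the index falls inside one.
  std::int64_t Value(std::size_t index) const {
    assert(index < length_);
    for (const auto& chunk : chunks_) {
      if (index < chunk->size()) return (*chunk)[index];
      index -= chunk->size();
    }
    return 0;  // not reached when index < length_
  }

 private:
  std::vector<std::shared_ptr<const Int64Chunk>> chunks_;
  std::size_t length_;
};
```

The linear walk keeps the sketch short. With many chunks, a precomputed table of cumulative offsets plus a binary search would be the more likely choice.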

             

             

            Note one assumption is that in general std::shared_ptr<T> can be replaced by just T in C# managed classes. 

            Gotta run now, more to follow...

             

             

             

             

             

            Jamie Elliott Jamie Elliott added a comment -

            I wanted to say - I already thought of a cool name for a .NET Arrow implementation. Can anyone guess? 

            uwe Uwe Korn added a comment -

            Thanks Jamie Elliott for taking a look at this! Some feedback to your proposal:

            • MemoryPool: I guess it is best to go with https://github.com/allisterb/jemalloc.NET. A speciality we have in Arrow is that we want 64-byte-aligned allocations to make the most of SIMD instruction sets. I don't see an interface in the other linked options that provides aligned allocation and, especially, aligned reallocation.
            • Buffer: At first glance, the .NET interfaces mentioned here seem to fit quite well with what we have in the C++ implementation.
            • ChunkedArray: Not sure how much you would want to build upon BufferSegment. It provides a nice .NET-style scaffold for a ChunkedArray API, but it is still a bit away from what ChunkedArray is really about.
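The aligned allocation and aligned reallocation requirement can be sketched in portable C++ as follows. This is an assumption-laden illustration, not Arrow's MemoryPool or jemalloc: the function names are hypothetical, and the technique shown is the classic one of over-allocating with malloc and stashing the original pointer just below the aligned block. Reallocation is done as allocate, copy, free, because plain realloc does not promise to preserve a 64-byte alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

constexpr std::size_t kAlignment = 64;  // alignment discussed in the thread

// Allocate `size` bytes whose address is a multiple of kAlignment.
// Returns nullptr if the underlying malloc fails.
inline std::uint8_t* AllocateAligned(std::size_t size) {
  // Extra room: up to kAlignment - 1 padding bytes plus one stashed pointer.
  void* raw = std::malloc(size + kAlignment - 1 + sizeof(void*));
  if (raw == nullptr) return nullptr;
  std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
  std::uintptr_t aligned = (start + kAlignment - 1) / kAlignment * kAlignment;
  // Remember the malloc pointer just below the aligned block so that
  // FreeAligned can recover it later.
  reinterpret_cast<void**>(aligned)[-1] = raw;
  return reinterpret_cast<std::uint8_t*>(aligned);
}

// Release a block obtained from AllocateAligned. Accepts nullptr.
inline void FreeAligned(std::uint8_t* ptr) {
  if (ptr != nullptr) std::free(reinterpret_cast<void**>(ptr)[-1]);
}

// Aligned reallocation: allocate a new aligned block, copy the bytes that
// fit, then free the old block. On failure the old block is left untouched.
inline std::uint8_t* ReallocateAligned(std::uint8_t* old_ptr,
                                       std::size_t old_size,
                                       std::size_t new_size) {
  std::uint8_t* fresh = AllocateAligned(new_size);
  if (fresh == nullptr) return nullptr;
  if (old_ptr != nullptr) {
    std::memcpy(fresh, old_ptr, old_size < new_size ? old_size : new_size);
    FreeAligned(old_ptr);
  }
  return fresh;
}

// True when the address is a multiple of kAlignment.
inline bool IsAligned(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % kAlignment == 0;
}
```

An allocator with native aligned entry points, such as jemalloc, can usually grow a block in place and avoid the copy, which is one reason it looks attractive here.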

            Cool name: No good idea, but in your text you naturally used the word narrow. But this is probably more a bad pun than a good name.

            wesm Wes McKinney added a comment -

            I have renamed this JIRA since we are receiving a donation of a native C# .NET implementation

            Jamie Elliott Jamie Elliott added a comment -

            Hey! Sorry I let this slide. I had more or less clear in my head what I was planning but just got too busy with my day job. If someone is donating that is very exciting. Can you give any more details? 

            BTW - the name I was going to suggest was SharpArrow. 

            wesm Wes McKinney added a comment -

            See https://github.com/apache/arrow/pull/2815. I think it's just going to be Apache.Arrow in C#

            wesm Wes McKinney added a comment - Resolved in https://github.com/apache/arrow/commit/940254200febf017b4d912912f836cddeb76ee0b
            rokm Rok Mihevc added a comment -

            This issue has been migrated to issue #15733 on GitHub. Please see the migration documentation for further details.


            People

              Unassigned Unassigned
              Jamie Elliott Jamie Elliott
