Apache Arrow / ARROW-7048

[Java] Support for combining multiple vectors under VectorSchemaRoot

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: Java

    Description

      Hi,

       

      pyarrow.Table.combine_chunks provides a convenient way to combine multiple record batches into a single pyarrow.Table.

       

      I am currently working on a downstream application that reads data from BigQuery. The BigQuery Storage API can output data in Arrow format, but it streams the data in many batches of 1024 rows or fewer.

      It would be really nice if the Arrow Java API provided this functionality under an abstraction like VectorSchemaRoot.

      Following guidance from emkornfield@gmail.com, I tried to write my own implementation that copies the data vector by vector using TransferPair's copyValueSafe.

      But, unless I am missing something obvious, it turns out that copyValueSafe only copies one value at a time. That means a lot of looping, calling copyValueSafe for millions of rows to move each value from a source vector index to a target vector index. Ideally, I would want to concatenate or link the underlying buffers rather than copy one cell at a time.
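To illustrate the difference between per-value copying and buffer-level concatenation, here is a minimal sketch that does not use Arrow at all. It treats each batch as a plain java.nio.ByteBuffer of fixed-width 32-bit values and appends each batch with one bulk copy. The names BufferConcatSketch, concat, and batchOf are hypothetical helpers invented for this example, not Arrow APIs. Real Arrow vectors also carry a validity bitmap, and variable-width vectors carry an offset buffer whose entries would need shifting, so this only sketches the fixed-width data buffer case.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Arrow): concatenates fixed-width value buffers. */
final class BufferConcatSketch {

    /** Width in bytes of one 32-bit integer value. */
    static final int INT_WIDTH = Integer.BYTES;

    /** Appends every batch to one contiguous buffer using a single bulk copy per batch. */
    static ByteBuffer concat(List<ByteBuffer> batches) {
        int totalBytes = 0;
        for (ByteBuffer batch : batches) {
            totalBytes += batch.remaining();
        }
        ByteBuffer combined = ByteBuffer.allocate(totalBytes).order(ByteOrder.LITTLE_ENDIAN);
        for (ByteBuffer batch : batches) {
            // duplicate() shares the bytes but has its own position,
            // so the caller's buffer is not consumed by the copy.
            combined.put(batch.duplicate());
        }
        combined.flip();
        return combined;
    }

    /** Builds a little-endian buffer of ints, standing in for one streamed batch. */
    static ByteBuffer batchOf(int... values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * INT_WIDTH).order(ByteOrder.LITTLE_ENDIAN);
        for (int v : values) {
            buf.putInt(v);
        }
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer combined = concat(Arrays.asList(batchOf(1, 2, 3), batchOf(4, 5)));
        int rowCount = combined.remaining() / INT_WIDTH;
        for (int row = 0; row < rowCount; row++) {
            System.out.println(combined.getInt(row * INT_WIDTH));
        }
    }
}
```

The number of copy calls here grows with the number of batches rather than the number of rows, which is the property the request is after.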

       

      E.g., if I have:

      List<VectorSchemaRoot> batchList = new ArrayList<>();
      try (ArrowStreamReader reader = new ArrowStreamReader(new ByteArrayInputStream(out.toByteArray()), allocator)) {
          Schema schema = reader.getVectorSchemaRoot().getSchema();
          for (int i = 0; i < 5; i++) {
              // This will be loaded with new values on every call to loadNextBatch
              VectorSchemaRoot readBatch = reader.getVectorSchemaRoot();
              reader.loadNextBatch();
              batchList.add(readBatch);
          }
      }
      
      // Proposed API (does not exist yet):
      // VectorSchemaRoot.combineChunks(batchList, newVectorSchemaRoot);

       

      Would a method like VectorSchemaRoot.combineChunks(List<VectorSchemaRoot>) be possible?

      I did read the VectorSchemaRoot discussion at https://issues.apache.org/jira/browse/ARROW-6896 and am not sure if it's the right thing to use here.

       

       

      PS. Feel free to update the title of this feature request with more appropriate wording.

       

      Cheers,

      Yogesh

       

       

            People

              Assignee: Liya Fan (fan_li_ya)
              Reporter: Yogesh Tewari (yogeshtewari)
              Votes: 0
              Watchers: 3

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h 10m