Uploaded image for project: 'Beam'
  1. Beam
  2. BEAM-12502

ib.collect fails to materialize named DeferredDataFrame instances

Details

    Description

      In the below example, note that we return an empty dataframe for ib.collect(deferred_df), but ib.collect(to_dataframe(rows)) works as expected.

      In [1]: import numpy as np                                                                                          
         ...: import pandas as pd                                                                                         
         ...:
         ...: import apache_beam as beam                                                                                                                                                                                                      
         ...: from apache_beam import Create, Map                                                                                                                                                                                             
         ...: from apache_beam.dataframe.convert import to_dataframe                                                                                                                                                                          
         ...: from apache_beam.dataframe.convert import to_pcollection                                                                                                                                                                        
         ...: from apache_beam.runners.interactive.interactive_runner import InteractiveRunner                                                                                                                                                
         ...: import apache_beam.runners.interactive.interactive_beam as ib                                                                                                                                                                   
                                                                                                                                                                                                                                              
      In [2]: birds = [                                                                                                                                                                                                                       
         ...:     {                                                                                                                                                                                                                           
         ...:       "name": "American crow",                                                                                                                                                                                                  
         ...:       "scientific_name": "Corvus brachyrhynchos",                                                                                                                                                                               
         ...:       "order": "Passeriformes",                                                                                                                                                                                                 
         ...:       "family": "Corvidae"                                                                                                                                                                                                      
         ...:     },                                                                                                                                                                                                                          
         ...:     {                                                                                                                                                                                                                           
         ...:       "name": "Canada goose",                                                                                                                                                                                                   
         ...:       "scientific_name": "Branta canadensis",                                                                                                                                                                                   
         ...:       "order": "Anseriformes",                                                                                                                                                                                                  
         ...:       "family": "Anatidae"                                                                                                                                                                                                      
         ...:     },                                                                                                                                                                                                                          
         ...:     {                                                                                                                                                                                                                           
         ...:       "name": "mallard",                                                                                                                                                                                                        
         ...:       "scientific_name": "Anas platyrhynchos",                                                                                                                                                                                  
         ...:       "order": "Anseriformes",                                                                                                                                                                                                  
         ...:       "family": "Anatidae"                                                                                                                                                                                                      
         ...:     }                                                                                                                                                                                                                           
         ...: ]                                                                                                                                                                                                                               
                                                                                                                                                                                                                                              
      In [3]: # create an interactive pipeline                                                                                                                                                                                                
         ...: p = beam.Pipeline(InteractiveRunner())                                                                                                                                                                                          
         ...:                                                                                                                                                                                                                                 
         ...:                                                                                                                                                                                                                                 
         ...: # create some pipeline data and map it to rows                                                                                                                                                                                  
         ...: rows = (p | "Create elements" >> Create(birds)                                                                                                                                                                                  
         ...:           | "To rows" >> Map(lambda bird: beam.Row(                                                         
         ...:               common_name=bird["name"],           
         ...:               scientific_name=bird["scientific_name"],                                                      
         ...:               order=bird["order"],                
         ...:               family=bird["family"])))            
      WARNING:apache_beam.runners.interactive.interactive_environment:You have limited Interactive Beam features since your ipython kernel is not connected to any notebook frontend.
      
      
      In [4]: ib.collect(rows)                                                                                                                                                                                                                
      'Processing...'                                                                                                                                                                                                                         
      'Done.'                                                                                                                                                                                                                                 
      Out[4]:                                                                                                                                                                                                                                 
           common_name        scientific_name          order    family                                                                                                                                                                        
      0  American crow  Corvus brachyrhynchos  Passeriformes  Corvidae                                                                                                                                                                        
      1   Canada goose      Branta canadensis   Anseriformes  Anatidae                                                                                                                                                                        
      2        mallard     Anas platyrhynchos   Anseriformes  Anatidae                                                                                                                                                                        
                                                                                                                                                                                                                                              
      In [5]: ib.collect(to_dataframe(rows))                                                                                                                                                                                                  
      'Processing...'                                                                                                                                                                                                                         
      'Done.'                                                                                                                                                                                                                                 
      Out[5]:                                                                                                                                                                                                                                 
           common_name        scientific_name          order    family                                                                                                                                                                        
      0  American crow  Corvus brachyrhynchos  Passeriformes  Corvidae                                                                                                                                                                        
      1   Canada goose      Branta canadensis   Anseriformes  Anatidae                                                                                                                                                                        
      2        mallard     Anas platyrhynchos   Anseriformes  Anatidae                                                                                                                                                                        
                                                                                                                                                                                                                                              
      In [6]: deferred_df = to_dataframe(rows)                                                                                                                                                                                                
                                                                                                                                                                                                                                              
      In [7]: ib.collect(deferred_df)                                                                                                                                                                                                         
      'Processing...'                                                                                                                                                                                                                         
      'Done.'                                                                                                                                                                                                                                 
      Out[7]:                                                                                                                                                                                                                                 
      Empty DataFrame                                                                                                                                                                                                                         
      Columns: []                                                                                                                                                                                                                             
      Index: []       
      

      Attachments

        Activity

          People

            rohdesam Sam Rohde
            bhulette Brian Hulette
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: