Uploaded image for project: 'Beam'
  1. Beam
  2. BEAM-11513

Unable to load Pandas dataframe from BigQuery

Details

    • Bug
    • Status: Resolved
    • P3
    • Resolution: Duplicate
    • 2.26.0
    • Not applicable
    • dsl-dataframe
    • None

    Description

      Doing this:

      query = """
      SELECT  
        airline,
        departure_airport,
        arrival_airport,
        departure_delay,
        arrival_delay
      FROM `bigquery-samples.airline_ontime_data.flights`
      """
      with beam.Pipeline() as p:
          tbl = p | 'read table' >> beam.io.ReadFromBigQuery(query=query)
          tbl = tbl | 'assign ts' >> beam.Map(
              lambda x: beam.window.TimestampedValue(x, to_unixtime(x['date'])))
          daily = tbl | 'daily windows' >> beam.WindowInto(beam.window.FixedWindows(60*60*24))
          df = to_dataframe(daily)
          result = df.groupby('airline').apply(get_delay_at_top_airports)
          result.to_csv('output.csv')

      returns this error:

      ---------------------------------------------------------------------------
      TypeError                                 Traceback (most recent call last)
      <ipython-input-41-f47b2674428f> in <module>
           13         lambda x: beam.window.TimestampedValue(x, to_unixtime(x['date'])))
           14     daily = tbl | 'daily windows' >> beam.WindowInto(beam.window.FixedWindows(60*60*24))
      ---> 15     df = to_dataframe(daily)
           16     result = df.groupby('airline').apply(get_delay_at_top_airports)
           17     result.to_csv('output.csv')
      
      /opt/conda/lib/python3.7/site-packages/apache_beam/dataframe/convert.py in to_dataframe(pcoll, proxy, label)
           69       # the name of these variables in the calling context.
           70       label = 'BatchElements(%s)' % _var_name(pcoll, 2)
      ---> 71     proxy = schemas.generate_proxy(pcoll.element_type)
           72     pcoll = pcoll | label >> schemas.BatchRowsAsDataFrame(proxy=proxy)
           73   return frame_base.DeferredFrame.wrap(
      
      /opt/conda/lib/python3.7/site-packages/apache_beam/dataframe/schemas.py in generate_proxy(element_type)
          178 
          179   else:
      --> 180     fields = named_fields_from_element_type(element_type)
          181     proxy = pd.DataFrame(columns=[name for name, _ in fields])
          182     for name, typehint in fields:
      
      /opt/conda/lib/python3.7/site-packages/apache_beam/typehints/schemas.py in named_fields_from_element_type(element_type)
          298 def named_fields_from_element_type(
          299     element_type):  # (type) -> typing.List[typing.Tuple[unicode, type]]
      --> 300   return named_fields_from_schema(schema_from_element_type(element_type))
          301 
          302 
      
      /opt/conda/lib/python3.7/site-packages/apache_beam/typehints/schemas.py in schema_from_element_type(element_type)
          293     raise TypeError(
          294         "Attempted to determine schema for unsupported type '%s'" %
      --> 295         element_type)
          296 
          297 
      
      TypeError: Attempted to determine schema for unsupported type 'Any'

      Attachments

        Issue Links

          Activity

            People

              bhulette Brian Hulette
              lak@vlakshman.com Valliappa Lakshmanan
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: