Avro / AVRO-1286

Python script avro cat should be able to read from stdin

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: python
    • Labels:
      None

      Description

      Currently, you have to specify a target file on the command line. But it would be nice to be able to stream data through avro cat.

        Issue Links

          Activity

          Uri Laserson created issue -
          Jeremy Kahn added a comment -

          Biggest headache here is that the python avro data file library requires that the file be seekable. Standard in is not seekable.

          I think this is a bug or a misfeature in the python library and probably deserves a ticket of its own.
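
To illustrate the seekability point, here is a minimal sketch, assuming Python 3 on a POSIX system, in which an OS pipe stands in for piped standard input. The helper name `is_seekable` is hypothetical and not part of the avro package.

```python
import io
import os


def is_seekable(stream):
    """Return True if the stream reports support for random access."""
    try:
        return stream.seekable()
    except AttributeError:
        # File-like objects without seekable() are treated as non-seekable.
        return False


# An OS pipe stands in for piped standard input (assumption for this sketch).
read_fd, write_fd = os.pipe()
with os.fdopen(read_fd, "rb") as pipe_reader, os.fdopen(write_fd, "wb"):
    pipe_is_seekable = is_seekable(pipe_reader)

# An in-memory buffer, by contrast, supports seeking.
memory_is_seekable = is_seekable(io.BytesIO(b"avro bytes"))
```

Any reader that calls `seek` on its input will therefore fail on a pipe unless the data is first copied into something seekable.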

          Harsh J added a comment -

          Dupe of AVRO-959?

          Jeremy Kahn added a comment -

          Oops, my request was a dupe of AVRO-959. This issue should be considered blocked by AVRO-959. Thanks, Harsh.

          Harsh J made changes -
          Link added: This issue is related to AVRO-959
          Scott Nottingham added a comment -

          What you are trying to do can be easily accomplished as follows:

              import sys
              import cStringIO  # Python 2 only

              # Read all of stdin into a seekable in-memory buffer.
              file_like_obj = cStringIO.StringIO()
              file_like_obj.write(sys.stdin.read())
              file_like_obj.seek(0)

          Now you can pass this file_like_obj into avro's read method.
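
A Python 3 variant of the same workaround might look like the sketch below. It assumes binary input (for example `sys.stdin.buffer`) and that the whole input fits in memory. `buffer_stream` is a hypothetical helper name, not an avro API.

```python
import io


def buffer_stream(stream, chunk_size=64 * 1024):
    """Copy a forward-only binary stream into a seekable in-memory buffer."""
    buf = io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # An empty read signals end of input.
            break
        buf.write(chunk)
    # Rewind so the consumer starts reading from the first byte.
    buf.seek(0)
    return buf


# Typical use in a script (not executed here):
#   import sys
#   seekable_input = buffer_stream(sys.stdin.buffer)
```

With the avro package installed, the returned buffer could then be passed as the first argument to `avro.datafile.DataFileReader`, together with an `avro.io.DatumReader()`. Memory use grows with the size of the input, so this only suits inputs that fit in RAM.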

          Uri Laserson added a comment -

          I've been a little out of the loop with Avro Python development, but do you mean passing file_like_obj to the DataFileReader constructor? IIRC, the constructor will then seek to the end of the file to get the number of bytes to expect. What happens if you seek to the end of the file in this case? Will it try to buffer the entire input stream in memory?

          Scott Nottingham added a comment -

          I was just trying to challenge the OP comment that "Standard in is not seekable". Using the method I posted, something from the command line can be piped (|) to a Python script as standard input and used by the DataFileReader constructor as if it were a file being read. This could be useful in map/reduce jobs where Hadoop will try to send mapped data to a reduce Python script as stdin. For a continuous stream (as it sounds like you are describing), I don't think this method would work.

          Sean Jensen-Grey added a comment -

          @nottings your solution works only for small Avro files because it reads the whole input into memory. The Python library should be able to extract Avro records from a non-seekable stream.

          Alexander Hasha added a comment -

          Has anyone thought any more about this recently? I'm looking at this issue for my own purposes. As far as I can tell so far, the calls to `seek` are not inherently necessary to parsing the data stream. There is one seek to determine the file length, but that looks like a convenience method for determining if the end of the file has been reached. (You can tell when that happens on a stream fairly easily.) You do need to seek backwards by `SYNC_SIZE`, but it seems like this could be accomplished by buffering a whole number of blocks in memory, not necessarily the whole file.

          I'd like to give this a shot, but am worried I'm failing to understand an important detail.
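
The bounded-buffering idea can be sketched as follows. This is not the avro library's reader and not a tested patch: `RewindableStream` and its `rewind` method are hypothetical names. The sketch keeps only a short tail of recent bytes (a simpler variant than buffering whole blocks) so that a backward step of up to `history` bytes works on a forward-only stream. The default of 16 assumes the 16-byte Avro sync marker.

```python
class RewindableStream:
    """Wrap a forward-only binary stream and allow short backward steps."""

    def __init__(self, raw, history=16):
        if history <= 0:
            raise ValueError("history must be positive")
        self._raw = raw          # underlying stream; only read(n) is required
        self._history = history  # maximum number of bytes that can be re-read
        self._tail = b""         # bytes immediately before the current position
        self._pending = b""      # rewound bytes, served before new data
        self._pos = 0

    def read(self, n):
        """Read up to n bytes; an empty result means end of stream."""
        out = self._pending[:n]
        self._pending = self._pending[n:]
        if len(out) < n:
            out += self._raw.read(n - len(out))
        # Remember only the most recent `history` bytes.
        self._tail = (self._tail + out)[-self._history:]
        self._pos += len(out)
        return out

    def rewind(self, k):
        """Step back k bytes, which must still be held in the tail."""
        if k < 0 or k > len(self._tail):
            raise ValueError("cannot rewind %d bytes" % k)
        if k:
            self._pending = self._tail[-k:] + self._pending
            self._tail = self._tail[:-k]
            self._pos -= k

    def tell(self):
        """Return the current logical position in the stream."""
        return self._pos
```

End of input shows up as an empty `read` result, so no seek to the end of the file is needed to detect it. Memory use is bounded by `history` plus whatever a single `read` returns.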

          Sean Jensen-Grey added a comment -

          I really haven't.

          When I entered AVRO-959 I wanted to use Python in a streaming job to process Avro files. The whole stack was broken with respect to that goal, not just seeking in the Avro reader code. I made a four-hour attempt at using a buffer and removing the seek calls. I personally wouldn't spend a ton of time on it.

          It might be better to use https://cffi.readthedocs.org/en/release-0.8/ to interface to the C Avro implementation. One gets speed, less memory overhead, and a library usable across all Python implementations.

          Uri Laserson added a comment -

          I have used this as a workaround:

          https://gist.github.com/laserson/8941547


            People

            • Assignee: Unassigned
            • Reporter: Uri Laserson
            • Votes: 0
            • Watchers: 6
