Apache Avro / AVRO-1307

Add an avro-tool to extract samples from avro files

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7.5
    • Component/s: java
    • Labels: None
    • java

    • new avro-tool cat that picks only some records from avro files

    Description

      It would be nice to have an avro-tool that picks only some records from avro files.

      I implemented a new avro-tool, cat, which takes a list of Avro files with identical schemas and concatenates them into a single file. It has options to discard the first n records (--offset), to limit the number of output records (--limit), and to sample records at a given rate (--samplerate).

      This tool allows a quicker peek into large avro files, e.g.:

      java -jar avro-tools.jar cat input.avro output.avro --offset 50 --limit 10
      # creates output.avro that contains records
      # 51 to 60 from input.avro.
      
      java -jar avro-tools.jar cat input.avro output.avro --offset 1000 --limit 100 --samplerate .01
      # skips the first 1000 records, then samples
      # every hundredth record from the rest, limiting
      # the output to 100 records.
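
      The interplay of the three options can be sketched in plain Java. This is only an illustration of the behaviour described above, not the code from the attached patch: the class name CatSketch, the method select, and the accumulator-based sampling are assumptions, and the real tool reads and writes Avro data files rather than in-memory lists.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the record selection described in this issue.
// Not the patch's implementation: names and the sampling strategy are assumed.
class CatSketch {

    /**
     * Skips the first `offset` records, then keeps roughly `samplerate`
     * of the remaining records, stopping once `limit` records are kept.
     */
    static <T> List<T> select(Iterable<T> records, long offset, long limit, double samplerate) {
        List<T> out = new ArrayList<>();
        long skipped = 0;
        double credit = 0.0; // accumulates samplerate; a record is kept each time it reaches 1

        for (T record : records) {
            if (skipped < offset) {   // discard the first `offset` records
                skipped++;
                continue;
            }
            if (out.size() >= limit) { // output is full, stop reading
                break;
            }
            credit += samplerate;
            if (credit >= 1.0) {       // deterministic sampling (assumed strategy)
                credit -= 1.0;
                out.add(record);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 1; i <= 100; i++) {
            records.add(i);
        }
        // Mirrors: cat input.avro output.avro --offset 50 --limit 10
        System.out.println(select(records, 50, 10, 1.0));
    }
}
```

      With a sample rate of 1.0 every record after the offset is kept, so the call in main mirrors the first example and yields records 51 to 60. For rates that are not exact binary fractions (such as .01), floating-point accumulation may shift which record in each group is picked.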
      

      The tool accepts multiple input files or folders. When an input is a folder, every file inside it is used as input.

      java -jar avro-tools.jar cat data_folder output.avro --samplerate .01
      # reads all the files from the data folder and
      # writes every 100th record into the output file.
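
      Expanding a mix of files and folders into a flat list of input files could look roughly like the sketch below. It uses java.nio.file on the local filesystem purely as a stand-in so that it runs without extra dependencies. The names InputExpansionSketch and expandInputs are hypothetical, and both the non-recursive listing and the sorted order are assumptions, not behaviour stated in this issue.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: local-filesystem stand-in for the folder handling
// described in this issue. Not the patch's implementation.
class InputExpansionSketch {

    /** Replaces each folder in `inputs` with the regular files directly inside it. */
    static List<Path> expandInputs(List<Path> inputs) throws IOException {
        List<Path> files = new ArrayList<>();
        for (Path input : inputs) {
            if (Files.isDirectory(input)) {
                List<Path> children = new ArrayList<>();
                try (DirectoryStream<Path> stream = Files.newDirectoryStream(input)) {
                    for (Path child : stream) {
                        if (Files.isRegularFile(child)) { // assumed: subfolders are not descended into
                            children.add(child);
                        }
                    }
                }
                Collections.sort(children); // assumed: deterministic order within a folder
                files.addAll(children);
            } else {
                files.add(input); // plain files are passed through unchanged
            }
        }
        return files;
    }
}
```

      Sorting the folder contents keeps the concatenation order reproducible between runs, which matters when --offset and --limit are applied across several files.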
      

      This tool uses the Hadoop FileSystem API, so it can handle files on any filesystem that Hadoop supports.

      Attachments

        1. AVRO-1307-addedUnitTests-fixed.patch
          24 kB
          Vincenz Priesnitz
        2. AVRO-1307.patch
          9 kB
          Vincenz Priesnitz

          People

            Assignee: Vincenz Priesnitz (vince83)
            Reporter: Vincenz Priesnitz (vince83)