  1. Apache Avro
  2. AVRO-1307

Add an avro-tool to extract samples from avro files


    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7.5
    • Component/s: java
    • Labels:
      None
    • Environment:

      java

    • Release Note:
      New avro-tools command cat that extracts only some of the records from Avro files.

      Description

      It would be nice to have an avro-tools command that extracts only some of the records from Avro files.

      I implemented a new avro-tools command, cat, which takes a list of Avro files with identical schemas and concatenates them into a single file. Options allow discarding the first n records (--offset), limiting the number of records written (--limit), and collecting records at a given sample rate (--samplerate).

      This makes it quicker to peek into large Avro files, e.g.:

      java -jar avro-tools.jar cat input.avro output.avro --offset 50 --limit 10
      # creates output.avro that contains records
      # 51 to 60 from input.avro.
      
      java -jar avro-tools.jar cat input.avro output.avro --offset 1000 --limit 100 --samplerate .01
      # skips the first 1000 records of the input,
      # then samples every hundredth record and limits
      # the output to 100 records.
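
The way the three options combine can be sketched as follows. This is only an illustration under stated assumptions, not the code from the attached patches: the class and method names are hypothetical, the first record after the offset is assumed to be kept, and the sample rate is assumed to be turned into a fixed step of round(1 / samplerate).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not the implementation from the AVRO-1307 patches. */
final class CatSampleSketch {

    /**
     * Skips the first `offset` records, then keeps every step-th record,
     * where step = round(1 / samplerate), until `limit` records are collected.
     * Assumption: the first record after the offset is always kept.
     */
    static <T> List<T> sample(Iterable<T> records, long offset, long limit, double samplerate) {
        if (offset < 0 || limit < 0 || samplerate <= 0.0 || samplerate > 1.0) {
            throw new IllegalArgumentException(
                "offset and limit must be >= 0, and 0 < samplerate <= 1");
        }
        long step = Math.max(1L, Math.round(1.0 / samplerate));
        List<T> out = new ArrayList<>();
        long index = 0; // zero-based position in the concatenated input
        for (T record : records) {
            if (out.size() >= limit) {
                break; // output limit reached, stop reading
            }
            long pos = index++;
            if (pos < offset) {
                continue; // still inside the discarded prefix
            }
            if ((pos - offset) % step == 0) {
                out.add(record); // this record falls on a sampling step
            }
        }
        return out;
    }
}
```

With an offset of 50, a limit of 10 and a sample rate of 1, this sketch returns the 51st to 60th elements of its input, which matches the first example above.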
      

      The tool accepts multiple input files or folders. When a folder is given, all files inside it are used as input.

      java -jar avro-tools.jar cat data_folder output.avro --samplerate .01
      # reads all the files from the data folder and
      # writes every 100th record into the output file.
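
Expanding a mixed list of files and folders into a flat list of input files could look roughly like the sketch below. It is an assumption-laden illustration: it uses java.nio.file instead of the Hadoop FileSystem API so that it runs without extra dependencies, the class and method names are hypothetical, folders are not traversed recursively, and files inside a folder are sorted by name. None of these choices is confirmed by the ticket.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper; the real tool is described as using Hadoop's FileSystem API. */
final class InputExpansionSketch {

    /**
     * Returns the input files in argument order. A folder argument is replaced
     * by the regular files directly inside it (non-recursive, sorted by name).
     */
    static List<Path> expand(List<Path> inputs) {
        List<Path> files = new ArrayList<>();
        for (Path input : inputs) {
            if (Files.isDirectory(input)) {
                // Close the directory stream as soon as the listing is collected.
                try (Stream<Path> children = Files.list(input)) {
                    files.addAll(children
                        .filter(Files::isRegularFile)
                        .sorted()
                        .collect(Collectors.toList()));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            } else {
                files.add(input); // plain file argument, used as-is
            }
        }
        return files;
    }
}
```

The resulting list would then be read file by file and fed into the record selection, so that offset, limit and sample rate apply to the concatenated stream rather than to each file separately (again an assumption about the intended behaviour).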
      

      The tool uses the Hadoop FileSystem API, so it can read and write files on any filesystem that Hadoop supports.

        Attachments

        1. AVRO-1307.patch
          9 kB
          Vincenz Priesnitz
        2. AVRO-1307-addedUnitTests-fixed.patch
          24 kB
          Vincenz Priesnitz


            People

            • Assignee:
              vince83 Vincenz Priesnitz
            • Reporter:
              vince83 Vincenz Priesnitz
