AVRO-1720: Add an avro-tool to count records in an avro file

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10.0
    • Component/s: java

    Description

      If you're dealing with larger Avro files (>100 MB), it would be nice to have a way to quickly count the number of records in the file.

      With the current avro-tools, the only way to do this (to my knowledge) is to dump the data to JSON and count the records. For larger files this can take a while, both because of the serialization overhead and because every record has to be read.

      I added a new tool that is optimized for counting records: it never decodes or serializes the records and reads only the block count (the number of records stored in the block) for each block.
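The block-count idea can be sketched as follows. This is a minimal illustration, not the code from the attached patch: it assumes the object container file layout described in the Avro specification (magic bytes, a metadata map, a 16-byte sync marker, then blocks that each start with a record count and a byte size), and it uses only the Java standard library. The class name `AvroBlockCounter` and its helper methods are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: sums the per-block record counts of an Avro container file.
final class AvroBlockCounter {

    // Reads one Avro long (variable-length, zig-zag encoded).
    static long readLong(InputStream in) throws IOException {
        long raw = 0;
        int shift = 0;
        int b;
        do {
            b = in.read();
            if (b < 0) throw new EOFException("truncated varint");
            if (shift > 63) throw new IOException("varint too long");
            raw |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (raw >>> 1) ^ -(raw & 1); // undo zig-zag encoding
    }

    // Skips exactly n bytes or fails with EOFException.
    static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                if (in.read() < 0) throw new EOFException("unexpected end of file");
                skipped = 1;
            }
            n -= skipped;
        }
    }

    static long countRecords(InputStream source) throws IOException {
        DataInputStream in = new DataInputStream(new BufferedInputStream(source));
        byte[] magic = new byte[4];
        in.readFully(magic);
        if (magic[0] != 'O' || magic[1] != 'b' || magic[2] != 'j' || magic[3] != 1) {
            throw new IOException("not an Avro container file");
        }
        // Skip the metadata map: blocks of key/value pairs, terminated by a zero count.
        for (long n = readLong(in); n != 0; n = readLong(in)) {
            if (n < 0) {          // a negative count is followed by the block's byte size
                n = -n;
                readLong(in);
            }
            for (long i = 0; i < n; i++) {
                skipFully(in, readLong(in)); // key (string)
                skipFully(in, readLong(in)); // value (bytes)
            }
        }
        skipFully(in, 16); // header sync marker

        long total = 0;
        while (true) {
            in.mark(1);
            if (in.read() < 0) break;        // clean end of file at a block boundary
            in.reset();
            total += readLong(in);           // number of records in this block
            skipFully(in, readLong(in) + 16); // skip the block data and its sync marker
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(countRecords(in));
        }
    }
}
```

Because the byte size in each block header is the size after the codec has been applied, the sketch can skip compressed blocks without decompressing them. It could be run with `java AvroBlockCounter.java sample.avro`.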

      Naive benchmark
      # the input file had a size of ~300MB
      $ du -sh sample.avro 
      323M    sample.avro
      
      # using the new count tool
      $ time java -jar avro-tools.jar count sample.avro
      331439
      
      real    0m4.670s
      user    0m6.167s
      sys     0m0.513s
      
      # the current way of counting records
      $ time java -jar avro-tools.jar tojson sample.avro | wc
      331439 54904484 1838231743
      
      real    0m52.760s
      user    1m42.317s
      sys     0m3.209s
      
      # the overhead of wc is rather minor
      $ time java -jar avro-tools.jar tojson sample.avro > /dev/null
      
      real    0m47.834s
      user    0m53.317s
      sys     0m1.194s
      

      This tool uses the HDFS API to handle files from any supported filesystem. I added the unit tests to the existing TestDataFileTools, since it provides convenient utility functions that I could reuse for my test scenarios.

      Attachments

        1. AVRO-1720.patch
          5 kB
          Janosch Woschitz
        2. AVRO-1720-with-extended-unittests.patch
          7 kB
          Janosch Woschitz

            People

              Assignee: Unassigned
              Reporter: Janosch Woschitz (jwoschitz)
              Votes: 5
              Watchers: 12
