  1. Apache Avro
  2. AVRO-1720

Add an avro-tool to count records in an avro file

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10.0
    • Component/s: java
    • Labels:

      Description

      When dealing with larger Avro files (>100 MB), it would be nice to have a way to quickly count the number of records contained in the file.

      With the current state of avro-tools, the only way to achieve this (to my knowledge) is to dump the data to JSON and count the resulting records. For larger files this can take a while, both because of the serialization overhead and because every record has to be read.

      I added a new tool that is optimized for counting records: it does not deserialize the records and reads only the record count stored in the header of each block.
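
      To illustrate why reading only the per-block counts is cheap, here is a minimal sketch based on the object container file layout in the Avro specification (magic bytes, metadata map, 16-byte sync marker, then blocks that each start with an object count and a byte size). It is not the code from the attached patch and does not use the Avro library; the class name AvroBlockCounter and its methods are hypothetical, chosen only for this example.

```java
import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of avro-tools: sums the per-block object
// counts of an Avro object container file without decoding any record.
final class AvroBlockCounter {
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};
    private static final int SYNC_SIZE = 16;

    // Decodes one Avro long (zigzag varint); 'first' is its already-read first byte.
    private static long readLong(InputStream in, int first) throws IOException {
        long raw = 0;
        int shift = 0;
        int b = first;
        while (true) {
            if (b < 0) throw new EOFException("truncated varint");
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
            if (shift > 63) throw new IOException("varint too long");
            b = in.read();
        }
        return (raw >>> 1) ^ -(raw & 1); // undo zigzag encoding
    }

    // Skips exactly n bytes or fails if the stream ends first.
    private static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                if (in.read() < 0) throw new EOFException("truncated file");
                skipped = 1;
            }
            n -= skipped;
        }
    }

    static long countRecords(InputStream in) throws IOException {
        for (byte m : MAGIC) {
            if (in.read() != m) throw new IOException("not an Avro container file");
        }
        // File metadata is an Avro map<bytes>: a series of map blocks ending with a 0 count.
        while (true) {
            long entries = readLong(in, in.read());
            if (entries == 0) break;
            if (entries < 0) {
                entries = -entries;
                readLong(in, in.read()); // byte size of this map block, not needed here
            }
            for (long i = 0; i < entries; i++) {
                skipFully(in, readLong(in, in.read())); // key bytes
                skipFully(in, readLong(in, in.read())); // value bytes
            }
        }
        skipFully(in, SYNC_SIZE); // sync marker that closes the header

        long total = 0;
        int first;
        while ((first = in.read()) >= 0) {
            long count = readLong(in, first);     // objects in this block
            long size = readLong(in, in.read());  // serialized size in bytes
            if (size < 0) throw new IOException("negative block size");
            skipFully(in, size + SYNC_SIZE);      // jump over data and sync marker
            total += count;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            System.out.println(countRecords(in));
        }
    }
}
```

      Because the block payload is skipped rather than decoded, the work is roughly proportional to the number of blocks plus the I/O needed to move past them, not to the number or complexity of the records. The sketch also never decompresses a block, since the object count sits outside the compressed payload.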

      Naive benchmark
      # the input file had a size of ~300MB
      $ du -sh sample.avro 
      323M    sample.avro
      
      # using the new count tool
      $ time java -jar avro-tools.jar count sample.avro
      331439
      
      real    0m4.670s
      user    0m6.167s
      sys     0m0.513s
      
      # the current way of counting records
      $ time java -jar avro-tools.jar tojson sample.avro | wc
      331439 54904484 1838231743
      
      real    0m52.760s
      user    1m42.317s
      sys     0m3.209s
      
      # the overhead of wc is rather minor
      $ time java -jar avro-tools.jar tojson sample.avro > /dev/null
      
      real    0m47.834s
      user    0m53.317s
      sys     0m1.194s
      

      This tool uses the HDFS API to handle files from any supported filesystem. I added the unit tests to the existing TestDataFileTools, since it provides convenient utility functions that I could reuse for my test scenarios.

        Attachments

        1. AVRO-1720.patch
          5 kB
          Janosch Woschitz
        2. AVRO-1720-with-extended-unittests.patch
          7 kB
          Janosch Woschitz

              People

              • Assignee: Unassigned
              • Reporter: Janosch Woschitz (jwoschitz)
              • Votes: 5
              • Watchers: 12