Beam / BEAM-3484

HadoopInputFormatIO reads big datasets incorrectly

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P3
    • Resolution: Fixed
    • Affects Version/s: 2.3.0, 2.4.0
    • Fix Version/s: 2.5.0
    • Component/s: io-java-hadoop-format
    • Labels: None

    Description

      For big datasets, HadoopInputFormat sometimes skips or duplicates elements from the database in the resulting PCollection, yielding an incorrect read result.

      I encountered this while developing HadoopInputFormatIOIT and running it on Dataflow. For datasets of 600 000 database rows or fewer I was not able to reproduce the issue; the bug appeared only for bigger sets, e.g. 700 000 or 1 000 000.

      Attachments:
       - text file with the sorted HadoopInputFormat.read() result saved using TextIO.write().to().withoutSharding(). If you look carefully you'll notice duplicates or missing values that should not be there
       - the same text file for 600 000 records, with no duplicates or missing elements
       - link to a PR with a HadoopInputFormatIO integration test that allows reproducing this issue. At the time of writing, this code is not yet merged.
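      The inspection described above (scanning a sorted dump of row IDs for duplicates and gaps) can be sketched as a small standalone script. This is a hypothetical helper, not part of the Beam codebase or the attached test; it assumes the dataset should contain exactly the contiguous IDs 0..expected_count-1:

      ```python
      from collections import Counter

      def find_anomalies(ids, expected_count):
          """Report duplicated and missing row IDs, assuming the dataset
          should contain exactly the IDs 0..expected_count-1."""
          counts = Counter(ids)
          duplicates = sorted(i for i, c in counts.items() if c > 1)
          missing = sorted(set(range(expected_count)) - counts.keys())
          return duplicates, missing

      # Example: ID 2 appears twice, ID 4 was skipped.
      dups, miss = find_anomalies([0, 1, 2, 2, 3, 5], 6)
      print(dups)   # [2]
      print(miss)   # [4]
      ```

      A correct read of the 600 000-row dataset would report empty lists for both; the attached 1 000 000-row result would not.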

      Attachments

        1. result_sorted600000
          7.90 MB
          Lukasz Gajowy
        2. result_sorted1000000
          13.25 MB
          Lukasz Gajowy


          People

            Assignee: Alexey Romanenko
            Reporter: Lukasz Gajowy
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 2h 20m