Uploaded image for project: 'Beam'
  1. Beam
  2. BEAM-3945

TFRecord Performance Tests doesn't work on hdfs

Details

    • Bug
    • Status: Open
    • P3
    • Resolution: Unresolved
    • None
    • None
    • io-java-tfrecord, testing
    • None

    Description

      TFRecord have issue reading files from hdfs using filename pattern "hdfs://...*"

      TFRecordIO.read().from(filenamePattern).withCompression(AUTO)

      link to github this is a blocker for running full set of filebased io tests on hdfs.

      Steps to reproduce:
      1. Create remote hadoop environment. This step asume you have local kubectl tool configured to use your GCP project.

      pushd .test-infra/kubernetes/hadoop/SmallITCluster/ && /bin/bash ./setup-all.sh && popd
      

      2. Update /etc/hosts file with the provided output from sctipt.
      3. Confirm that it works and hadoop web interface is accessible on http://hadoop-xxxxx:50070 where xxxxx is added in step2 sequence from your /etc/hosts entry. Please also substitute xxxxx in further usages of this.
      4. Tell runner to use root as hadoop user.

      export HADOOP_USER_NAME=root
      

      5. Run TFRecord tests on this environment using DirectRunner:

      mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests/ -Dit.test=org.apache.beam.sdk.io.tfrecord.TFRecordIOIT -Dfilesystem=hdfs -DintegrationTestPipelineOptions='["--filenamePrefix=hdfs://hadoop-xxxxx:9000/TFRecord", "--hdfsConfiguration=[{\"fs.defaultFS\" : \"hdfs://hadoop-xxxxx:9000\", \"dfs.replication\": 1, \"dfs.client.use.datanode.hostname\":\"true\"}]" ]' -DforceDirectRunner=true 
      

      The error message is:

      [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 78.055 s <<< FAILURE! - in org.apache.beam.sdk.io.tfrecord.TFRecordIOIT
      [ERROR] writeThenReadAll(org.apache.beam.sdk.io.tfrecord.TFRecordIOIT)  Time elapsed: 78.055 s  <<< ERROR!
      java.lang.IllegalStateException: Invalid data
              at org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:444)
              at org.apache.beam.sdk.io.TFRecordIO$TFRecordCodec.read(TFRecordIO.java:642)
              at org.apache.beam.sdk.io.TFRecordIO$TFRecordSource$TFRecordReader.readNextRecord(TFRecordIO.java:526)
              at org.apache.beam.sdk.io.CompressedSource$CompressedReader.readNextRecord(CompressedSource.java:426)
              at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.advanceImpl(FileBasedSource.java:473)
              at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.advance(OffsetBasedSource.java:267)
              at org.apache.beam.runners.direct.BoundedReadEvaluatorFactory$BoundedReadEvaluator.processElement(BoundedReadEvaluatorFactory.java:148)
              at org.apache.beam.runners.direct.DirectTransformExecutor.processElements(DirectTransformExecutor.java:161)
              at org.apache.beam.runners.direct.DirectTransformExecutor.run(DirectTransformExecutor.java:125)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      
      

      This results were also observed when running tests on jenkins. Link to jenkins build

      When you open http://hadoop-xxxxx:50070/explorer.html#/ you will see TFRecord files that were created during write phase. Unable to be processed in reading phase.

      Important note: if I copy files made by writing pipeline from hdfs directory to local directory and run reading pipeline over them, everything is working fine, so only reading from hdfs is a problem.

      You can wipe out hdfs environment by runnning:

      pushd .test-infra/kubernetes/hadoop/SmallITCluster/ && /bin/bash ./teardown-all.sh && popd
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            szewinho Kamil Szewczyk
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: