  1. Samza
  2. SAMZA-968

SequenceFileHdfsFileWriter does not close file properly

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.10.0, 0.10.1
    • Fix Version/s: 0.10.1
    • Component/s: container
    • Labels: None
    • Patch

    Description

      From dev@samza.apache.org:

      Hi, Benjamin,

      Thanks a lot for reporting this! It makes sense from reading the posts.
      Could you open a JIRA? Are you interested in assigning it to yourself and
      contributing the fix?

      Thanks a lot again!

      -Yi

      > Hello,
      >
      > I am working on a project where we are integrating Samza and Hive. As part
      > of this project, we ran into an issue where sequence files written from
      > Samza were taking a long time (hours) to completely sync with HDFS.
      >
      > After some Googling and digging into the code, it appears that the issue
      > is here:
      >
      > https://github.com/apache/samza/blob/master/samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/writer/SequenceFileHdfsWriter.scala#L111
      >
      > Writer.stream(dfs.create(path)) implies that the caller of
      > dfs.create(path) is responsible for closing the created stream explicitly.
      > This doesn't happen, and the SequenceFileHdfsWriter call to close will only
      > flush the stream.
      >
      > I believe the correct line should be:
      >
      > Writer.file(path)
      >
      > Or, SequenceFileHdfsWriter should explicitly track and close the stream.
      >
      > Thanks!
      >
      > Ben
      >
      > Reference material:
      >
      > http://stackoverflow.com/questions/27916872/why-the-sequencefile-is-truncated
      >
      > https://apache.googlesource.com/hadoop-common/+/HADOOP-6685/src/java/org/apache/hadoop/io/SequenceFile.java#1238
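
The two remedies proposed in the report can be sketched roughly as follows. This is a minimal illustration against the Hadoop SequenceFile API, not the contents of SAMZA-968.patch. The object and method names (`SequenceFileCloseSketch`, `writeWithFileOption`, `writeWithTrackedStream`) and the use of `Text` keys and values are assumptions made for the example.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}
import org.apache.hadoop.io.{SequenceFile, Text}
import org.apache.hadoop.io.SequenceFile.Writer

// Hypothetical helper object, for illustration only.
object SequenceFileCloseSketch {

  // Remedy 1: pass the path with Writer.file, so the writer creates the
  // output stream itself and closes it when writer.close() is called.
  def writeWithFileOption(conf: Configuration, path: Path,
                          key: String, value: String): Unit = {
    val writer = SequenceFile.createWriter(
      conf,
      Writer.file(path),
      Writer.keyClass(classOf[Text]),
      Writer.valueClass(classOf[Text]))
    try writer.append(new Text(key), new Text(value))
    finally writer.close()
  }

  // Remedy 2: keep Writer.stream, but track the stream and close it
  // explicitly, since the caller owns a stream it created.
  def writeWithTrackedStream(conf: Configuration, fs: FileSystem, path: Path,
                             key: String, value: String): Unit = {
    val stream: FSDataOutputStream = fs.create(path)
    try {
      val writer = SequenceFile.createWriter(
        conf,
        Writer.stream(stream),
        Writer.keyClass(classOf[Text]),
        Writer.valueClass(classOf[Text]))
      try writer.append(new Text(key), new Text(value))
      finally writer.close() // per the report, this only flushes a supplied stream
    } finally {
      stream.close() // the caller closes the stream it created
    }
  }
}
```

The first form is the smaller change, because stream ownership stays inside the writer. The second is only worth the extra bookkeeping when the caller needs the stream for something else.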

      Attachments

        1. SAMZA-968.patch
          0.8 kB
          Benjamin Smith

          People

            Assignee: MLBenjii Benjamin Smith
            Reporter: MLBenjii Benjamin Smith
            Votes: 0
            Watchers: 4

              Time Tracking

                Estimated: 24h
                Remaining: 24h
                Logged: Not Specified