SQOOP-671: Mapreduce counters are not used in generated mapreduce jobs

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 1.99.1
    • Component/s: None
    • Labels: None

      Description

      As we're using threads to pass data instead of the Hadoop-native way, we're losing some counters (bytes written, number of entries) that might be of interest to the end user. We should propagate those counters ourselves.

      1. SQOOP-671-6.patch
        6 kB
        Hari Shreedharan
      2. SQOOP-671-5.patch
        15 kB
        Hari Shreedharan
      3. SQOOP-671-4.patch
        14 kB
        Hari Shreedharan
      4. SQOOP-671-3.patch
        12 kB
        Hari Shreedharan
      5. SQOOP-671-2.patch
        12 kB
        Hari Shreedharan
      6. SQOOP-671-1.patch
        10 kB
        Hari Shreedharan

          Activity

          Hari Shreedharan added a comment -

          I don't understand losing "some counters." As far as I can see, the counters are not being pulled in anywhere. This is not related to the way the job is submitted, right? It is just that we don't really wait for completion - and the querying needs to happen when a request is made from the client, so when the stats call is made, we query the MR job to get the stats.
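          The query-on-demand approach described above could be sketched as follows, assuming Hadoop 2's Cluster API; the class name and job-id handling here are illustrative, not the actual Sqoop 2 code:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;

// Illustrative: look the MR job up on demand when the client asks for
// stats, rather than blocking on job completion inside the server.
public class SubmissionStats {
  public static Counters countersFor(String jobIdText)
      throws IOException, InterruptedException {
    Cluster cluster = new Cluster(new Configuration());
    // May return null if the job is no longer known to the cluster.
    Job job = cluster.getJob(JobID.forName(jobIdText));
    return (job == null) ? null : job.getCounters();
  }
}
```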

          Jarek Jarcec Cecho added a comment -

          Hi Hari,
          thank you for your question. Please accept my apology for not being descriptive enough in the first place. We're currently overriding Mapper.run() and bypassing the usual record handling through context.write(). As a result, we're missing the mapreduce counters that are generated during the default process, but other counters should be intact (like the number of spawned mappers, the number of spawned reducers, ...). This means that we're currently not able to tell how many records (rows) we've imported or how many bytes we've transferred. That's what I meant by "some counters are lost" - it should have been "some default mapreduce counters are lost".

          The reason for that is the current implementation of the mapreduce execution (not submission) engine, and I believe that it needs to be fixed there. It's completely independent of how the job is submitted to the cluster (and thus independent of the submission engine). What's more, there is already a callback in the submission engine that queries counters after a given submission finishes, but it always returns null at the moment. Please note that this "second" issue is covered by SQOOP-678.

          Jarcec
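          A minimal sketch of the fix described above - a Mapper.run() override that bypasses context.write() but still propagates what the default write path would have counted. The class and counter names are illustrative, not the ones used in the actual patch:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: a Mapper that overrides run() and feeds records to a
// worker thread instead of calling context.write(). Because the default
// counters are updated inside the normal write path, this override must
// increment its own counters.
public class SqoopStyleMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  // Hypothetical counter group/names, for illustration only.
  static final String GROUP = "org.apache.sqoop";
  static final String ROWS_READ = "ROWS_READ";
  static final String BYTES_WRITTEN = "BYTES_WRITTEN";

  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
      Text row = context.getCurrentValue();
      // ... hand the row to the extractor/loader thread here ...
      // Propagate what the bypassed write path would have counted:
      context.getCounter(GROUP, ROWS_READ).increment(1);
      context.getCounter(GROUP, BYTES_WRITTEN).increment(row.getLength());
    }
    cleanup(context);
  }
}
```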

          Hari Shreedharan added a comment -

          Thanks Jarcec. Submitted a first cut patch to do this.

          Hari Shreedharan added a comment -

          I am not entirely happy with the way we are handling writes to the FS. There is too much uncertainty in the code - especially in the SqoopOutputFormatLoadExecutor class. For now, this patch provides the byte count when the format() method is called - assuming that at that point the data has been "transferred." This patch, like everything else in Sqoop 2 currently, supports only CSV. I intend to update the way the data is written out in a future patch: SQOOP-691.

          Hari Shreedharan added a comment -

          It seems like Hadoop 1 does not have a way of exposing counters from the output format classes, though Hadoop 2 does have this functionality. So for now, I am removing the bytes-transferred part and leaving just rows read. We will revisit bytes transferred later, when it is possible to update counters from the output format class. Hope this makes sense. Thanks.
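          For reference, the Hadoop 2 capability in question could look like the following: the TaskAttemptContext handed to an OutputFormat's RecordWriter exposes getCounter(), so bytes can be counted at write time. This is an editor's sketch with hypothetical counter names, not code from the patch:

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Illustrative RecordWriter for the Hadoop 2 mapreduce API, where the
// TaskAttemptContext passed to OutputFormat.getRecordWriter() exposes
// getCounter(). Under Hadoop 1 this hook is unavailable, which is why
// the bytes-transferred counter was dropped for now.
public class CountingRecordWriter extends RecordWriter<Text, NullWritable> {
  private final TaskAttemptContext context;

  public CountingRecordWriter(TaskAttemptContext context) {
    this.context = context;
  }

  @Override
  public void write(Text value, NullWritable ignored) throws IOException {
    // ... actually write the CSV line to the FS here ...
    // Hypothetical counter group/name, for illustration only:
    context.getCounter("org.apache.sqoop", "BYTES_WRITTEN")
           .increment(value.getLength());
  }

  @Override
  public void close(TaskAttemptContext context) {
    // flush/close output streams here
  }
}
```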

          Hari Shreedharan added a comment -

          Latest patch from review board.

          Jarek Jarcec Cecho added a comment -

          Hi Hari,
          I agree with your conclusion and I've committed your last patch: https://git-wip-us.apache.org/repos/asf?p=sqoop.git;a=commit;h=2c9a4eb46c8e51834be946439b40e3116203581a

          Thank you very much for your contribution!

          jarcec


            People

            • Assignee: Hari Shreedharan
            • Reporter: Jarek Jarcec Cecho
            • Votes: 0
            • Watchers: 2
