SQOOP-777: Sqoop2: Implement intermediate data format representation policy

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: None
    • Labels: None

      Description

      We should enforce our intermediate data format policy, because currently each driver can represent data differently, and that might break things.

      1. SQOOP-777.patch
        81 kB
        Hari Shreedharan

          Activity

          Hari Shreedharan added a comment -

          I am interested in taking this up. This would be an interesting project. Let's start with a discussion, and then I will follow up with a specification based on consensus.

          Hari Shreedharan added a comment -

          I am inclined to use Avro (or a thin wrapper over Avro) for this. I am not really a fan of the Object-based format we are currently using - I think we need to enforce using a schema, and something like Avro makes this quite flexible.

          Jarek Jarcec Cecho added a comment -

          Hi Hari,
          thank you very much for taking over this issue. We discussed the intermediate format at length on our earlier design conference calls (meeting minutes are available on the wiki). Avro was one of the possibilities that we explored. The outcome of those discussions was to use a text format very close to what mysqldump and pg_dump produce, so that Sqoop2 performance can be comparable to those tools. The reasoning is that we did not want to force stream-oriented connectors to fully parse all incoming data into Avro format as it goes through the framework, but rather to let them "decorate" the stream as it passes.

          Jarcec
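
For readers following the thread, the idea of a dump-like text record can be sketched roughly as below. This is only an illustration under stated assumptions: the function names `encode_record` and `decode_record`, the comma delimiter, the single-quote string quoting, and the backslash escapes are all hypothetical choices made for this example and are not taken from Sqoop2, mysqldump, or pg_dump. The sketch handles only integers, strings, and NULL.

```python
def encode_record(fields):
    """Render one row as a single text line (hypothetical format).

    Integers are written as-is, strings are single-quoted with backslash
    escapes, and None is written as the bare token NULL.
    """
    parts = []
    for value in fields:
        if value is None:
            parts.append("NULL")
        elif isinstance(value, str):
            escaped = (value.replace("\\", "\\\\")
                            .replace("'", "\\'")
                            .replace("\n", "\\n"))
            parts.append("'" + escaped + "'")
        else:
            parts.append(str(value))
    return ",".join(parts)


def decode_record(line):
    """Parse a line produced by encode_record back into Python values."""
    fields, i, n = [], 0, len(line)
    while i <= n:
        if i < n and line[i] == "'":
            # Quoted string: read up to the next unescaped quote.
            i += 1
            chars = []
            while line[i] != "'":
                if line[i] == "\\":
                    i += 1
                    chars.append({"n": "\n"}.get(line[i], line[i]))
                else:
                    chars.append(line[i])
                i += 1
            i += 1  # step over the closing quote
            fields.append("".join(chars))
        else:
            # Unquoted token: either NULL or an integer.
            j = line.find(",", i)
            if j == -1:
                j = n
            token = line[i:j]
            fields.append(None if token == "NULL" else int(token))
            i = j
        i += 1  # step over the field separator
    return fields


line = encode_record([1, "O'Brien", None])
print(line)
print(decode_record(line))
```

A connector that already receives rows as text in this shape could forward each line unchanged and never call `decode_record`, which is the cost the comment above is trying to avoid; only a consumer that needs typed values would pay for parsing.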

          Hari Shreedharan added a comment -

          Hi Jarcec,

          I am sorry, I was not part of the project at that time, so I don't have much background on the discussion. But I definitely do not agree that text is a good intermediate format.

          I am not sure why we should be comparing against mysqldump or pg_dump, or whether their performance is due to their format. Since we are primarily interested in reading directly from the db (rather than the dumps), I don't really understand why text would perform better than a binary format like Avro.

          Also, by using text, it becomes complex to encode field names and schemas (other than by forcing a JSON-like schema or having header-like structures).

          I might be wrong on multiple fronts here, but text is inherently expensive anyway - so I don't see much benefit in that either.
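
As an illustration of the "header-like structure" mentioned above, the following sketch puts a JSON schema on the first line of a text stream and uses it to name and convert the fields of the rows that follow. Everything here is an assumption made for the example: the names `write_stream`, `read_stream`, and `CONVERTERS`, the tab delimiter, and the two type names are hypothetical and do not describe Sqoop2 or Avro. Escaping of tabs and newlines inside values is deliberately left out, which is part of the complexity the comment points at.

```python
import json

# Hypothetical mapping from schema type names to Python converters.
CONVERTERS = {"int": int, "string": str}


def write_stream(columns, rows):
    """Yield a JSON header line, then one tab-separated line per row.

    columns is a list of (name, type) pairs; rows is a list of value lists.
    """
    header = {"columns": [{"name": n, "type": t} for n, t in columns]}
    yield json.dumps(header)
    for row in rows:
        yield "\t".join(str(value) for value in row)


def read_stream(lines):
    """Read the header, then yield each row as a dict of typed values."""
    lines = iter(lines)
    schema = json.loads(next(lines))["columns"]
    for line in lines:
        values = line.split("\t")
        yield {
            col["name"]: CONVERTERS[col["type"]](value)
            for col, value in zip(schema, values)
        }


stream = list(write_stream([("id", "int"), ("name", "string")],
                           [[1, "ann"], [2, "bob"]]))
for record in read_stream(stream):
    print(record)
```

The header keeps field names out of every row, but the reader still has to parse each value from text, whereas a schema-based binary format carries typed values directly.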

          Sqoop QA bot added a comment -

          Here are the results of testing the latest attachment
          https://issues.apache.org/jira/secure/attachment/12594093/SQOOP-777.patch against branch sqoop2.

          Overall: +1 all checks pass

          SUCCESS: Clean was successful
          SUCCESS: Patch applied correctly
          SUCCESS: Patch compiled
          SUCCESS: All tests passed

          Console output: https://builds.apache.org/job/PreCommit-SQOOP-Build/92/console

          This message is automatically generated.

          Hari Shreedharan added a comment -

          I will get to the next version of this patch in a few days. As a result, I think it is better to move it out of the current release. (Not changing anything in the JIRA here, since it is already set to 2.0.0.)

          Jarek Jarcec Cecho added a comment -

          I'm canceling the patch as we are pending upload of an updated patch. Hari Shreedharan, please don't hesitate to set the status back to "Patch available" after uploading the updated version!

          Abraham Elmahrek added a comment -

          Hari, can I continue the excellent work you've done here?

          Hari Shreedharan added a comment -

          Abraham Elmahrek - Please do, I just did not get the time to come back to this one.


            People

            • Assignee: Abraham Elmahrek
            • Reporter: Jarek Jarcec Cecho
            • Votes: 0
            • Watchers: 7
