Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: from/to
    • Component/s: None
    • Labels:
      None

      Description

      Relational database systems, hierarchical databases, etc. tend to have a well-defined schema. Key-value DBs, BigTable clones, etc. tend to have weakly defined schemas. In fact, a key-value datastore may not have any kind of schema (other than the fact that it is key-value).

      Schemas seem like they are local to the connector and should not be needed by the framework. Or, there should be a common Schema format that every connector knows how to decipher.

      1. SQOOP-1378.5.patch
        53 kB
        Gwen Shapira
      2. SQOOP-1378.4.patch
        54 kB
        Gwen Shapira
      3. SQOOP-1378.3.patch
        52 kB
        Gwen Shapira
      4. SQOOP-1378.2.patch
        63 kB
        Gwen Shapira
      5. SQOOP-1378.1.patch
        52 kB
        Gwen Shapira
      6. SQOOP-1378.0.patch
        28 kB
        Gwen Shapira

        Issue Links

          Activity

          Hudson added a comment -

          FAILURE: Integrated in Sqoop2-hadoop100 #613 (See https://builds.apache.org/job/Sqoop2-hadoop100/613/)
          SQOOP-1378: Sqoop2: From/To: Refactor schema (abraham: https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=2c20d920f4ab31ae97ba57952d17677146069b5c)

          • common/src/main/java/org/apache/sqoop/schema/SchemaError.java
          • connector/connector-hdfs/src/test/java/org/apache/sqoop/connector/hdfs/TestExtractor.java
          • connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsInitializer.java
          • common/src/main/java/org/apache/sqoop/json/SubmissionBean.java
          • connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/matcher/NameMatcher.java
          • docs/src/site/sphinx/Tools.rst
          • connector/connector-generic-jdbc/src/test/java/org/apache/sqoop/connector/jdbc/TestExtractor.java
          • execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopOutputFormatLoadExecutor.java
          • connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/CSVIntermediateDataFormat.java
          • execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/ConfigurationUtils.java
          • common/src/main/java/org/apache/sqoop/schema/Schema.java
          • common/src/main/java/org/apache/sqoop/schema/SchemaMatchOption.java
          • common/src/main/java/org/apache/sqoop/model/MSubmission.java
          • shell/src/main/java/org/apache/sqoop/shell/utils/SubmissionDisplayer.java
          • connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsConnectorError.java
          • connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/matcher/LocationMatcher.java
          • core/src/main/java/org/apache/sqoop/driver/JobManager.java
          • connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/matcher/AbstractMatcher.java
          • common/src/main/java/org/apache/sqoop/json/util/SchemaSerialization.java
          • shell/src/main/resources/shell-resource.properties
          • connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/IntermediateDataFormat.java
          • common/src/main/java/org/apache/sqoop/schema/type/Column.java
          • common/src/test/java/org/apache/sqoop/json/util/TestSchemaSerialization.java
          • common/src/main/java/org/apache/sqoop/job/etl/ExtractorContext.java
          • connector/connector-sdk/src/test/java/org/apache/sqoop/connector/idf/TestCSVIntermediateDataFormat.java
          • execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopMapper.java
          • submission/mapreduce/src/main/java/org/apache/sqoop/submission/mapreduce/MapreduceSubmissionEngine.java
          • connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsConnector.java
          • common/src/main/java/org/apache/sqoop/json/SchemaBean.java
          Hudson added a comment -

          ABORTED: Integrated in Sqoop2-hadoop200 #547 (See https://builds.apache.org/job/Sqoop2-hadoop200/547/)
          SQOOP-1378: Sqoop2: From/To: Refactor schema (abraham: https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=2c20d920f4ab31ae97ba57952d17677146069b5c)

          • shell/src/main/java/org/apache/sqoop/shell/utils/SubmissionDisplayer.java
          • core/src/main/java/org/apache/sqoop/driver/JobManager.java
          • connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsConnectorError.java
          • connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsConnector.java
          • connector/connector-hdfs/src/test/java/org/apache/sqoop/connector/hdfs/TestExtractor.java
          • common/src/main/java/org/apache/sqoop/job/etl/ExtractorContext.java
          • common/src/main/java/org/apache/sqoop/schema/SchemaMatchOption.java
          • shell/src/main/resources/shell-resource.properties
          • submission/mapreduce/src/main/java/org/apache/sqoop/submission/mapreduce/MapreduceSubmissionEngine.java
          • connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/CSVIntermediateDataFormat.java
          • execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopMapper.java
          • common/src/main/java/org/apache/sqoop/schema/type/Column.java
          • connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsInitializer.java
          • common/src/main/java/org/apache/sqoop/json/SchemaBean.java
          • execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/ConfigurationUtils.java
          • execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopOutputFormatLoadExecutor.java
          • common/src/main/java/org/apache/sqoop/schema/Schema.java
          • common/src/main/java/org/apache/sqoop/json/SubmissionBean.java
          • connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/IntermediateDataFormat.java
          • connector/connector-generic-jdbc/src/test/java/org/apache/sqoop/connector/jdbc/TestExtractor.java
          • docs/src/site/sphinx/Tools.rst
          • common/src/test/java/org/apache/sqoop/json/util/TestSchemaSerialization.java
          • connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/matcher/NameMatcher.java
          • common/src/main/java/org/apache/sqoop/schema/SchemaError.java
          • common/src/main/java/org/apache/sqoop/json/util/SchemaSerialization.java
          • connector/connector-sdk/src/test/java/org/apache/sqoop/connector/idf/TestCSVIntermediateDataFormat.java
          • common/src/main/java/org/apache/sqoop/model/MSubmission.java
          • connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/matcher/LocationMatcher.java
          • connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/matcher/AbstractMatcher.java
          Abraham Elmahrek added a comment -

          +1. Thanks Gwen Shapira!

          ASF subversion and git services added a comment -

          Commit 327d4372fedaaa0961c9ba3cb9e3af2c1d9d4bb5 in sqoop's branch refs/heads/SQOOP-1367 from Gwen Shapira
          [ https://git-wip-us.apache.org/repos/asf?p=sqoop.git;h=327d437 ]

          SQOOP-1378: Sqoop2: From/To: Refactor schema

          This patch also changes the tools documentation.

          Gwen Shapira added a comment -

          New patch addressing Abe's last review.

          Gwen Shapira added a comment -

          Uploaded a patch with some changes based on Veena's feedback in RB.

          Gwen Shapira added a comment -
          • Hid the schema matching logic from connectors. This is completely internal to the IDF for now.
          • Rebased on the latest 1367 branch.
          Gwen Shapira added a comment -

          Tested, has unit tests and merged with latest in branch.

          I don't have additional work planned on this, so please review.

          Gwen Shapira added a comment -

          Working version. Tested manually, LOCATION matching only.
          No unit tests yet, and it doesn't address Abe's concerns from RB.

          Gwen Shapira added a comment -

          Planned testing:

          DB->DB
          DB->HDFS
          HDFS->DB
          HDFS->HDFS (should fail - there's no schema)
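
The expected outcome of the four cases above can be sketched as a simple rule: a job is resolvable only if at least one side has a schema. This is a minimal illustration, assuming a DB endpoint always supplies a schema and an HDFS endpoint never does; the class and method names are hypothetical and are not taken from the patch.

```java
/** Hypothetical sketch; names are illustrative, not from the Sqoop code base. */
final class SchemaAvailabilityCheck {

    /** A job can be resolved only if at least one side describes its columns. */
    static boolean canResolve(boolean fromHasSchema, boolean toHasSchema) {
        return fromHasSchema || toHasSchema;
    }

    public static void main(String[] args) {
        // Assumption: DB endpoints have a schema, HDFS endpoints do not.
        boolean db = true;
        boolean hdfs = false;
        System.out.println("DB->DB     resolvable: " + canResolve(db, db));
        System.out.println("DB->HDFS   resolvable: " + canResolve(db, hdfs));
        System.out.println("HDFS->DB   resolvable: " + canResolve(hdfs, db));
        System.out.println("HDFS->HDFS resolvable: " + canResolve(hdfs, hdfs));
    }
}
```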

          Gwen Shapira added a comment -

          Untested. Just to give everyone an idea of what my refactored solution looks like.

          It's fairly extensible (you can add many methods of resolving schemas), but currently I'm only adding "by location" (for writing to CSVs) and "by name" to translate between DB tables.

          It's also pretty tightly coupled with our use of CSV for the intermediate data format. We can change it later, but I don't see this as a priority.

          Oh, and I removed a bunch of unused "getSchema" APIs. We no longer have a single schema, and since they were unused, I couldn't figure out which schema they referred to.
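
The two matching strategies described above could look roughly like the following. This is a sketch under stated assumptions, not the code in the attached patch: `SchemaMatchSketch`, `matchByLocation`, and `matchByName` are hypothetical names, schemas are reduced to lists of column names, and filling an unmatched target column with `null` is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; these are not the matcher classes from the patch. */
final class SchemaMatchSketch {

    /** LOCATION matching: the i-th source value goes to the i-th target column. */
    static List<Object> matchByLocation(List<Object> row, List<String> toColumns) {
        if (row.size() != toColumns.size()) {
            throw new IllegalArgumentException(
                "Row has " + row.size() + " values but target has "
                    + toColumns.size() + " columns");
        }
        return new ArrayList<>(row);
    }

    /**
     * NAME matching: each target column takes the value of the source column
     * with the same name. Assumption: unmatched target columns become null.
     */
    static List<Object> matchByName(List<String> fromColumns, List<Object> row,
                                    List<String> toColumns) {
        List<Object> out = new ArrayList<>(toColumns.size());
        for (String name : toColumns) {
            int i = fromColumns.indexOf(name);
            out.add(i < 0 ? null : row.get(i));
        }
        return out;
    }
}
```

Matching by name tolerates reordered columns between two tables, while matching by location needs no column names at all, which is why it suits a CSV target.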

          Abraham Elmahrek added a comment -

          I'm noticing that the HBase connector could benefit from its own Schema class. A couple more requests regarding the Schema class:

          • Make Schema either an abstract class with only methods or an interface. An abstract base class (ABC) is preferred for the backwards-compatibility story, I would imagine.
          • Create a GenericSchema class that implements the new Schema interface or ABC.
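
A rough sketch of the two requests above, assuming schemas are reduced to a name plus a list of column names. `SchemaSketch` and `GenericSchemaSketch` are hypothetical stand-ins, not the actual `org.apache.sqoop.schema.Schema` API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical abstract base class; connectors would subclass it. */
abstract class SchemaSketch {
    abstract String getName();

    abstract List<String> getColumnNames();

    /**
     * A method added later with a default body: existing subclasses keep
     * compiling, which is the backwards-compatibility argument for an ABC.
     */
    boolean isEmpty() {
        return getColumnNames().isEmpty();
    }
}

/** Hypothetical general-purpose implementation for column-oriented sources. */
final class GenericSchemaSketch extends SchemaSketch {
    private final String name;
    private final List<String> columns = new ArrayList<>();

    GenericSchemaSketch(String name) {
        this.name = name;
    }

    GenericSchemaSketch addColumn(String column) {
        columns.add(column);
        return this;
    }

    @Override
    String getName() {
        return name;
    }

    @Override
    List<String> getColumnNames() {
        return Collections.unmodifiableList(columns);
    }
}
```

A connector such as HBase could then provide its own subclass alongside the generic one without the framework depending on either concrete type.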
          Gwen Shapira added a comment -

          Current thoughts are:

          • Every connector needs to support getSchema and return a schema object with a collection of columns. These can be defined in any way that makes sense to those writing the connector (e.g., an HDFS schema can be a single column representing a record).
          • Users should also be able to define a job with a fromSchema, a toSchema and a transformation (as part of the connector and framework forms). User schemas can be defined in JSON (we already support loading schemas from JSON), and the transformations are triads: { toColumn: column name, fromColumn: XPATH expression describing how to get the value of the column from the fromSchema, cast: optional explicit data type casting }
          • Users either supply both schemas and a transformation, or nothing at all and we'll use defaults (matching by column names? dumping entire row as text to a single record? We need to figure out sensible defaults and who controls them)
          • If users supply schemas and transformations we should be able to run some validation and confirm they make sense. This should happen when a job is first defined.
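
The triads and the job-definition-time validation proposed above could be sketched as follows. This is an illustration only: `ColumnTransform` and `TransformValidator` are hypothetical names, and `fromColumn` is simplified to a plain column name rather than the XPATH expression the proposal describes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical triad: target column, source column, optional cast. */
final class ColumnTransform {
    final String toColumn;
    final String fromColumn; // simplification: a column name, not an XPATH expression
    final String cast;       // may be null when no explicit cast is requested

    ColumnTransform(String toColumn, String fromColumn, String cast) {
        this.toColumn = toColumn;
        this.fromColumn = fromColumn;
        this.cast = cast;
    }
}

/** Hypothetical validator intended to run when a job is first defined. */
final class TransformValidator {

    /** Returns a list of human-readable problems; empty means the mapping is consistent. */
    static List<String> validate(List<ColumnTransform> transforms,
                                 Set<String> fromColumns, Set<String> toColumns) {
        List<String> errors = new ArrayList<>();
        Set<String> assigned = new HashSet<>();
        for (ColumnTransform t : transforms) {
            if (!toColumns.contains(t.toColumn)) {
                errors.add("Unknown target column: " + t.toColumn);
            } else if (!assigned.add(t.toColumn)) {
                errors.add("Target column assigned twice: " + t.toColumn);
            }
            if (!fromColumns.contains(t.fromColumn)) {
                errors.add("Unknown source column: " + t.fromColumn);
            }
        }
        return errors;
    }
}
```

Collecting all problems instead of failing on the first one lets the job form report every mismatch in a single pass.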

            People

            • Assignee:
              Gwen Shapira
              Reporter:
              Abraham Elmahrek
            • Votes:
              0
              Watchers:
              5

              Dates

              • Created:
                Updated:
                Resolved:

                Development