Uploaded image for project: 'Sqoop'
  1. Sqoop
  2. SQOOP-3434

A method to set custom encoding for sqoop import on postgres connections.

    XMLWordPrintableJSON

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4.6
    • Fix Version/s: None
    • Component/s: connectors/postgresql
    • Labels:
      None
    • Environment:

      Postgres & its driver >= 9.4

      Sqoop 1.4.6

      Description

      The issue at hand could be described as below -

      a) Postgres DB encoding is set to SQL_ASCII

      b) Sqoop import by default is setting the encoding to UTF-8, thereby causing errors of the form -

      """
      ERROR: invalid byte sequence for encoding "UTF8": 0x80
      """

      c) Given the above, there should be a method to set up custom client_encoding for postgres.

      d) The alternate method is to use convert_to function for every column that has the issue and generate binary data.

      e.g.,

      """

      select id, name, convert_to(badchars1, 'SQL_ASCII'), convert_to(badchars2, 'SQL_ASCII'), convert_to(badchars3, 'SQL_ASCII'), dob from t1;

      """

       

      The steps to reproduce the issue are as below -

      a) DB & Tables setup. Please note that the issue could be reproduced with just 1 column and the data of chr(128) in it. But, my example here is to test a workaround of using convert_to function to overcome the issue.

      """

      -bash-4.2$ psql
      psql (9.4.21)
      Type "help" for help.

      postgres=# create database d0 template=template0 encoding='SQL_ASCII';
      CREATE DATABASE
      postgres=# create user u0 password 'password';
      CREATE ROLE
      postgres=# grant all privileges on database d0 to u0;
      GRANT
      postgres=# \c d0;
      You are now connected to database "d0" as user "postgres".
      d0=# create table t1 (id int, name varchar(128), badchars1 varchar(100), badchars2 varchar(100), badchars3 varchar(56), dob varchar(10));
      CREATE TABLE
      d0=# grant all privileges on table t1 to u0;
      GRANT
      d0=# insert into t1 values(1, 'venu', concat(chr(1),chr(2),chr(3),chr(4),chr(5),chr(6),chr(7),chr(8),chr(9),chr(10),chr(11),chr(12),chr(13),chr(14),chr(15),chr(16                    ),chr(17),chr(18),chr(19),chr(20),chr(21),chr(22),chr(23),chr(24),chr(25),chr(26),chr(27),chr(28),chr(29),chr(30),chr(31),chr(32),chr(33),chr(34),chr(35),chr(36),                    chr(37),chr(38),chr(39),chr(40),chr(41),chr(42),chr(43),chr(44),chr(45),chr(46),chr(47),chr(48),chr(49),chr(50),chr(51),chr(52),chr(53),chr(54),chr(55),chr(56),ch                    r(57),chr(58),chr(59),chr(60),chr(61),chr(62),chr(63),chr(64),chr(65),chr(66),chr(67),chr(68),chr(69),chr(70),chr(71),chr(72),chr(73),chr(74),chr(75),chr(76),chr(                    77),chr(78),chr(79),chr(80),chr(81),chr(82),chr(83),chr(84),chr(85),chr(86),chr(87),chr(88),chr(89),chr(90),chr(91),chr(92),chr(93),chr(94),chr(95),chr(96),chr(97                    ),chr(98),chr(99)), concat(chr(100),chr(101),chr(102),chr(103),chr(104),chr(105),chr(106),chr(107),chr(108),chr(109),chr(110),chr(111),chr(112),chr(113),chr(114),                    chr(115),chr(116),chr(117),chr(118),chr(119),chr(120),chr(121),chr(122),chr(123),chr(124),chr(125),chr(126),chr(127),chr(128),chr(129),chr(130),chr(131),chr(132),                    chr(133),chr(134),chr(135),chr(136),chr(137),chr(138),chr(139),chr(140),chr(141),chr(142),chr(143),chr(144),chr(145),chr(146),chr(147),chr(148),chr(149),chr(150),                    chr(151),chr(152),chr(153),chr(154),chr(155),chr(156),chr(157),chr(158),chr(159),chr(160),chr(161),chr(162),chr(163),chr(164),chr(165),chr(166),chr(167),chr(168),                    chr(169),chr(170),chr(171),chr(172),chr(173),chr(174),chr(175),chr(176),chr(177),chr(178),chr(179),chr(180),chr(181),chr(182),chr(183),chr(184),chr(185),chr(186),                    chr(187),chr(188),chr(189),chr(190),chr(191),chr(192),chr(193),chr(194),chr(195),chr(196),chr(197),chr(198),chr(199)), concat(chr(200),chr(201),chr(202),chr(203),                    chr(204),chr(205),chr(206),chr(207),chr(208),chr(209),chr(210),chr(211),chr(212),chr(213),chr(214),chr(215),chr(216),chr(217),chr(218),chr(219),chr(220),chr(221),                    chr(222),chr(223),chr(224),chr(225),chr(226),chr(227),chr(228),chr(229),chr(230),chr(231),chr(232),chr(233),chr(234),chr(235),chr(236),chr(237),chr(238),chr(239),                    chr(240),chr(241),chr(242),chr(243),chr(244),chr(245),chr(246),chr(247),chr(248),chr(249),chr(250),chr(251),chr(252),chr(253),chr(254),chr(255)), '1980-09-02');
      INSERT 0 1
      d0=# select * from t1;
      ERROR:  invalid byte sequence for encoding "UTF8": 0x80

      """

      b) The issue is noticed even from psql shell.

      c) However, if we set CLIENT_ENCODING to SQL_ASCII, the select works.

      d) sqoop eval allows us to run 2 instructions in the query field thereby allowing us to "set client_encoding=SQL_ASCII"
      sqoop eval --connect "jdbc:postgresql://$(hostname -f)/d0?allowEncodingChanges=True" --username u0 --password password --query "set client_encoding=SQL_ASCII;show client_encoding"
      e) However, the same method of setting client_encoding and running the actual select statement is not possible using sqoop import.

      =====

      Hence, the pray here is to identify a method to allow postgres DB to be able to set a custom client_encoding for sqoop imports.

       

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              venu.yanamandra@live.com Venu Yanamandra
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated: