Details
-
New Feature
-
Status: Open
-
Major
-
Resolution: Unresolved
-
1.4.6
-
None
-
None
-
Postgres & its driver >= 9.4
Sqoop 1.4.6
Description
The issue at hand could be described as below -
a) Postgres DB encoding is set to SQL_ASCII
b) Sqoop import by default is setting the encoding to UTF-8, thereby causing errors of the form -
"""
ERROR: invalid byte sequence for encoding "UTF8": 0x80
"""
c) Given the above, there should be a method to set up custom client_encoding for postgres.
d) The alternate method is to use convert_to function for every column that has the issue and generate binary data.
e.g.,
"""
select id, name, convert_to(badchars1, 'SQL_ASCII'), convert_to(badchars2, 'SQL_ASCII'), convert_to(badchars3, 'SQL_ASCII'), dob from t1;
"""
The steps to reproduce the issue are as below -
a) DB & Tables setup. Please note that the issue could be reproduced with just 1 column and the data of chr(128) in it. But, my example here is to test a workaround of using convert_to function to overcome the issue.
"""
-bash-4.2$ psql
psql (9.4.21)
Type "help" for help.
postgres=# create database d0 template=template0 encoding='SQL_ASCII';
CREATE DATABASE
postgres=# create user u0 password 'password';
CREATE ROLE
postgres=# grant all privileges on database d0 to u0;
GRANT
postgres=# \c d0;
You are now connected to database "d0" as user "postgres".
d0=# create table t1 (id int, name varchar(128), badchars1 varchar(100), badchars2 varchar(100), badchars3 varchar(56), dob varchar(10));
CREATE TABLE
d0=# grant all privileges on table t1 to u0;
GRANT
d0=# insert into t1 values(1, 'venu', concat(chr(1),chr(2),chr(3),chr(4),chr(5),chr(6),chr(7),chr(8),chr(9),chr(10),chr(11),chr(12),chr(13),chr(14),chr(15),chr(16 ),chr(17),chr(18),chr(19),chr(20),chr(21),chr(22),chr(23),chr(24),chr(25),chr(26),chr(27),chr(28),chr(29),chr(30),chr(31),chr(32),chr(33),chr(34),chr(35),chr(36), chr(37),chr(38),chr(39),chr(40),chr(41),chr(42),chr(43),chr(44),chr(45),chr(46),chr(47),chr(48),chr(49),chr(50),chr(51),chr(52),chr(53),chr(54),chr(55),chr(56),ch r(57),chr(58),chr(59),chr(60),chr(61),chr(62),chr(63),chr(64),chr(65),chr(66),chr(67),chr(68),chr(69),chr(70),chr(71),chr(72),chr(73),chr(74),chr(75),chr(76),chr( 77),chr(78),chr(79),chr(80),chr(81),chr(82),chr(83),chr(84),chr(85),chr(86),chr(87),chr(88),chr(89),chr(90),chr(91),chr(92),chr(93),chr(94),chr(95),chr(96),chr(97 ),chr(98),chr(99)), concat(chr(100),chr(101),chr(102),chr(103),chr(104),chr(105),chr(106),chr(107),chr(108),chr(109),chr(110),chr(111),chr(112),chr(113),chr(114), chr(115),chr(116),chr(117),chr(118),chr(119),chr(120),chr(121),chr(122),chr(123),chr(124),chr(125),chr(126),chr(127),chr(128),chr(129),chr(130),chr(131),chr(132), chr(133),chr(134),chr(135),chr(136),chr(137),chr(138),chr(139),chr(140),chr(141),chr(142),chr(143),chr(144),chr(145),chr(146),chr(147),chr(148),chr(149),chr(150), chr(151),chr(152),chr(153),chr(154),chr(155),chr(156),chr(157),chr(158),chr(159),chr(160),chr(161),chr(162),chr(163),chr(164),chr(165),chr(166),chr(167),chr(168), chr(169),chr(170),chr(171),chr(172),chr(173),chr(174),chr(175),chr(176),chr(177),chr(178),chr(179),chr(180),chr(181),chr(182),chr(183),chr(184),chr(185),chr(186), chr(187),chr(188),chr(189),chr(190),chr(191),chr(192),chr(193),chr(194),chr(195),chr(196),chr(197),chr(198),chr(199)), concat(chr(200),chr(201),chr(202),chr(203), chr(204),chr(205),chr(206),chr(207),chr(208),chr(209),chr(210),chr(211),chr(212),chr(213),chr(214),chr(215),chr(216),chr(217),chr(218),chr(219),chr(220),chr(221), chr(222),chr(223),chr(224),chr(225),chr(226),chr(227),chr(228),chr(229),chr(230),chr(231),chr(232),chr(233),chr(234),chr(235),chr(236),chr(237),chr(238),chr(239), chr(240),chr(241),chr(242),chr(243),chr(244),chr(245),chr(246),chr(247),chr(248),chr(249),chr(250),chr(251),chr(252),chr(253),chr(254),chr(255)), '1980-09-02');
INSERT 0 1
d0=# select * from t1;
ERROR: invalid byte sequence for encoding "UTF8": 0x80
"""
b) The issue is noticed even from psql shell.
c) However, if we set CLIENT_ENCODING to SQL_ASCII, the select works.
d) sqoop eval allows us to run 2 instructions in the query field thereby allowing us to "set client_encoding=SQL_ASCII"
sqoop eval --connect "jdbc:postgresql://$(hostname -f)/d0?allowEncodingChanges=True" --username u0 --password password --query "set client_encoding=SQL_ASCII;show client_encoding"
e) However, the same method of setting client_encoding and running the actual select statement is not possible using sqoop import.
=====
Hence, the pray here is to identify a method to allow postgres DB to be able to set a custom client_encoding for sqoop imports.