Azure Data Factory: working with “UCS-2 LE BOM” a.k.a. “UTF-16LE with Signature” a.k.a. “UTF-16” a.k.a. “Windows Unicode”

Cheat Code
Set Encoding to “UTF-16”

So we received this file in our blob storage and needed to convert it into an ORC file so our Hive could eat it fast n furious. Using the battle-proven Azure Data Factory (ADF), we quickly spun up a pipeline to read the blob data, which is zipped and tab-delimited, and used the copy activity to convert it to an ORC file on Azure Data Lake Storage Gen2 (ADLv2).

After running the job, we created an external table in Hive, and we fell off our chairs. A select * on the external table showed lots of extra new lines in the result.
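A likely explanation, and this is our guess rather than anything ADF told us: the file stores two bytes per character, so a reader that expects a single-byte-oriented encoding sees a NUL byte after every character, and the "\r" and "\n" of a Windows line break no longer sit next to each other. The little sketch below reproduces that effect. The sample row is made up, and latin-1 is only a stand-in for "the wrong encoding".

```python
import codecs

# Hypothetical sample: one tab-delimited row ending in a Windows line break.
row = "id\tname\r\n"

# What a "UCS-2 LE BOM" file holds on disk: the FF FE signature,
# followed by two bytes per character.
raw = codecs.BOM_UTF16_LE + row.encode("utf-16-le")

# Decoding with a single-byte encoding (latin-1 used as a stand-in)
# leaves a NUL after every character, so "\r" and "\n" get split apart.
wrong = raw.decode("latin-1")

# Decoding as UTF-16 consumes the signature and restores the original row.
right = raw.decode("utf-16")

print(repr(wrong))
print(repr(right))
```

With "\r" and "\n" separated by a NUL, a reader can plausibly treat them as two line breaks instead of one, which would match the extra new lines we saw.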

We fired up Notepad++ and EmEditor to check the encoding, and saw “UCS-2 LE BOM” in Notepad++ and “UTF-16LE with Signature” in EmEditor… Bloody heck!
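If you have no editor at hand, the same check can be sketched in a few lines of Python by looking at the first bytes of the file for a byte-order mark (the "signature" both editors are talking about). The function name `sniff_bom` and the labels it returns are our own, not something from ADF or either editor.

```python
import codecs

# Byte-order marks, longest first, so UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8 with BOM"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_bom(head: bytes) -> str:
    """Return a label for the byte-order mark at the start of `head`."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return "no BOM found"
```

To use it, open the unzipped file in binary mode and pass in the first four bytes, e.g. `sniff_bom(open(path, "rb").read(4))`. It only detects a signature; a file without one needs a proper encoding detector.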

UTF-16LE with Signature – DAMN Scary mate!
UCS-2 LE BOM – What a BOMB!

Encoding -> UTF-16

So using the cheat above, we can load the tab-delimited file to ADLv2 as ORC, and now everyone lives happily ever after.
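For those who prefer the JSON view over clicking through the UI, here is a rough sketch of what the source dataset might look like, built as a Python dict and printed as JSON. The property names follow the DelimitedText dataset format as we understand it. The dataset name, linked service name, container and file name are placeholders, and the `ZipDeflate` codec is an assumption about what "zipped" means for your file.

```python
import json

# Sketch of a DelimitedText source dataset. "SourceTsvUtf16", "BlobStorageLS",
# "input" and "data.zip" are placeholders, not values from our real pipeline.
dataset = {
    "name": "SourceTsvUtf16",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "BlobStorageLS",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "fileName": "data.zip",
            },
            "columnDelimiter": "\t",           # tab-delimited source
            "encodingName": "UTF-16",          # the cheat code
            "compressionCodec": "ZipDeflate",  # assumption: a .zip archive
        },
    },
}

print(json.dumps(dataset, indent=2))
```

The only line that matters for this story is `encodingName`; everything else depends on your own storage setup.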

This is my personal view… I'm kind of glad that Microsoft provides a solution to an issue they invented themselves 😉 whereas UTF-8 just makes the whole world a better place to live.

Photo by Brett Sayles from Pexels

Published by Feivel

