(python) Unicode Decode Errors : subtypes specification

Prologue

The main Python documentation says that decoding may raise an error named UnicodeDecodeError, but it does not classify the specific kinds of this error.

I explain them here as far as I know.

Easiest solution to fix unicode decode error

The easiest fix is to try a different codec. Try the following in order from the top; they are ordered by how commonly they are used.

'utf-8'
'latin-1' a.k.a. 'ISO-8859-1'
The Windows ANSI codec for your locale, typically 'cp1252'
'utf-16-le' (on Windows, often shown as 'Unicode')
Any other codec used in your environment.
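
The trial-and-error approach above can be sketched as a small helper. This is only a minimal sketch: the function name decode_with_fallback and the default codec tuple are assumptions made for illustration. Note that 'latin-1' is placed last here, because it accepts every byte value and therefore never raises the error, so any codec listed after it would never be tried.

```python
def decode_with_fallback(data, codecs_to_try=("utf-8", "cp1252", "utf-16-le", "latin-1")):
    """Return (text, codec_name) for the first codec that decodes data without error.

    Hypothetical helper for illustration; adjust the codec list to your environment.
    """
    for name in codecs_to_try:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            # This codec rejected the bytes, so move on to the next candidate.
            continue
    raise ValueError("none of the codecs could decode the data")


text, used = decode_with_fallback(b'caf\xe9')
print(used, text)
```

A successful decode only means the bytes were acceptable to that codec. It does not prove that the resulting text is what the author of the data intended.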

SpecificErrorTypes of unicode decode error

In Python, this error is shown in the following format.

UnicodeDecodeError: '#codec_name' codec can't decode byte 0x## in position #: #SpecificErrorType

The #SpecificErrorType values are classified below.
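
The parts of this message are also available as attributes on the exception object, so they can be inspected in code instead of being parsed from the text. A minimal sketch follows; the helper name describe_decode_error is an assumption made for illustration, while encoding, object, start, end and reason are standard attributes of UnicodeDecodeError.

```python
def describe_decode_error(data, codec):
    """Return the parts of a UnicodeDecodeError as a dict, or None if decoding succeeds."""
    try:
        data.decode(codec)
    except UnicodeDecodeError as err:
        return {
            "codec": err.encoding,                     # '#codec_name'
            "bad_bytes": err.object[err.start:err.end],  # the bytes that failed
            "start": err.start,                        # 'position #'
            "end": err.end,                            # one past the last failing byte
            "reason": err.reason,                      # '#SpecificErrorType'
        }
    return None


print(describe_decode_error(b'\x00\x11\xff\xe2\x80', 'utf-8'))
```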

(If the error byte 0x## is 0xff, using 'utf-16-le' solved the problem in most cases. I don't know why.)
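
One likely explanation for the 0xff case, offered as an assumption rather than a confirmed cause: files saved as UTF-16 little-endian usually begin with the byte order mark b'\xff\xfe', and 0xff is never a valid byte in 'utf-8'. The sketch below checks for a leading byte order mark using constants from the standard codecs module. The function name sniff_bom is hypothetical.

```python
import codecs


def sniff_bom(data):
    """Return a codec name suggested by a leading byte order mark, or None."""
    # Longer marks are checked first, because the UTF-32 LE mark
    # begins with the same two bytes as the UTF-16 LE mark.
    candidates = [
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return None


sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
print(sniff_bom(sample))               # codec suggested by the mark
print(repr(sample.decode("utf-16")))     # 'utf-16' reads the mark and strips it
print(repr(sample.decode("utf-16-le")))  # 'utf-16-le' keeps it as '\ufeff'
```

When the mark is present, 'utf-16' may be a better choice than 'utf-16-le', because it removes the mark from the decoded text.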

Truncated Data

It appears when your input data has fewer bytes than the codec requires.

For example, 'utf-16-le' needs the input length to be a multiple of 16 bits (2 bytes), but s1 has only 3 bytes. The last byte cannot be decoded, because 'utf-16-le' cannot read a single leftover byte.

s1 = b'\xe2\x80\x99'
print(s1.decode("utf-16-le"))

>>>
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x99 in position 2: truncated data
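
To see that the leftover byte alone is what triggers the error, the input can be split into an even-length prefix and a remainder. This is a minimal sketch with a hypothetical helper name, decode_utf16le_prefix. It only illustrates the length rule; even-length input can still fail for other reasons, and for the bytes used here 'utf-16-le' is almost certainly the wrong codec anyway.

```python
def decode_utf16le_prefix(data):
    """Decode the longest even-length prefix as utf-16-le and return (text, leftover_bytes)."""
    usable = len(data) - (len(data) % 2)  # round down to a multiple of 2 bytes
    return data[:usable].decode("utf-16-le"), data[usable:]


s1 = b'\xe2\x80\x99'
text, leftover = decode_utf16le_prefix(s1)
print(repr(text), leftover)
```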

ordinal not in range

The input byte is outside the range that the codec can decode.

You can see this easily with 'ascii', because it only decodes bytes from 0x00 to 0x7F.

>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 0: ordinal not in range(128)

I recommend using 'latin-1' instead of 'ascii' as a default. It avoids this problem in many cases, because 'latin-1' accepts every byte value from 0x00 to 0xFF.
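
A short sketch of the difference, using sample bytes chosen only for illustration. It shows that 'latin-1' never raises this error, but also that a decode without an error is not necessarily a correct decode when the data was really written in another encoding.

```python
data = "café".encode("utf-8")  # b'caf\xc3\xa9'

try:
    data.decode("ascii")
    reason = None
except UnicodeDecodeError as err:
    reason = err.reason  # ascii rejects any byte above 0x7F

as_latin1 = data.decode("latin-1")  # no error, but each byte becomes its own character
as_utf8 = data.decode("utf-8")      # the codec the bytes were written with

print(reason)
print(as_latin1)
print(as_utf8)
```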

invalid start byte

The first byte of a unit does not follow the codec's rule.

The following is example code.

s1 = b'\x00\x11\xff\xe2\x80'
print(s1.decode("utf-8"))

>>>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte
# Note that \x00 and \x11 are decoded without error, because both are valid start bytes.

invalid continuation byte

A non-first byte of a unit does not follow the codec's rule.

The whole unit must be valid. If a unit starts with a correct byte but the data ends before the remaining bytes arrive, the error is 'unexpected end of data' instead.
In Python v3.9.6 this error reports the position of the first byte of the unit. It does not report the position of the non-first byte that caused the error, so you have to inspect the whole unit.

b1 = b'\xec\x9d\xb4\xec\x1d\xb4'
s1 = b1.decode('utf-8')

print(s1)

>>>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xec in position 3: invalid continuation byte

## Convert the hexadecimal values to binary and check each byte. This makes the problem easy to see.
  \xec\x9d\xb4
= e c 9 d b 4
= 1110 1100, 1001 1101, 1011 0100

  \xec\x1d\xb4
= e c 1 d b 4
= 1110 1100, 0001 1101, 1011 0100

# A non-first byte must have the form 0b 10xx xxxx. However, 0x1d == 0b 0001 1101.
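
The same hex-to-binary check can be done in code. This is a minimal sketch; the helper names show_bits and is_continuation are assumptions made for illustration.

```python
def show_bits(data):
    """Return each byte of data as an 8-digit binary string."""
    return [f"{byte:08b}" for byte in data]


def is_continuation(byte):
    """True if the byte has the form 10xx xxxx required of a utf-8 non-first byte."""
    return (byte & 0b1100_0000) == 0b1000_0000


for unit in (b'\xec\x9d\xb4', b'\xec\x1d\xb4'):
    # Print the bit patterns, then check every byte after the first one.
    print(show_bits(unit), [is_continuation(b) for b in unit[1:]])
```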

unexpected end of data

This appears when the data ends before a multi-byte unit is complete.

s0 = '이'
b = s0.encode('utf-8')
print(b)

>>>
b'\xec\x9d\xb4'

## Error code
b1 = b'\xec\x9d'
s1 = b1.decode('utf-8')

print(s1)

>>>
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data

Explanation: the unit starts with \xec.
By the rules of utf-8, the decoder tries to read 3 bytes and then decode them.
b1 has only 2 bytes, so the 3rd byte is missing.
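
A common way to run into this error is to decode data chunk by chunk, with a chunk boundary falling inside a unit. As a sketch under that assumption, the standard library's incremental decoder keeps the incomplete bytes until the rest arrives. The helper name decode_chunks is hypothetical; codecs.getincrementaldecoder is a standard function.

```python
import codecs


def decode_chunks(chunks, codec="utf-8"):
    """Decode an iterable of byte chunks that may split a multi-byte unit."""
    decoder = codecs.getincrementaldecoder(codec)()
    # Incomplete trailing bytes are buffered inside the decoder between calls.
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True makes the decoder raise if a unit is still incomplete at the end.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


print(decode_chunks([b'\xec\x9d', b'\xb4']))
```

If the data really is cut short, the final call still raises UnicodeDecodeError, so truncated input is not silently accepted.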

illegal multibyte sequence

It appears when the bytes form a unit that the codec does not map to any character.

You usually see this type with codecs other than 'utf-8'. I cannot say exactly which byte values are unassigned for a given codec; the unassigned values follow no general rule.

b1 = b'\xF8\x64'
s1 = b1.decode('cp949')

print(s1)

>>>
UnicodeDecodeError: 'cp949' codec can't decode byte 0xf8 in position 0: illegal multibyte sequence
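
When it is unclear which byte values a codec leaves unassigned, one way to locate them is to decode with errors='replace', a standard option of bytes.decode that substitutes U+FFFD for anything that cannot be decoded. The sketch below is for inspection only; it does not recover the original text.

```python
raw = b'\xF8\x64'

try:
    raw.decode('cp949')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Undecodable input is replaced with U+FFFD instead of raising an error.
replaced = raw.decode('cp949', errors='replace')

print(strict_failed)
print(repr(replaced))
```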

The byte rules of 'utf-8'

'utf-8' is the main text encoding in the 2020s.

Knowing its byte rules helps you quickly decide which codec to try next.

'utf-8' encodes each character as 1 to 4 bytes. (The original design also allowed 5-byte and 6-byte forms, but the current standard does not use them.)

utf-8 byte role	available values	not-available values	details
1-byte unit	\x00 ~ \x7F	\x80 ~ \xFF	Must be binary 0xxx xxxx. 'ascii' is compatible with 'utf-8'. Note that 'latin-1' is NOT compatible with 'utf-8' for bytes \x80 ~ \xFF.
non-first byte in a multi-byte unit	\x80 ~ \xBF	\x00 ~ \x7F or \xC0 ~ \xFF	Must be binary 10xx xxxx. The first hex digit can only be 8, 9, A or B (binary 1000 ~ 1011). An error occurs when it is 0~7 or C~F.
first byte of a 2-byte unit	\xC2 ~ \xDF	\x00 ~ \xC1 or \xE0 ~ \xFF	Must be binary 110x xxxx. The first hex digit can only be C or D (binary 1100, 1101). \xC0 and \xC1 fit this pattern but are never valid, because they could only encode values that fit in 1 byte.
first byte of a 3-byte unit	\xE0 ~ \xEF	\x00 ~ \xDF or \xF0 ~ \xFF	Must be binary 1110 xxxx. The first hex digit can only be E (binary 1110). An error occurs with any other value.
first byte of a 4-byte unit	\xF0 ~ \xF4	\x00 ~ \xEF or \xF5 ~ \xFF	Must be binary 1111 0xxx. The first hex digit can only be F (binary 1111). \xF5 ~ \xF7 fit this pattern but are never valid, because they would encode values above U+10FFFF.
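
The table can be turned into a small lookup function for checking a suspicious byte. This is a minimal sketch with a hypothetical function name, utf8_byte_role. As an addition to the bit patterns, it also reports 0xC0, 0xC1 and 0xF5 and above as invalid, since Python's 'utf-8' decoder rejects them as invalid start bytes even though they fit a first-byte pattern.

```python
def utf8_byte_role(byte):
    """Classify a single byte value (0-255) by the role it can play in utf-8."""
    if byte <= 0x7F:
        return "1-byte unit (ascii)"            # 0xxx xxxx
    if byte <= 0xBF:
        return "non-first byte"                 # 10xx xxxx
    if byte in (0xC0, 0xC1) or byte >= 0xF5:
        return "never valid in utf-8"
    if byte <= 0xDF:
        return "first byte of a 2-byte unit"    # 110x xxxx
    if byte <= 0xEF:
        return "first byte of a 3-byte unit"    # 1110 xxxx
    return "first byte of a 4-byte unit"        # 1111 0xxx


for value in b'\xec\x9d\xb4\xff':
    print(hex(value), utf8_byte_role(value))
```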

구글교 프로그래머 노가다꾼
