(Python) Unicode Decode Errors: Subtype Specification

Prologue

The official Python documentation says that decoding may raise an error named UnicodeDecodeError, but it does not classify the specific subtypes of this error.

Here I explain them as far as I know.

Easiest way to fix a unicode decode error

The easiest fix is to try a different codec. Try these from the top; they are ordered by how commonly they are used.

  1. 'utf-8'
  2. 'latin-1' a.k.a. 'ISO-8859-1'
  3. The Windows ANSI codec for your environment, typically 'cp1252'
  4. 'utf-16-le' (on Windows, probably shown as 'Unicode')
  5. Any other codec used in your environment.
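
The trial order above can be sketched as a small loop. This is only a minimal sketch: the function name decode_with_fallback and its default candidate tuple are hypothetical, chosen for illustration. One caveat: 'latin-1' maps every byte value to a character, so it never raises, and any codec listed after it is never reached.

```python
def decode_with_fallback(data, candidates=("utf-8", "latin-1", "cp1252", "utf-16-le")):
    """Return (text, codec_name) for the first candidate that decodes data.

    Hypothetical helper for illustration. Because 'latin-1' accepts every
    byte, candidates placed after it are never tried.
    """
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            # This codec rejected the bytes; move on to the next candidate.
            continue
    raise ValueError("none of the candidate codecs could decode the data")


text, used = decode_with_fallback(b"\xec\x9d\xb4")
print(used, text)

text, used = decode_with_fallback(b"\xbe")
print(used, text)
```

A successful decode only means the bytes were acceptable to that codec. It does not prove the codec is the one the data was written with, so check the resulting text by eye.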

 

SpecificErrorTypes of unicode decode error 

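In Python, you will see this error written in the following format.

  • UnicodeDecodeError: '#codec_name' codec can't decode byte 0x## in position #: #SpecificErrorType

Each #SpecificErrorType is described below.

(If the byte value 0x## is 0xff, using utf-16-le solved the problem in most cases. A likely reason: files saved as 'Unicode' on Windows begin with the UTF-16 little-endian byte order mark b'\xff\xfe', and 0xff is never valid in utf-8.)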
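
The parts of this message are also available as attributes on the exception object (encoding, object, start, end, reason), so they can be read without parsing the text. A minimal sketch, where describe_decode_error is a hypothetical helper name:

```python
def describe_decode_error(data, codec):
    """Return the fields of the UnicodeDecodeError raised for data, or None.

    Hypothetical helper for illustration; it only reads the standard
    attributes of the exception.
    """
    try:
        data.decode(codec)
    except UnicodeDecodeError as err:
        return {
            "codec": err.encoding,                        # '#codec_name'
            "bad_bytes": err.object[err.start:err.end],   # the byte(s) 0x##
            "position": err.start,                        # position #
            "reason": err.reason,                         # '#SpecificErrorType'
        }
    return None  # the data decoded without error


info = describe_decode_error(b"\x00\x11\xff\xe2\x80", "utf-8")
print(info)
```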

 

Truncated Data

It appears when your input data has fewer bytes than your codec requires.

  • For example, utf-16-le needs the input length to be a multiple of 16 bits (2 bytes), but s1 has only 3 bytes. The last byte cannot be decoded, because utf-16-le cannot read a single byte on its own.
s1 = b'\xe2\x80\x99'
print(s1.decode("utf-16-le"))

>>>
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x99 in position 2: truncated data

 

ordinal not in range

The input byte is outside the range the codec can decode.

You will see this easily when you use 'ascii', because it only decodes bytes from 0x00 to 0x7F.

>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 0: ordinal not in range(128)

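I recommend using 'latin-1' instead of 'ascii' as a default, which avoids this problem in many cases. Keep in mind that 'latin-1' accepts every byte from 0x00 to 0xFF, so it never raises this error, but the text will be wrong if the data was actually written with another codec.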
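
A minimal sketch of the difference, using the byte 0xbe from the message above. The variable names are only for illustration. The last lines show the trade-off: 'latin-1' never fails, but it silently gives wrong text for bytes that were written as 'utf-8'.

```python
raw = b"\xbe"

try:
    raw.decode("ascii")
except UnicodeDecodeError as err:
    reason = err.reason
    print("ascii failed:", reason)

# 'latin-1' maps each byte 0x00-0xFF to the code point with the same number.
text = raw.decode("latin-1")
print(repr(text))

# The same bytes that 'utf-8' reads as one character become three
# unrelated characters under 'latin-1'.
utf8_bytes = "이".encode("utf-8")
wrong = utf8_bytes.decode("latin-1")
print(len(wrong), repr(wrong))
```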

 

invalid start byte

The first byte of a unit does not follow the codec's rules.

The following is example code.

s1 = b'\x00\x11\xff\xe2\x80'
print(s1.decode("utf-8"))

>>>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte
# Note that \x00 and \x11 are decoded without error because both are valid start bytes.

 

invalid continuation byte

A non-first byte of a unit does not follow the codec's rules.

  • The whole unit is checked. If a unit starts with a valid byte but the bytes after it are missing, the error is 'unexpected end of data' instead.
  • In Python v3.9.6, this error reports the position of the first byte of the unit. It does not report the position of the non-first byte that caused the error. You have to inspect the whole unit.
b1 = b'\xec\x9d\xb4\xec\x1d\xb4'
s1 = b1.decode('utf-8')

print(s1)

>>>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xec in position 3: invalid continuation byte

## Convert the hexadecimal values to binary, then check each byte. This makes the problem easy to see.
  \xec\x9d\xb4
= e c 9 d b 4
= 1110 1100, 1001 1101, 1011 0100

  \xec\x1d\xb4
= e c 1 d b 4
= 1110 1100, 0001 1101, 1011 0100

# A non-first byte must have the form 0b 10xx xxxx. However, 0x 1d == 0b 0001 1101.
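
The same check can be done with a bit mask instead of by hand. A minimal sketch, where is_continuation is a hypothetical helper name: a byte has the form 10xx xxxx exactly when its top two bits are 10.

```python
def is_continuation(byte):
    """True if byte (an int 0-255) has the bit pattern 10xx xxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000


good_unit = b"\xec\x9d\xb4"
bad_unit = b"\xec\x1d\xb4"

# Skip the first byte of each unit and test only the non-first bytes.
good_flags = [is_continuation(b) for b in good_unit[1:]]
bad_flags = [is_continuation(b) for b in bad_unit[1:]]

print(good_flags)
print(bad_flags)
```

Iterating over a bytes object yields integers, which is why the mask can be applied directly.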

 

unexpected end of data

This appears when a unit is cut off: the input ends before all the bytes of the unit have been read.

s0 = '이'
b = s0.encode('utf-8') 
>>> 
b'\xec\x9d\xb4'

## Error code
b1 = b'\xec\x9d'
s1 = b1.decode('utf-8')

print(s1)

>>>
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data

'''
Explanation: the unit starts with \xec.
By the rules of utf-8, the decoder tries to read 3 bytes and then decode them.
b1 has only 2 bytes, so the 3rd byte is missing.
'''
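
One common way to hit this error is reading data in chunks, where a chunk boundary falls in the middle of a unit. This is an assumption about typical usage, not something stated above. In that case the standard library's incremental decoder can hold the incomplete bytes until the rest arrives. A minimal sketch:

```python
import codecs

# An incremental decoder buffers an incomplete unit instead of raising.
decoder = codecs.getincrementaldecoder("utf-8")()

first = decoder.decode(b"\xec\x9d")   # unit incomplete: nothing is returned yet
second = decoder.decode(b"\xb4")      # the 3rd byte arrives: unit is decoded

print(repr(first), repr(second))

# If the input really ends in the middle of a unit, final=True reports it.
try:
    codecs.getincrementaldecoder("utf-8")().decode(b"\xec\x9d", final=True)
except UnicodeDecodeError as err:
    end_reason = err.reason
    print(end_reason)
```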

 

illegal multibyte sequence

It appears when the bytes of a unit do not map to any defined character.

You usually see this type when using a codec other than 'utf-8'. I cannot say exactly which byte values are undefined for a given codec; undefined values follow no general rule.

b1 = b'\xF8\x64'
s1 = b1.decode('cp949')

print(s1)

>>>
UnicodeDecodeError: 'cp949' codec can't decode byte 0xf8 in position 0: illegal multibyte sequence

Byte usage rules of 'utf-8'

'utf-8' is the main encoding in the 2020s.

Knowing its byte usage rules helps you quickly decide which codec to try next.

A character is encoded into 1~4 bytes with 'utf-8' (sequences of 5 or more bytes are possible in theory, but are never used)
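
A quick way to see the 1~4 byte sizes is to encode one sample character of each size. The four characters below are arbitrary examples chosen for illustration.

```python
# One sample character for each utf-8 size: U+0041, U+00E9, U+C774, U+1F600.
samples = ["A", "\u00e9", "\uc774", "\U0001F600"]

sizes = []
for ch in samples:
    encoded = ch.encode("utf-8")
    sizes.append(len(encoded))
    print(f"U+{ord(ch):04X}", len(encoded), encoded)
```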

| utf-8 byte size | available values | not-available values | details |
|---|---|---|---|
| 1 byte | \x00 ~ \x7F | \x80 ~ \xFF | Must be Bin 0xxx xxxx. 'ascii' is compatible with 'utf-8'. Note that the 'latin-1' codec is NOT compatible with 'utf-8'. |
| non-first byte in multiple bytes | \x80 ~ \xBF | \x00 ~ \x7F or \xC0 ~ \xFF | Must be Bin 10xx xxxx. Can only start with Hex 8, 9, A, B (Bin 1000 ~ 1011). An error occurs when it starts with Hex 0~7 or C~F. |
| first byte in 2 bytes | \xC0 ~ \xDF | \x00 ~ \xBF or \xE0 ~ \xFF | Must be Bin 110x xxxx. Can only start with Hex C, D (Bin 1100, 1101). An error occurs when it starts with Hex 0~B or E, F. (In practice \xC0 and \xC1 are also rejected, because they could only form overlong encodings.) |
| first byte in 3 bytes | \xE0 ~ \xEF | \x00 ~ \xDF or \xF0 ~ \xFF | Must be Bin 1110 xxxx. Can only start with Hex E (Bin 1110). An error occurs when it starts with any other Hex value. |
| first byte in 4 bytes | \xF0 ~ \xF7 | \x00 ~ \xEF or \xF8 ~ \xFF | Must be Bin 1111 0xxx. Can only start with Hex F (Bin 1111), and the second Hex digit must be 0~7 (Bin 0000 ~ 0111). (In practice \xF5 ~ \xF7 are also rejected, because they would encode values above U+10FFFF.) |
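
The bit patterns above can be turned into a small lookup that tells you what role a byte could play in 'utf-8'. This is a minimal sketch that follows only the bit patterns; utf8_role is a hypothetical name. Python's real decoder is stricter and also rejects \xC0, \xC1 and \xF5 ~ \xF7 as start bytes.

```python
def utf8_role(byte):
    """Classify a byte (int 0-255) by its leading bits only.

    Hypothetical helper for illustration. It does not apply the stricter
    checks of a real decoder (0xC0, 0xC1 and 0xF5-0xF7 are also invalid).
    """
    if byte <= 0x7F:
        return "1-byte character"      # 0xxx xxxx
    if byte <= 0xBF:
        return "non-first byte"        # 10xx xxxx
    if byte <= 0xDF:
        return "first byte of 2"       # 110x xxxx
    if byte <= 0xEF:
        return "first byte of 3"       # 1110 xxxx
    if byte <= 0xF7:
        return "first byte of 4"       # 1111 0xxx
    return "never valid"               # 1111 1xxx


roles = [utf8_role(b) for b in b"\x00\x11\xff\xec\x9d\xb4"]
for role in roles:
    print(role)
```

Reading an error message next to this classification shows quickly whether the failing byte could be a start byte at all, which hints at whether the data is 'utf-8' in the first place.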