Prologue
The official Python documentation says that decoding may raise an error named UnicodeDecodeError, but it does not classify the specific causes of this error.
I explain them here as far as I know.
Easiest solution to fix a Unicode decode error
The easiest fix is to try a different codec. Try these starting from the top; they are ordered by how often they are used.
- 'utf-8'
- 'latin-1' a.k.a 'ISO-8859-1'
- The Windows ANSI code page for your locale, typically 'cp1252'
- 'utf-16-le' (on Windows, probably shown as 'Unicode')
- any other codec used in your environment.
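The trial-and-error approach above can be sketched as a small helper. This is only a minimal sketch: the function name `decode_with_fallback` and its default codec tuple are assumptions made for illustration. Note that 'latin-1' maps every possible byte to a character, so it never raises, and any codec listed after it is never reached. A decode that succeeds is also not proof that the text is what the author meant.

```python
def decode_with_fallback(data, codecs=("utf-8", "latin-1", "cp1252", "utf-16-le")):
    """Return (text, codec_name) for the first codec that decodes data without error."""
    for name in codecs:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            # This codec rejected the bytes; try the next one in the list.
            continue
    raise ValueError("none of the given codecs could decode the data")


# b'caf\xe9' is not valid utf-8, so the helper falls through to 'latin-1'.
text, used = decode_with_fallback("café".encode("cp1252"))
print(used, text)
```

If you want to tell the remaining codecs apart, put 'latin-1' last in the tuple, since it accepts anything.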
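SpecificErrorTypes of Unicode decode error
In Python, you will see this error written in the following format.
- UnicodeDecodeError: '#codec_name' codec can't decode byte 0x## in position #: #SpecificErrorType
Each #SpecificErrorType is described below.
(If the byte value 0x## is 0xff, using utf-16-le solved the problem in most cases I have seen. I am not sure why, but a likely reason is that data saved as 'Unicode' on Windows starts with the UTF-16 little-endian byte order mark b'\xff\xfe', and 0xff is never a valid byte in utf-8.)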
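The parts of that message are also available as attributes on the exception object, which is handy when you want to inspect the offending bytes in code. The following is a small sketch; the helper name `describe_decode_error` is hypothetical, while `encoding`, `object`, `start`, `end` and `reason` are the standard attributes of UnicodeDecodeError.

```python
def describe_decode_error(data, codec):
    """Return the parts of the UnicodeDecodeError message as a dict, or None if decoding works."""
    try:
        data.decode(codec)
    except UnicodeDecodeError as exc:
        return {
            "codec": exc.encoding,                       # '#codec_name'
            "bad_bytes": exc.object[exc.start:exc.end],  # the byte(s) shown as 0x##
            "position": exc.start,                       # 'position #'
            "reason": exc.reason,                        # '#SpecificErrorType'
        }
    return None


info = describe_decode_error(b"\x00\x11\xff", "utf-8")
print(info)
```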
Truncated Data
This appears when your input data has fewer bytes than your codec requires.
- For example, utf-16-le needs the input length to be a multiple of 16 bits (2 bytes), but s1 has only 3 bytes. The last byte cannot be decoded, because utf-16-le cannot read a single byte on its own.
s1 = b'\xe2\x80\x99'
print(s1.decode("utf-16-le"))
>>>
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x99 in position 2: truncated data
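A quick way to confirm this cause is to check the length against the codec's unit size. The sketch below assumes 2-byte units (true for utf-16-le when no surrogate pairs are cut in half) and reuses s1 from the example; the variable names are only illustrative. In this particular case the three bytes are actually valid utf-8, so the real fix is a different codec rather than dropping the last byte.

```python
s1 = b'\xe2\x80\x99'

# utf-16-le reads 2 bytes at a time, so an odd length leaves 1 byte over.
leftover = len(s1) % 2
print("leftover bytes:", leftover)

# Decoding only the complete 2-byte units works, but gives an unrelated character.
complete_part = s1[:len(s1) - leftover]
as_utf16 = complete_part.decode("utf-16-le")
print("code point when read as utf-16-le:", hex(ord(as_utf16)))

# The same 3 bytes form one complete unit in utf-8.
as_utf8 = s1.decode("utf-8")
print("code point when read as utf-8:", hex(ord(as_utf8)))
```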
ordinal not in range
The input byte is outside the range that the codec supports.
You will see this most easily with 'ascii', because it only decodes bytes from 0x00 to 0x7F.
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 0: ordinal not in range(128)
I recommend using 'latin-1' instead of 'ascii'. It accepts every byte from 0x00 to 0xFF, so this error does not occur, although the resulting characters are only correct if the data really is 'latin-1'.
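The difference between the two codecs can be seen with the single byte from the message above. This is a minimal sketch; the variable names are illustrative only.

```python
data = b'\xbe'

# 'ascii' only covers 0x00 ~ 0x7F, so 0xbe raises.
try:
    data.decode('ascii')
except UnicodeDecodeError as exc:
    reason = exc.reason
    print(reason)

# 'latin-1' maps each byte 0x00 ~ 0xFF to the code point with the same number.
text = data.decode('latin-1')
print(hex(ord(text)))
```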
invalid start byte
The first byte of a unit does not follow the codec's rule for a first byte.
The following is example code.
s1 = b'\x00\x11\xff\xe2\x80'
print(s1.decode("utf-8"))
>>>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte
# Note that \x00 and \x11 are decoded without error, because each of them is a valid starting byte.
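To see which byte values are rejected as a starting byte, you can try every single-byte input and group the failures by reason. This is a sketch to run on your own interpreter; the exact grouping is decided by the decoder implementation, so treat the printed result as an observation rather than a specification.

```python
from collections import defaultdict

# Map each error reason to the list of single byte values that trigger it.
reasons = defaultdict(list)
for value in range(256):
    try:
        bytes([value]).decode('utf-8')
    except UnicodeDecodeError as exc:
        reasons[exc.reason].append(value)

for reason, values in reasons.items():
    print(reason, "->", len(values), "byte values, e.g.", hex(values[0]), "and", hex(values[-1]))
```

Bytes that are valid first bytes of a longer unit show up under 'unexpected end of data' here, because a lone first byte has no following bytes.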
invalid continuation byte
A non-first byte of a unit does not follow the codec's rule for non-first bytes.
- Every byte of the unit is present. If the unit starts with a valid first byte but some of the bytes after it are missing, you get 'unexpected end of data' instead.
- In Python 3.9.6, this error reports the position of the first byte of the unit, not the position of the non-first byte that caused the error. You have to inspect the whole unit.
b1 = b'\xec\x9d\xb4\xec\x1d\xb4'
s1 = b1.decode('utf-8')
print(s1)
>>>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xec in position 3: invalid continuation byte
## Convert the hexadecimal values to binary, then check each byte. This makes the problem easy to see.
\xec\x9d\xb4
= e c 9 d b 4
= 1110 1100, 1001 1101, 1011 0100
\xec\x1d\xb4
= e c 1 d b 4
= 1110 1100, 0001 1101, 1011 0100
# A non-first byte must have the form 0b 10xx xxxx. However, 0x1d == 0b 0001 1101.
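The same hex-to-binary check can be done in code. This is a minimal sketch; `is_continuation` is a hypothetical helper name, and the test is simply whether the top two bits of the byte are 10.

```python
def is_continuation(byte):
    """True if the byte has the form 0b10xxxxxx (a valid non-first byte in utf-8)."""
    return (byte & 0b1100_0000) == 0b1000_0000


for unit in (b'\xec\x9d\xb4', b'\xec\x1d\xb4'):
    # Skip the first byte and check the bytes that follow it.
    for byte in unit[1:]:
        print(f"{byte:#04x} = {byte:08b}  continuation: {is_continuation(byte)}")
```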
unexpected end of data
This appears when a unit is missing one or more of its trailing bytes.
s0 = '이'
b = s0.encode('utf-8')
>>>
b'\xec\x9d\xb4'
## Code that raises the error
b1 = b'\xec\x9d'
s1 = b1.decode('utf-8')
print(s1)
>>>
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data
# Explanation: the unit starts with \xec.
# By the rules of utf-8, the decoder tries to read 3 bytes and then decode them.
# b1 has only 2 bytes, so the 3rd byte is missing.
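One common way to end up with such a cut-off unit is reading data in chunks, where a chunk boundary falls in the middle of a character. As a sketch of how to cope with that case (an addition of mine, not part of the error classification itself), the standard library's incremental decoder keeps the incomplete bytes until the rest arrives. The chunk values below are the bytes of '이' split after the second byte.

```python
import codecs

# An incremental decoder buffers an incomplete unit instead of raising.
decoder = codecs.getincrementaldecoder('utf-8')()

first = decoder.decode(b'\xec\x9d')   # only 2 of the 3 bytes: nothing is returned yet
second = decoder.decode(b'\xb4')      # the 3rd byte arrives and completes the unit

print(len(first), len(second))
```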
illegal multibyte sequence
This appears when the bytes form a unit that the codec does not map to any character.
You will usually see this type when using a codec other than 'utf-8'. I cannot say exactly which byte values are undefined for a given codec, because there is no general rule for which values are left unassigned.
b1 = b'\xF8\x64'
s1 = b1.decode('cp949')
print(s1)
>>>
UnicodeDecodeError: 'cp949' codec can't decode byte 0xf8 in position 0: illegal multibyte sequence
The byte usage rules of 'utf-8'
'utf-8' is the main encoding nowadays, in the 2020s.
Knowing its byte usage rules will help you work out quickly which codec to try next.
A character is encoded to 1~4 bytes with 'utf-8' (5- and 6-byte forms existed in the original design, but the current standard does not allow them).
utf-8 byte size | available values | not-available values | details |
1 byte | \x00 ~ \x7F | \x80 ~ \xFF | Must be binary 0xxx xxxx. 'ascii' is compatible with 'utf-8'. Note that the 'latin-1' codec is NOT compatible with 'utf-8'. |
non-first byte in a multi-byte unit | \x80 ~ \xBF | \x00 ~ \x7F or \xC0 ~ \xFF | Must be binary 10xx xxxx. The first hex digit can only be 8, 9, A, B (binary 1000 ~ 1011). An error occurs when it is 0~7 or C~F. |
first byte of a 2-byte unit | \xC0 ~ \xDF | \x00 ~ \xBF or \xE0 ~ \xFF | Must be binary 110x xxxx. The first hex digit can only be C or D (binary 1100, 1101). An error occurs when it is 0~B or E, F. (\xC0 and \xC1 fit the pattern but are still rejected, because they could only encode overlong forms.) |
first byte of a 3-byte unit | \xE0 ~ \xEF | \x00 ~ \xDF or \xF0 ~ \xFF | Must be binary 1110 xxxx. The first hex digit can only be E (binary 1110). An error occurs with any other first hex digit. |
first byte of a 4-byte unit | \xF0 ~ \xF7 | \x00 ~ \xEF or \xF8 ~ \xFF | Must be binary 1111 0xxx. The first hex digit can only be F (binary 1111), and the second hex digit can only be 0~7 (binary 0000 ~ 0111). (\xF5 ~ \xF7 fit the pattern but are still rejected, because they would encode values above U+10FFFF.) |
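The table can be turned into a small lookup function, which is convenient when you are staring at the byte reported in an error message. This is a sketch based on the bit patterns only; the function name `utf8_byte_role` is hypothetical. A real decoder is slightly stricter than the bit patterns: it also rejects \xC0, \xC1 and \xF5 ~ \xF7 as first bytes.

```python
def utf8_byte_role(byte):
    """Classify one byte value (0 ~ 255) by its utf-8 bit pattern."""
    if byte <= 0x7F:
        return "1-byte unit"                  # 0xxx xxxx
    if byte <= 0xBF:
        return "non-first byte"               # 10xx xxxx
    if byte <= 0xDF:
        return "first byte of a 2-byte unit"  # 110x xxxx
    if byte <= 0xEF:
        return "first byte of a 3-byte unit"  # 1110 xxxx
    if byte <= 0xF7:
        return "first byte of a 4-byte unit"  # 1111 0xxx
    return "not available in utf-8"           # 1111 1xxx


for value in (0x11, 0x9D, 0xEC, 0xFF):
    print(hex(value), "->", utf8_byte_role(value))
```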