Python 3 Standard Library: codecs, String Encoding and Decoding (3)


import binascii
import codecs


def to_hex(t, nbytes):
    """Format text t as a sequence of nbytes-long values
    separated by spaces.
    """
    chars_per_item = nbytes * 2
    hex_version = binascii.hexlify(t)
    return b' '.join(
        hex_version[start:start + chars_per_item]
        for start in range(0, len(hex_version), chars_per_item)
    )


# Pick the nonnative version of UTF-16 encoding
if codecs.BOM_UTF16 == codecs.BOM_UTF16_BE:
    bom = codecs.BOM_UTF16_LE
    encoding = 'utf_16_le'
else:
    bom = codecs.BOM_UTF16_BE
    encoding = 'utf_16_be'
print('Native order :', to_hex(codecs.BOM_UTF16, 2))
print('Selected order:', to_hex(bom, 2))

# Encode the text.
encoded_text = 'français'.encode(encoding)
print('{:14}: {}'.format(encoding, to_hex(encoded_text, 2)))

with open('nonnative-encoded.txt', mode='wb') as f:
    # Write the selected byte-order marker. It is not included
    # in the encoded text because the byte order was given
    # explicitly when selecting the encoding.
    f.write(bom)
    # Write the byte string for the encoded text.
    f.write(encoded_text)

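The script first works out the native byte order, then explicitly uses the alternate form so that the next example can demonstrate automatic byte-order detection while reading.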
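As a quick check of what "native" means here, the following minimal sketch compares codecs.BOM_UTF16 with sys.byteorder and confirms that the generic 'utf-16' codec prepends the native BOM when encoding. The variable names are illustrative and not part of the original example.

```python
import codecs
import sys

# codecs.BOM_UTF16 is an alias for whichever BOM matches this machine.
native_is_little = codecs.BOM_UTF16 == codecs.BOM_UTF16_LE
print('sys.byteorder         :', sys.byteorder)
print('native BOM is LE      :', native_is_little)

# Encoding with the generic 'utf-16' codec prepends the native BOM.
encoded = 'français'.encode('utf-16')
print('starts with native BOM:', encoded.startswith(codecs.BOM_UTF16))
```

The explicit codecs ('utf_16_le', 'utf_16_be') do not write a BOM, which is why the script above writes one by hand.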

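The next program does not specify a byte order when opening the file, so the decoder uses the BOM value in the first two bytes of the file to determine it.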

import codecs
import binascii


def to_hex(t, nbytes):
    """Format text t as a sequence of nbytes-long values
    separated by spaces.
    """
    chars_per_item = nbytes * 2
    hex_version = binascii.hexlify(t)
    return b' '.join(
        hex_version[start:start + chars_per_item]
        for start in range(0, len(hex_version), chars_per_item)
    )


# Look at the raw data
with open('nonnative-encoded.txt', mode='rb') as f:
    raw_bytes = f.read()
print('Raw    :', to_hex(raw_bytes, 2))

# Re-open the file and let codecs detect the BOM
with codecs.open('nonnative-encoded.txt',
                 mode='r',
                 encoding='utf-16',
                 ) as f:
    decoded_text = f.read()
print('Decoded:', repr(decoded_text))

Because the first two bytes of the file are used for byte-order detection, they are not included in the data returned by read().
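The same behavior can be seen in memory, without a file. This is a minimal sketch using bytes.decode(); the names data, auto and explicit are made up for illustration. The generic 'utf-16' codec consumes the BOM, while a codec with an explicit byte order treats it as ordinary data and returns it as the character U+FEFF.

```python
import codecs

text = 'français'
# Little-endian BOM followed by little-endian UTF-16 data.
data = codecs.BOM_UTF16_LE + text.encode('utf_16_le')

# The generic codec uses the BOM to pick the byte order and drops it.
auto = data.decode('utf-16')

# An explicit byte order leaves the BOM in the result as U+FEFF.
explicit = data.decode('utf_16_le')

print('utf-16   :', repr(auto))
print('utf_16_le:', repr(explicit))
```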

1.4 Error Handling

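The previous sections pointed out the need to know the encoding in use when reading and writing Unicode files. Setting the encoding correctly is important for two reasons. First, if the encoding is configured incorrectly while reading a file, the data is interpreted wrongly: it may be corrupted or may simply fail to decode. Second, not every Unicode character can be represented in every encoding, so if the wrong encoding is used while writing, an error is raised and data may be lost.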
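Both outcomes of reading with the wrong encoding can be reproduced in a few lines. This is a minimal sketch with illustrative names (data, failed, garbled): UTF-8 bytes decoded as ASCII fail outright, while the same bytes decoded as Latin-1 succeed but produce corrupted text.

```python
data = 'français'.encode('utf-8')

# A strict codec that cannot interpret the bytes raises an exception.
try:
    data.decode('ascii')
    failed = False
except UnicodeDecodeError as err:
    failed = True
    print('ERROR:', err)

# A wrong but permissive codec silently corrupts the text instead.
garbled = data.decode('latin-1')
print('latin-1:', repr(garbled))
```

The second case is the more dangerous one, because nothing signals that the data has been misread.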

codecs uses the same five error-handling options as the encode() method of str and the decode() method of bytes.

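Error mode	Description
`strict`	Raises an exception if the data cannot be converted.
`replace`	Substitutes a special marker character for data that cannot be encoded.
`ignore`	Skips the data.
`xmlcharrefreplace`	Replaces the data with an XML character reference (encoding only).
`backslashreplace`	Replaces the data with a backslash escape sequence (encoding; also decoding since Python 3.5).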
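The effect of each mode in the table can be compared directly with str.encode(). This is a minimal sketch that encodes one non-ASCII word as ASCII; the results dictionary is an illustrative helper.

```python
text = 'français'
modes = ['strict', 'replace', 'ignore',
         'xmlcharrefreplace', 'backslashreplace']

results = {}
for mode in modes:
    try:
        results[mode] = text.encode('ascii', errors=mode)
    except UnicodeEncodeError as err:
        # Only 'strict' is expected to land here.
        results[mode] = 'ERROR: {}'.format(err.reason)
    print('{:18}: {}'.format(mode, results[mode]))
```

Only strict refuses to produce output. The other modes all return bytes, but replace and ignore lose the original character, while xmlcharrefreplace and backslashreplace keep enough information to recover it.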

1.4.1 Encoding Errors

The most common error condition is receiving a UnicodeEncodeError when writing Unicode data to an ASCII output stream, such as a regular file or sys.stdout.

import codecs

error_handlings = ['strict', 'replace', 'ignore',
                   'xmlcharrefreplace', 'backslashreplace']
text = 'français'

for error_handling in error_handlings:
    try:
        # Save the data, encoded as ASCII, using the current
        # error handling mode.
        with codecs.open('encode_error.txt', 'w',
                         encoding='ascii',
                         errors=error_handling) as f:
            f.write(text)
    except UnicodeEncodeError as err:
        # Only the 'strict' mode is expected to raise here.
        print('ERROR:', err)
