PEP 100 – Python Unicode Integration

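Historical Note

This document was first written by Marc-Andre in the pre-PEP days, and was originally distributed as Misc/unicode.txt in Python distributions up to and including Python 2.1. The last revision of the proposal in that location was labeled as version 1.7 (CVS revision 3.10). Because the document clearly serves the purpose of an informational PEP in the post-PEP era, it has been moved here and reformatted to comply with PEP guidelines. Future revisions will be made to this document, while Misc/unicode.txt will contain a pointer to this PEP.

-Barry Warsaw, PEP editor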

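Introduction

The idea of this proposal is to add native Unicode 3.0 support to Python in a way that makes use of Unicode strings as simple as possible without introducing too many pitfalls along the way.

Since this goal is not easy to achieve (strings being one of the most fundamental objects in Python), we expect this proposal to undergo some significant refinements.

Note that the current version of this proposal is still a bit unsorted due to the many different aspects of the Unicode-Python integration.

[ed. note: new revisions should be made to this PEP document, while the historical record previous to version 1.7 should be retrieved from MAL's url, or Misc/unicode.txt]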

Conventions

In examples we use u = Unicode object and s = Python string
'XXX' markings indicate points of discussion (PODs)

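General Remarks

Unicode encoding names should be lower case on output and case-insensitive on input (they will be converted to lower case by all APIs taking an encoding name as input).
Encoding names should follow the name conventions used by the Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is written as 'utf-16'.
Codec modules should use the same names, but with hyphens converted to underscores, e.g. utf_8, utf_16, iso_8859_1.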
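
The naming rules above can be sketched in a few lines. Both helper names are hypothetical and exist only for illustration; they are not part of the proposal or of the codecs module.

```python
# Hypothetical helpers illustrating the naming rules; not a real API.
def normalize_encoding_name(name):
    """Lower-case the name and convert runs of spaces to hyphens."""
    return "-".join(name.strip().lower().split())


def codec_module_name(name):
    """Derive the codec module name: hyphens become underscores."""
    return normalize_encoding_name(name).replace("-", "_")


print(normalize_encoding_name("UTF 16"))
print(codec_module_name("ISO-8859-1"))
```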

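Unicode Default Encoding

The Unicode implementation has to make some assumption about the encoding of 8-bit strings passed to it for coercion, and about the encoding to use as default for conversion of Unicode to strings when no specific encoding is given. This encoding is called <default encoding> throughout this text.

For this, the implementation maintains a global which can be set in the site.py Python startup script. Subsequent changes are not possible. The <default encoding> can be set and queried using the two sys module APIs:

sys.setdefaultencoding(encoding)
Sets the <default encoding> used by the Unicode implementation. encoding has to be an encoding which is supported by the Python installation, otherwise a LookupError is raised.
Note: This API is only available in site.py! It is removed from the sys module by site.py after usage.
sys.getdefaultencoding()
Returns the current <default encoding>.

If not otherwise defined or set, the <default encoding> defaults to 'ascii'. This encoding is also the startup default of Python (and in effect before site.py is executed).

Note that the default site.py startup module contains disabled optional code which can set the <default encoding> according to the encoding defined by the current locale. The locale module is used to extract the encoding from the locale default settings defined by the OS environment (see locale.py). If the encoding cannot be determined, is unknown or unsupported, the code defaults to setting the <default encoding> to 'ascii'. To enable this code, edit the site.py file or place the appropriate code into the sitecustomize.py module of your Python installation.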
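
As a rough illustration, the query API can be exercised as follows. This is a sketch for Python 3, where sys.setdefaultencoding() no longer exists and the default is 'utf-8' rather than 'ascii'; the to_text helper is a hypothetical name introduced here.

```python
import sys

# sys.getdefaultencoding() is a real API. In Python 3 it reports 'utf-8';
# the 'ascii' default described above belongs to the Python 2 era.
default_encoding = sys.getdefaultencoding()
print(default_encoding)


def to_text(data, encoding=None):
    """Hypothetical helper: decode 8-bit data, falling back to the default."""
    return data.decode(encoding or sys.getdefaultencoding())


print(to_text(b"abc"))                 # relies on the default encoding
print(to_text(b"caf\xe9", "latin-1"))  # explicit encoding, no guessing
```

Passing the encoding explicitly, as in the last call, avoids depending on a per-installation setting.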

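Unicode Constructors

Python should provide a built-in constructor for Unicode strings which is available through __builtins__:

u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"])

u = u'<unicode-escape encoded Python string>'

u = ur'<raw-unicode-escape encoded Python string>'

The 'unicode-escape' encoding is defined as follows:

all non-escape characters represent themselves as Unicode ordinals (e.g. 'a' -> U+0061).

all existing defined Python escape sequences are interpreted as Unicode ordinals; note that \xXXXX can represent all Unicode ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.

a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax error to have fewer than 4 digits after \u.

For an explanation of possible values for errors see the Codec section below.

Examples:

u'abc'          -> U+0061 U+0062 U+0063
u'\u1234'       -> U+1234
u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+000A

The 'raw-unicode-escape' encoding is defined as follows:

a \uXXXX sequence represents the U+XXXX Unicode character if and only if the number of leading backslashes is odd
all other characters represent themselves as Unicode ordinals (e.g. 'b' -> U+0062)

Note that you should provide some hint to the encoding you used to write your programs as a pragma line in one of the first few comment lines of the source file (e.g. '# source file encoding: latin-1'). If you only use 7-bit ASCII then everything is fine and no such notice is needed, but if you include Latin-1 characters not defined in ASCII, it may well be worthwhile including a hint since people in other countries will want to be able to read your source strings too.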
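
The difference between the two escape encodings can be seen with the codecs that ship with Python 3 under the names 'unicode_escape' and 'raw_unicode_escape'. This is a minimal sketch, assuming those modern codecs behave like the encodings defined above; the u'' and ur'' literals themselves are Python 2 syntax.

```python
# The bytes contain literal backslashes, as they would appear in source text.
escaped = b"abc\\u1234\\n"

cooked = escaped.decode("unicode_escape")    # \u1234 and \n are interpreted
raw = escaped.decode("raw_unicode_escape")   # only \u1234 is interpreted

print([hex(ord(ch)) for ch in cooked])
print([hex(ord(ch)) for ch in raw])
```

In the raw variant the trailing backslash and 'n' survive as two separate characters, which is why raw strings are convenient for regular expressions.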

Unicode Type Object

Unicode objects should have the type UnicodeType with type name 'unicode', made available through the standard types module.

Unicode Output

Unicode objects have a method .encode([encoding=<default encoding>]) which returns a Python string encoding the Unicode string using the given scheme (see Codecs).

print u := print u.encode()   # using the <default encoding>

str(u)  := u.encode()         # using the <default encoding>

repr(u) := "u%s" % repr(u.encode('unicode-escape'))

Also see Internal Argument Parsing and Buffer Interface for details on how other APIs written in C will treat Unicode objects.
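
A short sketch of .encode() in Python 3, where str takes the place of the Unicode object and the result is a bytes object rather than a Python 2 string:

```python
u = "caf\u00e9 \u1234"

utf8 = u.encode("utf-8")
latin1 = u.encode("latin-1", "replace")   # U+1234 has no Latin-1 equivalent
escaped = u.encode("unicode_escape")      # the form repr(u) was built on

print(utf8)
print(latin1)
print(escaped)
```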

Unicode Ordinals

Since Unicode 3.0 has a 32-bit ordinal character set, the implementation should provide 32-bit aware ordinal conversion APIs:

ord(u[:1]) (this is the standard ord() extended to work with Unicode
            objects)
  --> Unicode ordinal number (32-bit)
unichr(i)
    --> Unicode object for character i (provided it is 32-bit);
        ValueError otherwise

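Both APIs should go into __builtins__ just like their string counterparts ord() and chr().

Note that Unicode provides space for private encodings. Usage of these can cause different output representations on different machines. This problem is not a Python or Unicode problem, but a machine setup and maintenance one.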
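
For illustration, the ordinal conversions look like this in Python 3, where chr() has absorbed the role of the proposed unichr(). The safe_chr wrapper is a hypothetical name used only in this sketch.

```python
ordinal = ord("\u1234")      # character -> ordinal
character = chr(0x1234)      # ordinal -> character


def safe_chr(i):
    """Hypothetical wrapper: return the character, or None when out of range."""
    try:
        return chr(i)
    except ValueError:
        return None


print(hex(ordinal), safe_chr(0x110000))
```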

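Comparison & Hash Value

Unicode objects should compare equal to other objects after these other objects have been coerced to Unicode. For strings this means that they are interpreted as Unicode strings using the <default encoding>.

Unicode objects should return the same hash value as their ASCII equivalent strings. Unicode strings holding non-ASCII values are not guaranteed to return the same hash values as the default encoded equivalent string representation.

When compared using cmp() (or PyObject_Compare()) the implementation should mask TypeErrors raised during the conversion to remain in synch with the string behavior. All other errors, such as ValueErrors raised during coercion of strings to Unicode, should not be masked and should be passed through to the user.

In containment tests ('a' in u'abc' and u'a' in 'abc') both sides should be coerced to Unicode before applying the test. Errors occurring during coercion (e.g. None in u'abc') should not be masked.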
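
The coerce-then-compare rule can be modelled explicitly. Python 3 no longer coerces bytes and str implicitly, so the sketch below uses hypothetical helper functions and assumes 'ascii' as the default encoding; it does not model the TypeError masking done by cmp().

```python
def coerce_to_text(obj, encoding="ascii"):
    """Hypothetical helper: decode 8-bit strings, pass text through unchanged."""
    if isinstance(obj, bytes):
        return obj.decode(encoding)   # may raise UnicodeDecodeError (a ValueError)
    if isinstance(obj, str):
        return obj
    raise TypeError("cannot coerce %s to text" % type(obj).__name__)


def equal_after_coercion(a, b):
    """Compare two objects after coercing both sides to text."""
    return coerce_to_text(a) == coerce_to_text(b)


def contains_after_coercion(item, container):
    """Containment test with both sides coerced first; errors are not masked."""
    return coerce_to_text(item) in coerce_to_text(container)


print(equal_after_coercion(b"abc", "abc"))
print(contains_after_coercion(b"a", "abc"))
```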

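Coercion

Using Python strings and Unicode objects to form new objects should always coerce to the more precise format, i.e. Unicode objects.

u + s := u + unicode(s)

s + u := unicode(s) + u

All string methods should delegate the call to an equivalent Unicode object method call by converting all involved strings to Unicode and then applying the arguments to the Unicode method of the same name, e.g.

string.join((s,u),sep) := (s + sep) + u

sep.join((s,u)) := (s + sep) + u

For a discussion of %-formatting w/r to Unicode objects, see Formatting Markers.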

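Exceptions

UnicodeError is defined in the exceptions module as a subclass of ValueError. It is available at the C level via PyExc_UnicodeError. All exceptions related to Unicode encoding/decoding should be subclasses of UnicodeError.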
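
The exception hierarchy can be checked directly. This sketch uses the built-in exception names of Python 3, where UnicodeDecodeError is one of the UnicodeError subclasses:

```python
try:
    b"\xff".decode("ascii")
except UnicodeError as exc:     # also catchable as ValueError
    caught = exc

print(type(caught).__name__)
print(issubclass(UnicodeError, ValueError))
```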

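Codecs (Coder/Decoders) Lookup

A Codec (see Codec Interface Definition) search registry should be implemented by a module 'codecs':

codecs.register(search_function)

Search functions are expected to take one argument, the encoding name in all lower case letters and with hyphens and spaces converted to underscores, and return a tuple of functions (encoder, decoder, stream_reader, stream_writer) taking the following arguments:

encoder and decoder

These must be functions or methods which have the same interface as the .encode/.decode methods of Codec instances (see Codec Interface). The functions/methods are expected to work in a stateless mode.

stream_reader and stream_writer

These need to be factory functions with the following interface:

factory(stream,errors='strict')

The factory functions must return objects providing the interfaces defined by StreamWriter/StreamReader respectively (see Codec Interface). Stream codecs can maintain state.

Possible values for errors are defined in the Codec section below.

In case a search function cannot find a given encoding, it should return None.

Aliasing support for encodings is left to the search functions to implement.

The codecs module will maintain an encoding cache for performance reasons. Encodings are first looked up in the cache. If not found, the list of registered search functions is scanned. If no codecs tuple is found, a LookupError is raised. Otherwise, the codecs tuple is stored in the cache and returned to the caller.

To query the Codec instance the following API should be used:

codecs.lookup(encoding)

This will either return the found codecs tuple or raise a LookupError.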
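
A minimal sketch of the registry in use, assuming Python 3's codecs module, where the returned tuple is a codecs.CodecInfo. The encoding name 'pep100_demo' and the function demo_search are made up for this example; the search function simply aliases the name to the UTF-8 codec.

```python
import codecs


def demo_search(name):
    """Hypothetical search function that knows a single alias."""
    # Normalise defensively so the check does not depend on how the
    # registry spells the name it passes in.
    key = name.lower().replace("-", "_").replace(" ", "_")
    if key == "pep100_demo":
        return codecs.lookup("utf-8")   # reuse the UTF-8 codec tuple
    return None                         # let other search functions try


codecs.register(demo_search)

info = codecs.lookup("pep100_demo")
encoded, consumed = info.encode("h\u00e9")   # stateless encoder
print(encoded, consumed)

try:
    codecs.lookup("pep100_no_such_encoding")
except LookupError as exc:
    print("lookup failed:", exc)
```

Returning None for unknown names is what lets several search functions coexist in the registry.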

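Standard Codecs

Standard codecs should live inside an encodings/ package directory in the Standard Python Code Library. The __init__.py file of that directory should include a Codec Lookup compatible search function implementing a lazy module based codec lookup.

Python should provide a few standard codecs for the most relevant encodings, e.g.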

'utf-8':              8-bit variable length encoding
'utf-16':             16-bit variable length encoding (little/big endian)
'utf-16-le':          utf-16 but explicitly little endian
'utf-16-be':          utf-16 but explicitly big endian
'ascii':              7-bit ASCII codepage
'iso-8859-1':         ISO 8859-1 (Latin 1) codepage
'unicode-escape':     See Unicode Constructors for a definition
'raw-unicode-escape': See Unicode Constructors for a definition
'native':             Dump of the Internal Format used by Python

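Common aliases should also be provided per default, e.g. 'latin-1' for 'iso-8859-1'.

Note: 'utf-16' should be implemented by using and requiring byte order marks (BOM) for file input/output.

All other encodings, such as the CJK ones to support Asian scripts, should be implemented in separate packages which do not get included in the core Python distribution and are not a part of this proposal.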
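
The following sketch exercises several of the listed codecs as they exist in Python 3. It only illustrates the byte layouts and the alias mechanism; the 'native' codec from the list above is not used.

```python
import codecs

text = "A\u00e9"

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")   # explicit byte order, no BOM
utf16_be = text.encode("utf-16-be")
latin1 = text.encode("iso-8859-1")
with_bom = text.encode("utf-16")      # starts with a byte order mark

# 'latin-1' is an alias and resolves to the same codec as 'iso-8859-1'.
same_codec = codecs.lookup("latin-1").name == codecs.lookup("iso-8859-1").name

print(utf8, utf16_le, utf16_be, latin1, with_bom[:2], same_codec)
```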

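Codecs Interface Definition

The following base classes should be defined in the module 'codecs'. They provide not only templates for use by encoding module implementors, but also define the interface which is expected by the Unicode implementation.

Note that the Codec Interface defined here is well suited for a larger range of applications. The Unicode implementation expects Unicode objects on input for .encode() and .write() and character buffer compatible objects on input for .decode(). Output of .encode() and .read() should be a Python string and .decode() must return a Unicode object.

First, we have the stateless encoders/decoders. These do not work in chunks as the stream codecs (see below) do, because all components are expected to be available in memory.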

class Codec:

    """Defines the interface for stateless encoders/decoders.

       The .encode()/.decode() methods may implement different
       error handling schemes by providing the errors argument.
       These string values are defined:

         'strict'  - raise an error (or a subclass)
         'ignore'  - ignore the character and continue with the next
         'replace' - replace with a suitable replacement character;
                     Python will use the official U+FFFD
                     REPLACEMENT CHARACTER for the builtin Unicode
                     codecs.
    """

    def encode(self,input,errors='strict'):

        """Encodes the object input and returns a tuple (output
           object, length consumed).

           errors defines the error handling to apply.  It
           defaults to 'strict' handling.

           The method may not store state in the Codec instance.
           Use StreamCodec for codecs which have to keep state in
           order to make encoding/decoding efficient.
        """

    def decode(self,input,errors='strict'):

        """Decodes the object input and returns a tuple (output
           object, length consumed).

           input must be an object which provides the
           bf_getreadbuf buffer slot.  Python strings, buffer
           objects and memory mapped files are examples of objects
           providing this slot.

           errors defines the error handling to apply.  It
           defaults to 'strict' handling.

           The method may not store state in the Codec instance.
           Use StreamCodec for codecs which have to keep state in
           order to make encoding/decoding efficient.
        """
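
The stateless interface and its error handling schemes can be tried with the real codecs.getencoder() and codecs.getdecoder() functions. This is a sketch for Python 3, where the decoder input is a bytes object:

```python
import codecs

encode = codecs.getencoder("ascii")
decode = codecs.getdecoder("ascii")

encoded = encode("abc")                  # (output object, length consumed)
ignored = decode(b"a\xffb", "ignore")    # the bad byte is dropped
replaced = decode(b"a\xffb", "replace")  # the bad byte becomes U+FFFD

try:
    decode(b"a\xffb", "strict")
    strict_error = None
except ValueError as exc:                # UnicodeDecodeError is a ValueError
    strict_error = type(exc).__name__

print(encoded, ignored, replaced, strict_error)
```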

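StreamWriter and StreamReader define the interface for stateful encoders/decoders which work on streams. These allow processing of the data in chunks to efficiently use memory. If you have large strings in memory, you may want to wrap them with cStringIO objects and then use these codecs on them to be able to do chunk processing as well, e.g. to provide progress information to the user.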

class StreamWriter(Codec):

    def __init__(self,stream,errors='strict'):

        """Creates a StreamWriter instance.

           stream must be a file-like object open for writing
           (binary) data.

           The StreamWriter may implement different error handling
           schemes by providing the errors keyword argument.
           These parameters are defined:

             'strict' - raise a ValueError (or a subclass)
             'ignore' - ignore the character and continue with the next
             'replace'- replace with a suitable replacement character
        """
        self.stream = stream
        self.errors = errors

    def write(self,object):      
      
        """Writes the object's contents encoded to self.stream.        
        """
        data, consumed = self.encode(object,self.errors)
        self.stream.write(data)

    def writelines(self, list):        
        
        """Writes the concatenated list of strings to the stream           
           using .write().        
        """
        self.write(''.join(list))

    def reset(self):

        """Flushes and resets the codec buffers used for keeping state.

           Calling this method should ensure that the data on the
           output is put into a clean state, that allows appending
           of new fresh data without having to rescan the whole
           stream to recover state.
        """
        pass

    def __getattr__(self,name, getattr=getattr):        
       """Inherit all other methods from the underlying stream.        
       """
       return getattr(self.stream,name)
       
class StreamReader(Codec):

    def __init__(self,stream,errors='strict'):    
        """Creates a StreamReader instance.        
           stream must be a file-like object open for reading           
           (binary) data.           
           
           The StreamReader may implement different error handling           
           schemes by providing the errors keyword argument.           
           These parameters are defined:           
           
             'strict' - raise a ValueError (or a subclass)             
             'ignore' - ignore the character and continue with the next             
             'replace'- replace with a suitable replacement character;        """
        self.stream = stream
        self.errors = errors

    def read(self,size=-1):

        """Decodes data from the stream self.stream and returns the
           resulting object.

           size indicates the approximate maximum number of bytes
           to read from the stream for decoding purposes.  The
           decoder can modify this setting as appropriate.  The
           default value -1 indicates to read and decode as much
           as possible.  size is intended to prevent having to
           decode huge files in one step.

           The method should use a greedy read strategy meaning
           that it should read as much data as is allowed within
           the definition of the encoding and the given size, e.g.
           if optional encoding endings or state markers are
           available on the stream, these should be read too.
        """
        # Unsliced reading:
        if size < 0:
            return self.decode(self.stream.read())[0]

        # Sliced reading:
        read = self.stream.read
        decode = self.decode
        data = read(size)
        i = 0
        while 1:
            try:
                object, decodedbytes = decode(data)
            except ValueError,why:
                # This method is slow but should work under pretty
                # much all conditions; at most 10 tries are made
                i = i + 1
                newdata = read(1)
                if not newdata or i > 10:
                    raise
                data = data + newdata
            else:
                return object

    def readline(self, size=None):

        """Read one line from the input stream and return the
           decoded data.

           Note: Unlike the .readlines() method, this method
           inherits the line breaking knowledge from the
           underlying stream's .readline() method -- there is
           currently no support for line breaking using the codec
           decoder due to lack of line buffering.  Subclasses
           should however, if possible, try to implement this
           method using their own knowledge of line breaking.

           size, if given, is passed as size argument to the
           stream's .readline() method.
        """
        if size is None:
            line = self.stream.readline()
        else:
            line = self.stream.readline(size)
        return self.decode(line)[0]

    def readlines(self, sizehint=None):

        """Read all lines available on the input stream
           and return them as list of lines.

           Line breaks are implemented using the codec's decoder
           method and are included in the list entries.

           sizehint, if given, is passed as size argument to the
           stream's .read() method.
        """
        if sizehint is None:
            data = self.stream.read()
        else:
            data = self.stream.read(sizehint)
        return self.decode(data)[0].splitlines(1)

    def reset(self):

        """Resets the codec buffers used for keeping state.

           Note that no stream repositioning should take place.
           This method is primarily intended to be able to recover
           from decoding errors.
        """
        pass

    def __getattr__(self,name, getattr=getattr):     
       """ Inherit all other methods from the underlying stream.        
       """
       return getattr(self.stream,name)
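
As a usage sketch, the stream interfaces can be exercised through the real codecs.getwriter() and codecs.getreader() factories. The example assumes Python 3 and uses io.BytesIO as the in-memory stream in place of cStringIO.

```python
import codecs
import io

raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)   # factory(stream, errors='strict')
writer.write("gr\u00fc\u00df\n")
writer.writelines(["one\n", "two\n"])

raw.seek(0)
reader = codecs.getreader("utf-8")(raw)
first = reader.readline()    # decoded text, line break included
rest = reader.readlines()    # remaining lines as a list

print(repr(first), rest)
```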

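Stream codec implementors are free to combine the StreamWriter and StreamReader interfaces into one class. Even combining all these with the Codec class should be possible.

Implementors are free to add additional methods to enhance the codec functionality or provide extra state information needed for them to work. The internal codec implementation will only use the above interfaces, though.

It is not required by the Unicode implementation to use these base classes, only the interfaces must match; this allows writing Codecs as extension types.

As a guideline, large mapping tables should be implemented using static C data in separate (shared) extension modules. That way multiple processes can share the same data.

A tool to auto-convert Unicode mapping files to mapping modules should be provided to simplify support for additional mappings (see References).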

Whitespace

The .split() method will have to know about what is considered whitespace in Unicode.
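
A small sketch of Unicode-aware splitting, using Python 3's str.split(). U+2003 (EM SPACE) and U+3000 (IDEOGRAPHIC SPACE) are treated as whitespace alongside the ASCII tab:

```python
text = "one\u2003two\u3000three\tfour"
words = text.split()    # no argument: split on any Unicode whitespace
print(words)
```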

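Case Conversion

Case conversion is rather complicated with Unicode data, since there are many different conditions to respect.

See the Unicode technical report on case mappings for some guidelines on implementing case conversion.

For Python, we should only implement the 1-1 conversions included in Unicode. Locale dependent and other special case conversions (see the Unicode standard file SpecialCasing.txt) should be left to user land routines and not go into the core interpreter.

The methods .capitalize() and .iscapitalized() should follow the case mapping algorithm defined in the above technical report as closely as possible.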

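Line Breaks

Line breaking should be done for all Unicode characters having the B property as well as the combinations CRLF, CR, LF (interpreted in that order) and other special line separators defined by the standard.

The Unicode type should provide a .splitlines() method which returns a list of lines according to the above specification. See Unicode Methods.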
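
The line breaking rules can be observed with Python 3's str.splitlines(). In this sketch CRLF is treated as one break, and U+2029 (PARAGRAPH SEPARATOR) stands in for the characters with the B property:

```python
text = "one\r\ntwo\rthree\nfour\u2029five"

lines = text.splitlines()            # line breaks removed
with_breaks = text.splitlines(True)  # line breaks kept

print(lines)
print(with_breaks)
```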

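Unicode Character Properties

A separate module 'unicodedata' should provide a compact interface to all Unicode character properties defined in the standard's UnicodeData.txt file.

Among other things, these properties provide ways to recognize numbers, digits, spaces, whitespace, etc.

Since this module will have to provide access to all Unicode characters, it will eventually have to contain the data from UnicodeData.txt, which takes up around 600kB. For this reason, the data should be stored in static C data. This enables compilation as a shared module which the underlying OS can share between processes (unlike normal Python code modules).

There should be a standard Python interface for accessing this information so that other implementors can plug in their own possibly enhanced versions, e.g. ones that do decompressing of the data on-the-fly.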
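
A brief sketch of property lookups with the unicodedata module as it exists in Python 3:

```python
import unicodedata

name = unicodedata.name("A")
category = unicodedata.category("A")            # 'Lu' = uppercase letter
digit = unicodedata.digit("7")
half = unicodedata.numeric("\u00bd")            # VULGAR FRACTION ONE HALF
space_category = unicodedata.category("\u2003") # 'Zs' = space separator

print(name, category, digit, half, space_category)
```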

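Private Code Point Areas

Support for these is left to user land Codecs and not explicitly integrated into the core. Note that due to the Internal Format being implemented, only the area between \uE000 and \uF8FF is usable for private encodings.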

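Internal Format

The internal format for Unicode objects should use a Python specific fixed format <PythonUnicode> implemented as 'unsigned short' (or another unsigned numeric type having 16 bits). Byte order is platform dependent.

This format will hold UTF-16 encodings of the corresponding Unicode ordinals. The Python Unicode implementation will address these values as if they were UCS-2 values. UCS-2 and UTF-16 are the same for all currently defined Unicode character points. UTF-16 without surrogates provides access to about 64k characters and covers all characters in the Basic Multilingual Plane (BMP) of Unicode.

It is the Codec's responsibility to ensure that the data they pass to the Unicode object constructor respects this assumption. The constructor does not check the data for Unicode compliance or use of surrogates.

Future implementations can extend the 32 bit restriction to the full set of all UTF-16 addressable characters (around 1M characters).

The Unicode API should provide interface routines from <PythonUnicode> to the compiler's wchar_t, which can be 16 or 32 bit depending on the compiler/libc/platform being used.

Unicode objects should have a pointer to a cached Python string object <defenc> holding the object's value using the <default encoding>. This is needed for performance and internal parsing (see Internal Argument Parsing) reasons. The buffer is filled when the first conversion request to the <default encoding> is issued on the object.

Interning is not needed (for now), since Python identifiers are defined as being ASCII only.

codecs.BOM should return the byte order mark (BOM) for the format used internally. The codecs module should provide the following additional constants for convenience and reference (codecs.BOM will either be BOM_BE or BOM_LE depending on the platform):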

BOM_BE: '\376\377'
  (corresponds to Unicode U+0000FEFF in UTF-16 on big endian
   platforms == ZERO WIDTH NO-BREAK SPACE)
   
BOM_LE: '\377\376'
  (corresponds to Unicode U+0000FFFE in UTF-16 on little endian
   platforms == defined as being an illegal Unicode character)
   
BOM4_BE: '\000\000\376\377'
  (corresponds to Unicode U+0000FEFF in UCS-4)
  
BOM4_LE: '\377\376\000\000'
  (corresponds to Unicode U+0000FFFE in UCS-4)

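Note that Unicode sees big endian byte order as being 'correct'. The swapped order is taken to be an indicator for a 'wrong' format, hence the illegal character definition.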
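
The two-byte constants can be checked against the codecs module of Python 3, where they are bytes objects. The utf16_byte_order helper is a hypothetical name used only in this sketch.

```python
import codecs
import sys

# codecs.BOM is the mark for the platform's native byte order.
native_bom = codecs.BOM_LE if sys.byteorder == "little" else codecs.BOM_BE


def utf16_byte_order(data):
    """Hypothetical helper: report the byte order announced by a UTF-16 BOM."""
    if data.startswith(codecs.BOM_BE):
        return "big"
    if data.startswith(codecs.BOM_LE):
        return "little"
    return None   # no BOM present


print(codecs.BOM == native_bom)
print(utf16_byte_order("abc".encode("utf-16")))
```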

The configure script should provide aid in deciding whether Python can use the native wchar_t type or not (it has to be a 16-bit unsigned type).

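Buffer Interface

Implement the buffer interface using the <defenc> Python string object as basis for bf_getcharbuf and the internal buffer for bf_getreadbuf. If bf_getcharbuf is requested and the <defenc> object does not yet exist, it is created first.

Note that as a special case, the parser marker 's#' will not return raw Unicode UTF-16 data (which bf_getreadbuf returns), but instead tries to encode the Unicode object using the default encoding and then returns a pointer to the resulting string object (or raises an exception in case the conversion fails). This was done in order to prevent accidentally writing binary data to an output stream which the other end might not recognize.

This has the advantage of being able to write to output streams (which typically use this interface) without additional specification of the encoding to use.

If you need to access the read buffer interface of Unicode objects, use the PyObject_AsReadBuffer() interface.

The internal format can also be accessed using the 'unicode-internal' codec, e.g. via u.encode('unicode-internal').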

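Pickle/Marshalling

Should have native Unicode object support. The objects should be encoded using platform independent encodings.

Marshal should use UTF-8 and Pickle should either choose Raw-Unicode-Escape (in text mode) or UTF-8 (in binary mode) as encoding. Using UTF-8 instead of UTF-16 has the advantage of eliminating the need to store a BOM mark.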
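
The encoding choice is visible in the pickle output of Python 3. In this sketch protocol 0 stands for the text mode and protocol 2 for a binary mode; marshal is only checked for a clean round trip.

```python
import marshal
import pickle

value = "\u1234"

text_pickle = pickle.dumps(value, protocol=0)    # text-mode protocol
binary_pickle = pickle.dumps(value, protocol=2)  # a binary protocol

print(text_pickle)      # contains the raw-unicode-escape form \u1234
print(binary_pickle)    # contains the UTF-8 bytes of U+1234
print(marshal.loads(marshal.dumps(value)) == value)
```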

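Regular Expressions

Secret Labs AB is working on a Unicode-aware regular expression machinery. It works on plain 8-bit, UCS-2, and (optionally) UCS-4 internal character buffers.

Also see the Unicode technical report on regular expressions for some remarks on how to treat Unicode REs.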

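Formatting Markers

Format markers are used in Python format strings. If Python strings are used as format strings, the following interpretations should be in effect: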

'%s': For Unicode objects this will cause coercion of the
      whole format string to Unicode.  Note that you should use
      a Unicode format string to start with for performance
      reasons.

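In case the format string is a Unicode object, all parameters are coerced to Unicode first and then put together and formatted according to the format string. Numbers are first converted to strings and then to Unicode.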

'%s': Python strings are interpreted as Unicode
      string using the <default encoding>.  Unicode objects are
      taken as is.

All other string formatters should work accordingly.

Example:

u"%s %s" % (u"abc", "abc")  ==  u"abc abc"

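Internal Argument Parsing

These markers are used by the PyArg_ParseTuple() APIs:

"U"
Check for Unicode object and return a pointer to it.
"s"
For Unicode objects: return a pointer to the object's <defenc> buffer (which uses the <default encoding>).
"s#"
Access to the default encoded version of the Unicode object (see Buffer Interface); note that the length relates to the length of the default encoded string rather than the Unicode object length.
"t#"
Same as "s#".
"es"
Takes two parameters: encoding (const char *) and buffer (char **).
The input object is first coerced to Unicode in the usual way and then encoded into a string using the given encoding.
On output, a buffer of the needed size is allocated and returned through *buffer as a NULL-terminated string. The encoded string may not contain embedded NULL characters. The caller is responsible for calling PyMem_Free() to free the allocated *buffer after usage.
"es#"
Takes three parameters: encoding (const char *), buffer (char **) and buffer_len (int *).
The input object is first coerced to Unicode in the usual way and then encoded into a string using the given encoding.
If *buffer is non-NULL, *buffer_len must be set to sizeof(buffer) on input. Output is then copied to *buffer.
If *buffer is NULL, a buffer of the needed size is allocated and output copied into it. *buffer is then updated to point to the allocated memory area. The caller is responsible for calling PyMem_Free() to free the allocated *buffer after usage.
In both cases *buffer_len is updated to the number of characters written (excluding the trailing NULL-byte). The output buffer is assured to be NULL-terminated.

Examples:

Using "es#" with auto-allocation: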

static PyObject *
test_parser(PyObject *self,
            PyObject *args)
{
    PyObject *str;
    const char *encoding = "latin-1";
    char *buffer = NULL;
    int buffer_len = 0;

    if (!PyArg_ParseTuple(args, "es#:test_parser",
                          encoding, &buffer, &buffer_len))
        return NULL;
    if (!buffer) {
        PyErr_SetString(PyExc_SystemError,
                        "buffer is NULL");
        return NULL;
    }
    str = PyString_FromStringAndSize(buffer, buffer_len);
    PyMem_Free(buffer);
    return str;
}

Using "es" with auto-allocation returning a NULL-terminated string:

static PyObject *
test_parser(PyObject *self,
            PyObject *args)
{
    PyObject *str;
    const char *encoding = "latin-1";
    char *buffer = NULL;

    if (!PyArg_ParseTuple(args, "es:test_parser",
                          encoding, &buffer))
        return NULL;
    if (!buffer) {
        PyErr_SetString(PyExc_SystemError,
                        "buffer is NULL");
        return NULL;
    }
    str = PyString_FromString(buffer);
    PyMem_Free(buffer);
    return str;
}

Using "es#" with a pre-allocated buffer:

static PyObject *
test_parser(PyObject *self,
            PyObject *args)
{
    PyObject *str;
    const char *encoding = "latin-1";
    char _buffer[10];
    char *buffer = _buffer;
    int buffer_len = sizeof(_buffer);

    if (!PyArg_ParseTuple(args, "es#:test_parser",
                          encoding, &buffer, &buffer_len))
        return NULL;
    if (!buffer) {
        PyErr_SetString(PyExc_SystemError,
                        "buffer is NULL");
        return NULL;
    }
    str = PyString_FromStringAndSize(buffer, buffer_len);
    return str;
}

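File/Stream Output

Since file.write(object) and most other stream writers use the "s#" or "t#" argument parsing marker for querying the data to write, the default encoded string version of the Unicode object will be written to the streams (see Buffer Interface).

For explicit handling of files using Unicode, the standard stream codecs as available through the codecs module should be used.

The codecs module should provide a short-cut open(filename,mode,encoding) which also assures that mode contains the 'b' character when needed.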
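
A minimal sketch of the short-cut, using the real codecs.open() function under Python 3. Newer Python releases favour the built-in open() with an encoding argument; the temporary file path is created only for this example.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# codecs.open() encodes on write and decodes on read.
with codecs.open(path, "w", "utf-8") as f:
    f.write("gr\u00fc\u00df\n")

with codecs.open(path, "r", "utf-8") as f:
    text = f.read()

# Reading the same file in binary mode shows the encoded bytes.
with open(path, "rb") as f:
    raw = f.read()

print(repr(text), raw)
```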

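File/Stream Input

Only the user knows what encoding the input data uses, so no special magic is applied. The user will have to explicitly convert the string data to Unicode objects as needed or use the file wrappers defined in the codecs module (see File/Stream Output).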

Unicode Methods & Attributes

All Python string methods, plus:

.encode([encoding=<default encoding>][,errors="strict"])
   --> see Unicode Output
 
.splitlines([include_breaks=0])
   --> breaks the Unicode string into a list of (Unicode) lines;
       returns the lines with line breaks included, if
       include_breaks is true.  See Line Breaks for a
       specification of how line breaking is done.

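Code Base

We should use Fredrik Lundh's Unicode object implementation as basis. It already implements most of the string methods needed and provides a well written code base which we can build upon.

The object sharing implemented in Fredrik's implementation should be dropped.

Test Cases

Test cases should follow those in Lib/test/test_string.py and include additional checks for the Codec Registry and the Standard Codecs.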

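History of this Proposal

[ed. note: revisions prior to 1.7 are available in the CVS history of Misc/unicode.txt from the standard Python distribution. All subsequent history is available via the CVS revisions on this file.]

1.7

Added note about the changed behaviour of "s#".

1.6

Changed <defencstr> to <defenc> since this is the name used in the implementation.
Added notes about the usage of <defenc> in the buffer protocol implementation.

1.5

Added notes about setting the <default encoding>.
Fixed some typos (thanks to Andrew Kuchling).
Changed <defencstr> to <utf8str>.

1.4

Added note about mixed type comparisons and contains tests.
Changed treating of Unicode objects in format strings (if used with '%s' % u they will now cause the format string to be coerced to Unicode, thus producing a Unicode object on return).
Added link to IANA charset names (thanks to Lars Marius Garshol).
Added new codec methods .readline(), .readlines() and .writelines().

1.3

Added new "es" and "es#" parser markers

1.2

Removed POD about codecs.open()

1.1

Added note about comparisons and hash values.
Added note about case mapping algorithms.
Changed stream codecs .read() and .write() methods to match the standard file-like object methods (bytes consumed information is no longer returned by the methods)

1.0

changed encode Codec method to be symmetric to the decode method (they both return (object, data consumed) now and thus become interchangeable);
removed __init__ method of Codec class (the methods are stateless) and moved the errors argument down to the methods;
made the Codec design more generic w/r to type of input and output objects;
changed StreamWriter.flush to StreamWriter.reset in order to avoid overriding the stream's .flush() method;
renamed .breaklines() to .splitlines();
renamed the module unicodec to codecs;
modified the File I/O section to refer to the stream codecs.

0.9

changed errors keyword argument definition;
added 'replace' error handling;
changed the codec APIs to accept buffer like objects on input;
some minor typo fixes;
added Whitespace section and included references for Unicode characters that have the whitespace and the line break characteristic;
added note that search functions can expect lower-case encoding names;
dropped slicing and offsets in the codec APIs

0.8

added encodings package and raw unicode escape encoding;
untabified the proposal;
added notes on Unicode format strings;
added .breaklines() method

0.7

added a whole new set of codec APIs;
added a different encoder lookup scheme;
fixed some names

0.6

changed "s#" to "t#";
changed <defencbuf> to <defencstr> holding a real Python string object;
changed Buffer Interface to delegate requests to <defencstr>'s buffer interface;
removed the explicit reference to the unicodec.codecs dictionary (the module can implement this in a way fit for the purpose);
removed the settable default encoding;
moved UnicodeError from unicodec to exceptions;
"s#" no longer returns the internal data;
passed the UCS-2/UTF-16 checking from the Unicode constructor to the Codecs

0.5

moved sys.bom to unicodec.BOM;
added sections on case mapping, private use encodings and Unicode character properties

0.4

added Codec interface, notes on %-formatting,
changed some encoding details,
added comments on stream wrappers,
fixed some discussion points (most important: Internal Format),
clarified the 'unicode-escape' encoding, added encoding references

0.3

added references, comments on codec modules, the internal format, bf_getcharbuffer and the RE engine;
added 'unicode-escape' encoding proposed by Tim Peters and fixed repr(u) accordingly

0.2

integrated Guido's suggestions, added stream codecs and file wrapping

0.1

first version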
