std.encoding

Classes and functions for handling and transcoding between various encodings.

For cases where the encoding is known at compile-time, functions are provided for arbitrary encoding and decoding of characters, arbitrary transcoding between strings of different type, as well as validation and sanitization.

Encodings currently supported are UTF-8, UTF-16, UTF-32, ASCII, ISO-8859-1 (also known as LATIN-1), and WINDOWS-1252.

    * The type AsciiChar represents an ASCII character.
    * The type AsciiString represents an ASCII string.
    * The type Latin1Char represents an ISO-8859-1 character.
    * The type Latin1String represents an ISO-8859-1 string.
    * The type Windows1252Char represents a Windows-1252 character.
    * The type Windows1252String represents a Windows-1252 string.

For cases where the encoding is not known at compile-time, but is known at run-time, we provide the abstract class EncodingScheme and its subclasses. To construct a run-time encoder/decoder, write, for example:

    auto e = EncodingScheme.create("utf-8");

This library supplies EncodingScheme subclasses for ASCII, ISO-8859-1 (also known as LATIN-1), WINDOWS-1252, UTF-8, and (on little-endian architectures) UTF-16LE and UTF-32LE; or (on big-endian architectures) UTF-16BE and UTF-32BE.

This library provides a mechanism whereby other modules may add EncodingScheme subclasses for any other encoding.

Authors:
Janice Caron

Date:
2008.02.27 - 2008.05.07

License:
Public Domain

dchar INVALID_SEQUENCE;
    Special value returned by safeDecode when the input begins with an invalid sequence.

typedef AsciiChar;
alias AsciiString;
    Defines an ASCII-encoded character (AsciiChar) and an ASCII-encoded string (AsciiString).

typedef Latin1Char;
    Defines a Latin1-encoded character.

alias Latin1String;
    Defines a Latin1-encoded string (as an array of invariant(Latin1Char)).

typedef Windows1252Char;
    Defines a Windows1252-encoded character.

alias Windows1252String;
    Defines a Windows1252-encoded string (as an array of invariant(Windows1252Char)).

bool isValidCodePoint(dchar c);
    Returns true if c is a valid code point

    Note that this includes the non-character code points U+FFFE and U+FFFF, since these are valid code points (even though they are not valid characters).

    Supersedes:
    This function supersedes std.utf.startsValidDchar().

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    dchar c the code point to be tested

string encodingName(T)();
    Returns the name of an encoding.

    The type of encoding cannot be deduced. Therefore, it is necessary to explicitly specify the encoding type.

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Examples:

     assert(encodingName!(Latin1Char) == "ISO-8859-1");

bool canEncode(E)(dchar c);
    Returns true iff it is possible to represent the specified code point in the encoding.

    The type of encoding cannot be deduced. Therefore, it is necessary to explicitly specify the encoding type.

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Examples:

     assert(canEncode!(Latin1Char)('A'));

bool isValidCodeUnit(E)(E c);
    Returns true if the code unit is legal. For example, the byte 0x80 would not be legal in ASCII, because ASCII code units must always be in the range 0x00 to 0x7F.

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    c the code unit to be tested
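
    As a rough illustration of the three compile-time queries above (encodingName, canEncode and isValidCodeUnit), the following sketch assumes a current D compiler with Phobos. The helper fitsIn is hypothetical and not part of std.encoding.

```d
import std.encoding;

/// Hypothetical helper: true if every code point of `text` can be
/// represented in the encoding whose code unit type is E.
bool fitsIn(E)(string text)
{
    foreach (dchar c; text)   // the language decodes the UTF-8 string to dchar here
    {
        if (!canEncode!E(c))
            return false;
    }
    return true;
}

void main()
{
    // The encoding type must be given explicitly; it cannot be deduced.
    assert(encodingName!Latin1Char == "ISO-8859-1");

    assert(fitsIn!AsciiChar("plain text"));
    assert(!fitsIn!AsciiChar("caf\u00E9"));   // U+00E9 lies outside ASCII
    assert(fitsIn!Latin1Char("caf\u00E9"));   // but inside ISO-8859-1

    // 0x80 is not a legal ASCII code unit.
    assert(!isValidCodeUnit(cast(AsciiChar) 0x80));
}
```

    A check like fitsIn is one way to decide, before transcoding, whether a narrower target encoding would lose information.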

bool isValid(E)(const(E)[] s);
    Returns true if the string is encoded correctly

    Supersedes:
    This function supersedes std.utf.validate(). Note, however, that this function returns a bool indicating whether the input was valid, whereas the older function would throw an exception.

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    s the string to be tested

uint validLength(E)(const(E)[] s);
    Returns the length of the longest possible substring, starting from the first code unit, which is validly encoded.

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    s the string to be tested

immutable(E)[] sanitize(E)(immutable(E)[] s);
    Sanitizes a string by replacing malformed code unit sequences with valid code unit sequences. The result is guaranteed to be valid for this encoding.

    If the input string is already valid, this function returns the original; otherwise it constructs a new string by replacing all illegal code unit sequences with the encoding's replacement character. Invalid sequences will be replaced with the Unicode replacement character (U+FFFD) if the character repertoire contains it; otherwise they will be replaced with '?'.

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    s the string to be sanitized
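
    The interplay of isValid, validLength and sanitize can be sketched as follows. This assumes a current Phobos, where the lengths are returned as size_t rather than the uint shown in this listing. The helper cleaned is hypothetical.

```d
import std.encoding;

/// Hypothetical helper: returns s unchanged when it is already valid,
/// otherwise a sanitized copy.
string cleaned(string s)
{
    return isValid(s) ? s : sanitize(s);
}

void main()
{
    // The byte 0xFF can never appear in well-formed UTF-8.
    string bad = "abc\xFFdef";

    assert(!isValid(bad));
    assert(validLength(bad) == 3);        // "abc" is the longest valid prefix

    string good = cleaned(bad);
    assert(isValid(good));
    assert(good == "abc\uFFFDdef");       // the bad byte became U+FFFD

    assert(cleaned("hello") == "hello");  // valid input passes through
}
```

    Sanitizing first is a simple way to satisfy the in-contracts of the functions below that require validly encoded input.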

uint firstSequence(E)(const(E)[] s);
    Returns the length of the first encoded sequence.

    The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    s the string to be sliced

uint lastSequence(E)(const(E)[] s);
    Returns the length of the last encoded sequence.

    The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    s the string to be sliced

uint count(E)(const(E)[] s);
    Returns the total number of code points encoded in a string.

    The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

    Supersedes:
    This function supersedes std.utf.toUCSindex().

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    s the string to be counted

int index(E)(const(E)[] s, int n);
    Returns the array index at which the (n+1)th code point begins.

    The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

    Supersedes:
    This function supersedes std.utf.toUTFindex().

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    s the string to be indexed
    n the zero-based number of the code point whose starting index is wanted
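
    The difference between code units and code points in firstSequence, lastSequence, count and index is easiest to see on a string containing a multi-unit character. The sketch below assumes a current Phobos. The helper fromCodePoint is hypothetical.

```d
import std.encoding;

/// Hypothetical helper: the tail of s starting at code point number
/// `from` (zero-based). s must be validly encoded.
string fromCodePoint(string s, int from)
{
    return s[cast(size_t) index(s, from) .. $];
}

void main()
{
    string s = "\u20AC100";           // the euro sign takes 3 UTF-8 code units

    assert(s.length == 6);            // code units
    assert(count(s) == 4);            // code points

    assert(firstSequence(s) == 3);    // length of the euro sign's encoding
    assert(lastSequence(s) == 1);     // length of the final '0'

    assert(index(s, 0) == 0);
    assert(index(s, 1) == 3);         // the second code point starts at offset 3

    assert(fromCodePoint(s, 1) == "100");
}
```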

dchar decode(S)(ref S s);
    Decodes a single code point.

    This function removes one or more code units from the start of a string, and returns the decoded code point which those code units represent.

    The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

    Supersedes:
    This function supersedes std.utf.decode(). Note, however, that the function codePoints() supersedes it more conveniently.

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    s the string whose first code point is to be decoded

dchar decodeReverse(E)(ref const(E)[] s);
    Decodes a single code point from the end of a string.

    This function removes one or more code units from the end of a string, and returns the decoded code point which those code units represent.

    The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    s the string whose last code point is to be decoded

dchar safeDecode(S)(ref S s);
    Decodes a single code point. The input does not have to be valid.

    This function removes one or more code units from the start of a string, and returns the decoded code point which those code units represent.

    This function will accept an invalidly encoded string as input. If an invalid sequence is found at the start of the string, this function will remove it, and return the value INVALID_SEQUENCE.

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    s the string whose first code point is to be decoded
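
    Because safeDecode consumes code units from the front of its argument and reports malformed input through INVALID_SEQUENCE, a whole string can be scanned with a simple loop. The sketch below assumes a current Phobos. The helper tally is hypothetical.

```d
import std.encoding;

/// Hypothetical helper: counts well-formed code points and malformed
/// sequences in s. The input does not have to be valid.
void tally(const(char)[] s, out size_t good, out size_t bad)
{
    while (s.length != 0)
    {
        dchar c = safeDecode(s);   // removes code units from the front of s
        if (c == INVALID_SEQUENCE)
            ++bad;
        else
            ++good;
    }
}

void main()
{
    size_t good, bad;

    tally("a\xFFb", good, bad);    // 0xFF is not valid UTF-8
    assert(good == 2 && bad == 1);

    tally("\u20AC", good, bad);    // one code point, three code units
    assert(good == 1 && bad == 0);
}
```

    Using decode in the same loop would be an error here, since its in-contract requires validly encoded input.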

uint encodedLength(E)(dchar c);
    Returns the number of code units required to encode a single code point.

    The input to this function MUST be a valid code point. This is enforced by the function's in-contract.

    The type of the output cannot be deduced. Therefore, it is necessary to explicitly specify the encoding as a template parameter.

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    c the code point to be encoded

E[] encode(E)(dchar c);
    Encodes a single code point.

    This function encodes a single code point into one or more code units. It returns a string containing those code units.

    The input to this function MUST be a valid code point. This is enforced by the function's in-contract.

    The type of the output cannot be deduced. Therefore, it is necessary to explicitly specify the encoding as a template parameter.

    Supersedes:
    This function supersedes std.utf.encode(). Note, however, that the function codeUnits() supersedes it more conveniently.

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    c the code point to be encoded
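
    A minimal sketch of encodedLength together with the array-returning form of encode, assuming a current Phobos. The helper toUtf8 is hypothetical. In practice transcode does this job for whole strings.

```d
import std.encoding;

/// Hypothetical helper: converts UTF-32 text to UTF-8 one code point
/// at a time, using the array-returning overload of encode.
char[] toUtf8(dstring text)
{
    char[] result;
    foreach (dchar c; text)
    {
        assert(isValidCodePoint(c));   // encode requires a valid code point
        result ~= encode!char(c);      // the target encoding must be explicit
    }
    return result;
}

void main()
{
    dchar euro = '\u20AC';

    assert(encodedLength!char(euro) == 3);           // three UTF-8 code units
    assert(encodedLength!wchar(euro) == 1);          // one UTF-16 code unit
    assert(encodedLength!wchar('\U0001F600') == 2);  // surrogate pair

    assert(encode!char(euro) == "\xE2\x82\xAC");
    assert(toUtf8("A\u20AC"d) == "A\xE2\x82\xAC");
}
```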

uint encode(E)(dchar c, E[] array);
    Encodes a single code point into an array.

    This function encodes a single code point into one or more code units. The code units are stored in a user-supplied fixed-size array, which must be passed by reference.

    The input to this function MUST be a valid code point. This is enforced by the function's in-contract.

    The type of the output cannot be deduced. Therefore, it is necessary to explicitly specify the encoding as a template parameter.

    Supersedes:
    This function supersedes std.utf.encode(). Note, however, that the function codeUnits() supersedes it more conveniently.

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    c the code point to be encoded

    Returns:
    the number of code units written to the array

uint encode(E, R)(dchar c, R range);
    Encodes c in units of type E and writes the result to the output range R. Returns the number of Es written.

void encode(E)(dchar c, void delegate(E) dg);
    Encodes a single code point to a delegate.

    This function encodes a single code point into one or more code units. The code units are passed one at a time to the supplied delegate.

    The input to this function MUST be a valid code point. This is enforced by the function's in-contract.

    The type of the output cannot be deduced. Therefore, it is necessary to explicitly specify the encoding as a template parameter.

    Supersedes:
    This function supersedes std.utf.encode(). Note, however, that the function codeUnits() supersedes it more conveniently.

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    c the code point to be encoded

CodePoints!(E) codePoints(E)(immutable(E)[] s);
    Returns a foreachable struct which can bidirectionally iterate over all code points in a string.

    The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

    You can foreach either with or without an index. If an index is specified, it will be initialized at each iteration with the offset into the string at which the code point begins.

    Supersedes:
    This function supersedes std.utf.decode().

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    s the string to be decoded

    Examples:

     string s = "hello world";
     foreach(c;codePoints(s))
     {
         // do something with c (which will always be a dchar)
     }

    Note that, currently, foreach(c;codePoints(s)) is superior to foreach(c;s) in that the latter will fail on encountering U+FFFF.

CodeUnits!(E) codeUnits(E)(dchar c);
    Returns a foreachable struct which can bidirectionally iterate over all code units in a code point.

    The input to this function MUST be a valid code point. This is enforced by the function's in-contract.

    The type of the output cannot be deduced. Therefore, it is necessary to explicitly specify the encoding type in the template parameter.

    Supersedes:
    This function supersedes std.utf.encode().

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    c the code point to be encoded

    Examples:

     dchar d = '\u20AC';
     foreach(c;codeUnits!(char)(d))
     {
         writefln("%X",c);
     }
     // will print
     // E2
     // 82
     // AC

uint encode(Tgt, Src, R)(in Src[] s, R range);
    Encodes the contents of s in units of type Tgt and writes the result to the output range R. Returns the number of Tgt elements written.

void transcode(Src, Dst)(immutable(Src)[] s, out immutable(Dst)[] r);
    Convert a string from one encoding to another. (See also to!() below).

    The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

    Supersedes:
    This function supersedes std.utf.toUTF8(), std.utf.toUTF16() and std.utf.toUTF32() (but note that to!() supersedes it more conveniently).

    Standards:
    Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

    Parameters:
    s the source string
    r the destination string

    Examples:

     wstring ws;
     transcode("hello world",ws);
         // transcode from UTF-8 to UTF-16

     Latin1String ls;
     transcode(ws, ls);
         // transcode from UTF-16 to ISO-8859-1

class EncodingException: object.Exception;
    The base class for exceptions thrown by this module

abstract class EncodingScheme;
    Abstract base class of all encoding schemes

    static void register(string className);
        Registers a subclass of EncodingScheme.

        This function allows user-defined subclasses of EncodingScheme to be declared in other modules.

        Examples:

         class Amiga1251 : EncodingScheme
         {
             static this()
             {
                 EncodingScheme.register("path.to.Amiga1251");
             }
         }

    static EncodingScheme create(string encodingName);
        Obtains a subclass of EncodingScheme which is capable of encoding and decoding the named encoding scheme.

        This function is only aware of EncodingSchemes which have been registered with the register() function.

        Examples:

         auto scheme = EncodingScheme.create("Amiga-1251");

    abstract const string toString();
        Returns the standard name of the encoding scheme

    abstract const immutable(char)[][] names();
        Returns an array of all known names for this encoding scheme

    abstract const bool canEncode(dchar c);
        Returns true if the character c can be represented in this encoding scheme.

    abstract const uint encodedLength(dchar c);
        Returns the number of ubytes required to encode this code point.

        The input to this function MUST be a valid code point.

        Parameters:
        dchar c the code point to be encoded

        Returns:
        the number of ubytes required.

    abstract const uint encode(dchar c, ubyte[] buffer);
        Encodes a single code point into a user-supplied, fixed-size buffer.

        This function encodes a single code point into one or more ubytes. The supplied buffer must be code unit aligned. (For example, UTF-16LE or UTF-16BE must be wchar-aligned, UTF-32LE or UTF-32BE must be dchar-aligned, etc.)

        The input to this function MUST be a valid code point.

        Parameters:
        dchar c the code point to be encoded

        Returns:
        the number of ubytes written.

    abstract const dchar decode(ref const(ubyte)[] s);
        Decodes a single code point.

        This function removes one or more ubytes from the start of an array, and returns the decoded code point which those ubytes represent.

        The input to this function MUST be validly encoded.

        Parameters:
        const(ubyte)[] s the array whose first code point is to be decoded

    abstract const dchar safeDecode(ref const(ubyte)[] s);
        Decodes a single code point. The input does not have to be valid.

        This function removes one or more ubytes from the start of an array, and returns the decoded code point which those ubytes represent.

        This function will accept an invalidly encoded array as input. If an invalid sequence is found at the start of the string, this function will remove it, and return the value INVALID_SEQUENCE.

        Parameters:
        const(ubyte)[] s the array whose first code point is to be decoded

    abstract const immutable(ubyte)[] replacementSequence();
        Returns the sequence of ubytes to be used to represent any character which cannot be represented in the encoding scheme.

        Normally this will be a representation of some substitution character, such as U+FFFD or '?'.

    bool isValid(const(ubyte)[] s);
        Returns true if the array is encoded correctly

        Parameters:
        const(ubyte)[] s the array to be tested

    uint validLength(const(ubyte)[] s);
        Returns the length of the longest possible substring, starting from the first element, which is validly encoded.

        Parameters:
        const(ubyte)[] s the array to be tested

    immutable(ubyte)[] sanitize(immutable(ubyte)[] s);
        Sanitizes an array by replacing malformed ubyte sequences with valid ubyte sequences. The result is guaranteed to be valid for this encoding scheme.

        If the input array is already valid, this function returns the original, otherwise it constructs a new array by replacing all illegal sequences with the encoding scheme's replacement sequence.

        Parameters:
        immutable(ubyte)[] s the string to be sanitized

    uint firstSequence(const(ubyte)[] s);
        Returns the length of the first encoded sequence.

        The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

        Parameters:
        const(ubyte)[] s the array to be sliced

    uint count(const(ubyte)[] s);
        Returns the total number of code points encoded in a ubyte array.

        The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

        Parameters:
        const(ubyte)[] s the string to be counted

    int index(const(ubyte)[] s, int n);
        Returns the array index at which the (n+1)th code point begins.

        The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

        Parameters:
        const(ubyte)[] s the array to be indexed
        int n the zero-based number of the code point whose starting index is wanted
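
        The run-time interface can be put together as in the sketch below, which assumes a current Phobos in which the built-in schemes are registered automatically. The helpers encodeWith and decodeWith are hypothetical, and the 8-byte buffer size is an assumption that is ample for one code point in the built-in schemes.

```d
import std.encoding;

/// Hypothetical helper: encodes text with a scheme chosen by name at run time.
/// Code points the scheme cannot represent become its replacement sequence.
ubyte[] encodeWith(string schemeName, dstring text)
{
    auto scheme = EncodingScheme.create(schemeName);
    ubyte[] result;
    ubyte[8] buffer;   // assumed large enough for one encoded code point
    foreach (dchar c; text)
    {
        if (scheme.canEncode(c))
        {
            auto n = scheme.encode(c, buffer[]);
            result ~= buffer[0 .. n];
        }
        else
        {
            result ~= scheme.replacementSequence;
        }
    }
    return result;
}

/// Hypothetical helper: decodes bytes with a scheme chosen by name,
/// mapping malformed sequences to U+FFFD.
dstring decodeWith(string schemeName, const(ubyte)[] bytes)
{
    auto scheme = EncodingScheme.create(schemeName);
    dstring result;
    while (bytes.length != 0)
    {
        dchar c = scheme.safeDecode(bytes);   // consumes from the front
        result ~= (c == INVALID_SEQUENCE) ? cast(dchar) 0xFFFD : c;
    }
    return result;
}

void main()
{
    assert(encodeWith("utf-8", "A\u20AC"d) == [0x41, 0xE2, 0x82, 0xAC]);
    assert(encodeWith("iso-8859-1", "\u00E9"d) == [0xE9]);
    assert(encodeWith("ascii", "A\u20AC"d) == [0x41, 0x3F]);   // '?' replaces the euro sign

    ubyte[] utf8 = [0x41, 0xE2, 0x82, 0xAC];
    assert(decodeWith("utf-8", utf8) == "A\u20AC"d);
}
```

        create throws an EncodingException for a name that has not been registered, so callers handling user-supplied names may want to catch it.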

class EncodingSchemeASCII: std.encoding.EncodingScheme;
    EncodingScheme to handle ASCII

    This scheme recognises the following names: "ANSI_X3.4-1968", "ANSI_X3.4-1986", "ASCII", "IBM367", "ISO646-US", "ISO_646.irv:1991", "US-ASCII", "cp367", "csASCII", "iso-ir-6", "us"

class EncodingSchemeLatin1: std.encoding.EncodingScheme;
    EncodingScheme to handle Latin-1

    This scheme recognises the following names: "CP819", "IBM819", "ISO-8859-1", "ISO_8859-1", "ISO_8859-1:1987", "csISOLatin1", "iso-ir-100", "l1", "latin1"

class EncodingSchemeWindows1252: std.encoding.EncodingScheme;
    EncodingScheme to handle Windows-1252

    This scheme recognises the following names: "windows-1252"

class EncodingSchemeUtf8: std.encoding.EncodingScheme;
    EncodingScheme to handle UTF-8

    This scheme recognises the following names: "UTF-8"

class EncodingSchemeUtf16Native: std.encoding.EncodingScheme;
    EncodingScheme to handle UTF-16 in native byte order

    This scheme recognises the following names: "UTF-16LE" (little-endian architecture only), "UTF-16BE" (big-endian architecture only)

class EncodingSchemeUtf32Native: std.encoding.EncodingScheme;
    EncodingScheme to handle UTF-32 in native byte order

    This scheme recognises the following names: "UTF-32LE" (little-endian architecture only), "UTF-32BE" (big-endian architecture only)