
Unicode and UTF8


What is Unicode?

Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
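Both kinds of conflict are easy to reproduce. The sketch below uses three legacy code pages that ship with Python (latin-1, cp1251 and cp437); they are picked only for illustration and are not mentioned in the text above.

```python
# One number, two characters: the byte 0xE9 read under two legacy encodings.
raw = bytes([0xE9])
as_latin1 = raw.decode("latin-1")  # Western European code page
as_cp1251 = raw.decode("cp1251")   # Cyrillic code page
print(as_latin1, as_cp1251)

# One character, two numbers: U+00E9 written under two legacy encodings.
e_acute = "\u00e9"
in_latin1 = e_acute.encode("latin-1")
in_cp437 = e_acute.encode("cp437")
print(in_latin1.hex(), in_cp437.hex())
```

A file written under one of these code pages and read under another keeps its bytes but silently changes its characters, which is the corruption risk described above.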

Unicode is changing all that!

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many others. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646. It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, and the availability of tools supporting it, are among the most significant recent global software technology trends.

Incorporating Unicode into client-server or multi-tiered applications and websites offers significant cost savings over the use of legacy character sets. Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering. It allows data to be transported through many different systems without corruption.

 

UTF8

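UTF8 belongs to the same family as Unicode; the difference lies in how characters are turned into bytes. ("Unicode encoding" below means the fixed-width 16-bit form, UCS-2, which only covers characters in the Basic Multilingual Plane.)
The first thing to notice is that the size of a UTF8-encoded character varies, whereas in the 16-bit Unicode encoding every character has the same size.

Start with the Unicode encoding: the English letter "a" and the Chinese character "好" take up the same amount of space once encoded, two bytes each.

In UTF8 the same two characters take up different amounts of space: "a" is one byte and "好" is three bytes.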
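The sizes can be checked with Python's built-in codecs. This is a small sketch: `encoded_sizes` is a helper name made up here, and the fixed-width 16-bit form is modeled with the `utf-16-be` codec, which is an assumption about what "Unicode encoding" means in this post.

```python
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (bytes in the 16-bit form, bytes in UTF-8) for one character."""
    fixed = ch.encode("utf-16-be")  # no byte-order mark; 2 bytes for BMP characters
    variable = ch.encode("utf-8")   # 1 to 4 bytes depending on the code point
    return len(fixed), len(variable)

for ch in ("a", "好"):
    print(ch, encoded_sizes(ch))
```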

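Now let's look at how UTF8 encoding works:
  Letters, together with the common symbols on a keyboard, fit in seven binary bits, and a byte has eight bits, so UTF8 uses a single byte for letters and those symbols. But once we are handed a single encoded byte, how do we know what it is part of? It could be the one byte of an English letter, or one of the three bytes of a Chinese character. That is why UTF8 has marker bits!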

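  When the value fits in 7 bits, one byte is used: 0*******   The leading 0 is the marker bit, and the remaining seven bits are exactly enough for ASCII 0-127.

  When the value needs 8 to 11 bits, two bytes are used: 110***** 10******   The 110 in the first byte and the 10 in the second byte are the marker bits.

  When the value needs 12 to 16 bits, three bytes are used: 1110**** 10****** 10******    As above, the 1110 in the first byte and the 10 in the second and third bytes are marker bits, and the remaining 16 bits are enough to represent the commonly used Chinese characters.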

  The pattern continues:
  Four bytes: 11110*** 10****** 10****** 10******
  Five bytes: 111110** 10****** 10****** 10****** 10******
  Six bytes: 1111110* 10****** 10****** 10****** 10****** 10******
  The five- and six-byte forms come from the original UTF8 design. RFC 3629 now limits UTF8 to four bytes, which already covers every code point up to U+10FFFF.
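The marker bits are what let a decoder tell, from a single byte, which role that byte plays. Here is a minimal sketch of that check; `classify_byte` and its labels are names invented for this example, and it treats the five- and six-byte lead patterns as invalid, following the four-byte limit in RFC 3629.

```python
def classify_byte(b: int) -> str:
    """Name the role of one UTF-8 byte by looking only at its leading bits."""
    if b & 0b1000_0000 == 0b0000_0000:
        return "single"        # 0*******: a complete one-byte character
    if b & 0b1100_0000 == 0b1000_0000:
        return "continuation"  # 10******: second or later byte of a sequence
    if b & 0b1110_0000 == 0b1100_0000:
        return "lead-2"        # 110*****: starts a two-byte sequence
    if b & 0b1111_0000 == 0b1110_0000:
        return "lead-3"        # 1110****: starts a three-byte sequence
    if b & 0b1111_1000 == 0b1111_0000:
        return "lead-4"        # 11110***: starts a four-byte sequence
    return "invalid"           # longer lead patterns, 0xFE and 0xFF

print([classify_byte(b) for b in "好".encode("utf-8")])
```

Because a continuation byte can never be mistaken for the start of a character, a reader that lands in the middle of a stream can skip forward to the next non-continuation byte and resynchronize.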

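Got it?
The payload bits are filled in from the low-order end to the high-order end: the lowest six bits of the number go into the last byte, the next six bits into the byte before it, and whatever is left goes into the first byte.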
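The low-to-high filling order can be written down directly. The function below is a sketch for illustration, not a replacement for a real codec: `utf8_encode` is a made-up name, it stops at four bytes as modern UTF8 does, and it does not reject the surrogate range U+D800 to U+DFFF the way a strict encoder would.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by filling continuation bytes from the low bits up."""
    if cp < 0x80:
        return bytes([cp])           # 0*******
    if cp < 0x800:
        n, lead = 2, 0b1100_0000     # 110*****
    elif cp < 0x10000:
        n, lead = 3, 0b1110_0000     # 1110****
    elif cp < 0x110000:
        n, lead = 4, 0b1111_0000     # 11110***
    else:
        raise ValueError("code point out of range")
    tail = []
    for _ in range(n - 1):
        tail.append(0b1000_0000 | (cp & 0b0011_1111))  # lowest six bits go last
        cp >>= 6
    # Whatever is left of cp fits in the free bits of the lead byte.
    return bytes([lead | cp] + tail[::-1])

print(utf8_encode(ord("好")).hex())
```

Comparing the result with `"好".encode("utf-8")` is a quick way to confirm that the sketch agrees with Python's own encoder.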

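Now let's look at some examples!

In the UTF8 binary column, the marker bits are the leading 0, 110 or 1110 of the first byte and the leading 10 of every following byte. The remaining bits carry the Unicode value.

Unicode (hex)   Unicode (binary)     UTF8 (binary)                 UTF8 (hex)   UTF8 bytes
B               00001011             00001011                      B            1
9D              00010011101          11000010 10011101             C2 9D        2
A89E            10101000 10011110    11101010 10100010 10011110    EA A2 9E     3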
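The three example values can be recomputed with Python's built-in encoder, so each row can be checked independently. `table_row` is a helper name invented for this sketch.

```python
def table_row(cp: int) -> tuple[str, str, int]:
    """Return (UTF-8 bits, UTF-8 hex, byte count) for one code point."""
    encoded = chr(cp).encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in encoded)
    return bits, encoded.hex(), len(encoded)

for cp in (0x0B, 0x9D, 0xA89E):
    bits, hex_form, size = table_row(cp)
    print(f"U+{cp:04X}  {bits}  {hex_form}  {size}")
```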
