`
Joard
  • 浏览: 28124 次
  • 性别: Icon_minigender_1
  • 来自: 重庆
最近访客 更多访客>>
文章分类
社区版块
存档分类
最新评论

UTF-8 and GBK

阅读更多

Unicode Byte1 Byte2 Byte3 Byte4 example
U+0000-U+007F 0xxxxxxx


'$' U+0024
→ 00100100
→ 0x24
U+0080-U+07FF 110yyyxx 10xxxxxx

'¢' U+00A2
→ 11000010,10100010
→ 0xC2,0xA2
U+0800-U+FFFF 1110yyyy 10yyyyxx 10xxxxxx
'€' U+20AC
→ 11100010,10000010,10101100
→ 0xE2,0x82,0xAC
U+10000-U+10FFFF 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx  U+10ABCD
→ 11110100,10001010,10101111,10001101
→ 0xF4,0x8A,0xAF,0x8D

binaryhexdecimal notes
00000000-01111111 00-7F 0-127 US-ASCII (single byte)
10000000-10111111 80-BF 128-191 Second, third, or fourth byte of a multi-byte sequence
11000000-11000001 C0-C1 192-193 Overlong encoding: start of a 2-byte sequence, but code point <= 127
11000010-11011111 C2-DF 194-223 Start of 2-byte sequence
11100000-11101111 E0-EF 224-239 Start of 3-byte sequence
11110000-11110100 F0-F4 240-244 Start of 4-byte sequence
11110101-11110111 F5-F7 245-247 Restricted by RFC 3629: start of 4-byte sequence for codepoint above 10FFFF
11111000-11111011 F8-FB 248-251 Restricted by RFC 3629: start of 5-byte sequence
11111100-11111101 FC-FD 252-253 Restricted by RFC 3629: start of 6-byte sequence
11111110-11111111 FE-FF 254-255 Invalid: not defined by original UTF-8 specification

UTF-8理论上最大支持六字节的字节组。并且二进制码以10开头的字节都是后续字节,而非头字节。



GBK Encoding Ranges range byte 1 byte 2 code points characters GB 18030 GBK 1.0 Codepage 936 GB 2312 total:

23,940 21,897 21,886 21,791 7,445
Level GBK/1 A1–A9 A1–FE 846 728 717 702 682
Level GBK/2 B0–F7 A1–FE 6,768 6,763 6,763 6,763
Level GBK/3 81–A0 40–FE except 7F 6,080 6,080 6,080
Level GBK/4 AA–FE 40–A0 except 7F 8,160 8,160 8,080
Level GBK/5 A8–A9 40–A0 except 7F 192 166 166
user-defined AA–AF A1–FE 564
user-defined F8–FE A1–FE 658
user-defined A1–A7 40–A0 except 7F 672
在GBK里不会存在以80或FF开头的字节组

因此
“一下”
GBK: D2 BB CF C2
UTF-8:E4 B8 80 E4 B8 8B
将UTF-8的字节组按GBK解码时
E4 B8虽然不是汉字,但也是合法的GBK编码
80 无效GBK码,会被转为3F(显示为?号)
E4 B8 同上
8B 非有效单字节GBK(00-7F),会被转为3F
0
0
分享到:
评论

相关推荐

    myEclipse乱码解决办法

    一、将整个project设置编码UTF-8(UTF-8可以最大的支持国际化) windows-&gt;Preferences-&gt;general-&gt;Workspace-&gt;Text file encoding-&gt;Other框中的Text file encoding改为UTF-8。 二、对java源文件编码设置为UTF-8. ...

    Java文件处理工具类--FileUtil

    FIXME How to judge UTF-8 and GBK, the * correct code should be: FileReader fr = new FileReader(new * InputStreamReader(fileName, "ENCODING")); Might let the user select the * encoding would be ...

    商业源码-编程源码-DEDECMS v5.X FOR UCv1.0 UTF8 and GBK(通行证修正).zip

    商业源码-编程源码-DEDECMS v5.X FOR UCv1.0 UTF8 and GBK(通行证修正).zip

    vimrc带详细说明配置文件and插件包

    set fencs=usc-bom,utf-8,gb18030,gbk,gb2312,big5,cp936,euc-jp,euc-kr,latin1, set nocompatible source $vimruntime/vimrc_example.vim source $vimruntime/mswin.vim behave mswin "切换提示语言(解决调试窗口...

    Python 面试题汇总及答案详解完整版

    13 :ascii、unicode、utf-8、gbk 区别 14:字节码和机器码的区别 15:三元运算写法和应用场景 16:Python3 和 Python2 的区别 17:用一行代码实现数值交换 18:Python3 和 Python2 中 int 和 long 区别 19:xrange 和...

    WERAnalysis.tar.gz

    支持utf-8编码和GBK编码。 1 ./WERAnalysis.out filter /home/support/deploy/test_input/standard_answer.txt /home/support/deploy/test_input/recognize_input.txt /home/support/deploy/test_output/ a.比对的...

    快速解决pandas.read_csv()乱码的问题

    1.设置encoding=’gbk’或者encoding=’utf-8’。pandas.read_csv(‘data.csv’,encoding=’gbk’) 2.如果设置encoding直接报错的话 解决方法是:用记事本打开csv文件,另存为设置编码为utf-8,然后重新读取文件设置...

    MADEDIT 多标签

    Unicode(UTF-8, UTF-16/32 with Little or Big Endian), Big5, GBK and S-JIS etc. * Supports Unicode CJK Ext-B. * If users input a character that is not supported by current encoding, this character will...

    sublime 编译sass时CompatibilityError

    使用sublime编译sass/scss文件时,出现错误Encoding::CompatibilityError: incompatible character encodings: GBK and UTF-8 本文用以找出错误原因,解决方案

    leetcode题库-dataStructureForC:考研数据结构基础代码C语言实现

    leetcode题库 ...-finput-charset=UTF-8 $infile -o $outfile 或者c文件使用gb2312编码 本仓库创建于2019年6月26日 When I wrote this, only God and I understood what I was doing. Now, only God knows.

    eclipse文件编码设置、转换原理与实用工具

    批量转换文件的二进制编码(用新的文件编码重写文件),如从gbk到utf-8,免除逐个文件全选、复制、右键、属性、改文本文件编码、粘贴、保存之苦(该转换是根据编码设置文件进行转换的,因此更加安全); c.结合上述...

    php强制下载文件函数

    public function down() { $id = $this-&gt;_get('id'); $M = M("downloads");... $filename = iconv('UTF-8','GBK',$data['filename']); $savename = $data['savename']; $myfile = $data[url] ? $data[url

    python pands实现execl转csv 并修改csv指定列的方法

    # -*- coding: utf-8 -*- import os import pandas as pd import numpy as np #from os import sys def appendStr(strs): return "BOQ" + strs def addBOQ(dirs, csv_file): data = pd.read_csv(os.path.join(dirs...

    vim插件打包

    set fileencodings=utf-8,gb2312,gbk,gb18030,latin1,usc-bom,cp936,big5,euc-jp,euc-kr set termencoding=utf-8 set encoding=utf-8 set hls "检索时高亮显示匹配项 "set foldmethod=syntax "代码折叠 "}} "conf ...

    pandas按若干个列的组合条件筛选数据的方法

    # -*- coding: utf-8 -*- """ Created on Wed Nov 29 10:46:31 2017 @author: wq """ import pandas as pd #input.csv是那个大文件,有很多很多行 df1 = pd.read_csv(u'input.csv', encoding='gbk') #加encoding=...

    Linux系统unzip解压后中文名乱码解决方法.docx

    # -*- coding: utf-8 -*- import os import sys import zipfile print("Processing File " + sys.argv[1]) file = zipfile.ZipFile(sys.argv[1], 'r') for name in file.namelist(): utf8name = name.decode('gbk'...

    PHP常用工具函数小结【移除XSS攻击、UTF8与GBK编码转换等】

    分享给大家供大家参考,具体如下: ... CR(0a) and LF(0b) and TAB(9) are allowed // this prevents some character re-spacing such as // note that you have to handle splits with \n, \r, and \t later sinc

    SublimeText2-文本编辑器-Ubuntu-插件大全

    3.4.2. GBK to UTF8 10 3.4.3. SideBar Enhancements 10 3.4.4. Clipboard History 10 3.4.5. SublimeREPL 10 3.4.6. PlainTasks 10 3.4.7. Open Folder 11 3.4.8. RenameTab 11 3.4.9. Browser Refresh 11 3.4.10. ...

    用pandas按列合并两个文件的实例

    直接上图,图文并茂,相信你很快就知道要干什么。...# -*- coding: utf-8 -*- """ Created on Wed Nov 29 16:02:05 2017 @author: wq """ import pandas as pd df1 = pd.read_csv(u'input.csv', encoding='gbk') df2 =

    PHPExcel压缩包

    $objPHPExcel-&gt;getActiveSheet()-&gt;setCellValue($charlist[$j].($key+1), mb_convert_encoding($k, "UTF-8", "GBK")); $j++; } } $j=0; } foreach($value as $k=&gt;$v){ if($j){ //echo $...

Global site tag (gtag.js) - Google Analytics