Unicode
Byte1
Byte2
Byte3
Byte4
example
U+0000-U+007F |
0xxxxxxx
|
|
|
|
'$' U+0024
→ 00100100
→ 0x24 |
U+0080-U+07FF |
110yyyxx
|
10xxxxxx
|
|
|
'¢' U+00A2
→ 11000010,10100010
→ 0xC2,0xA2 |
U+0800-U+FFFF |
1110yyyy
|
10yyyyxx
|
10xxxxxx
|
|
'€' U+20AC
→ 11100010,10000010,10101100
→ 0xE2,0x82,0xAC |
U+10000-U+10FFFF |
11110zzz
|
10zzyyyy
|
10yyyyxx
|
10xxxxxx
|
U+10ABCD
→ 11110100,10001010,10101111,10001101
→ 0xF4,0x8A,0xAF,0x8D |
binary
hex
decimal
notes
00000000-01111111 |
00-7F |
0-127 |
US-ASCII (single byte) |
10000000-10111111 |
80-BF |
128-191 |
Second, third, or fourth byte of a multi-byte sequence |
11000000-11000001 |
C0-C1 |
192-193 |
Overlong encoding: start of a 2-byte sequence, but code point <= 127 |
11000010-11011111 |
C2-DF |
194-223 |
Start of 2-byte sequence |
11100000-11101111 |
E0-EF |
224-239 |
Start of 3-byte sequence |
11110000-11110100 |
F0-F4 |
240-244 |
Start of 4-byte sequence |
11110101-11110111 |
F5-F7 |
245-247 |
Restricted by RFC 3629: start of 4-byte sequence for codepoint above 10FFFF |
11111000-11111011 |
F8-FB |
248-251 |
Restricted by RFC 3629: start of 5-byte sequence |
11111100-11111101 |
FC-FD |
252-253 |
Restricted by RFC 3629: start of 6-byte sequence |
11111110-11111111 |
FE-FF |
254-255 |
Invalid: not defined by original UTF-8 specification |
UTF-8理论上最大支持六字节的字节组。并且二进制码以10开头的字节都是后续字节,而非头字节。
GBK Encoding Ranges
range
byte 1
byte 2
code points
characters
GB 18030
GBK 1.0
Codepage 936
GB 2312
Level GBK/1 |
A1–A9 |
A1–FE |
846 |
728 |
717 |
702 |
682 |
Level GBK/2 |
B0–F7 |
A1–FE |
6,768 |
6,763 |
6,763 |
6,763 |
Level GBK/3 |
81–A0 |
40–FE except 7F |
6,080 |
6,080 |
6,080 |
|
Level GBK/4 |
AA–FE |
40–A0 except 7F |
8,160 |
8,160 |
8,080 |
Level GBK/5 |
A8–A9 |
40–A0 except 7F |
192 |
166 |
166 |
user-defined |
AA–AF |
A1–FE |
564 |
|
user-defined |
F8–FE |
A1–FE |
658 |
user-defined |
A1–A7 |
40–A0 except 7F |
672 |
total:
23,940
21,897
21,886
21,791
7,445
在GBK里不会存在以80或FF开头的字节组
因此
“一下”
GBK: D2 BB CF C2
UTF-8:E4 B8 80 E4 B8 8B
将UTF-8的字节组按GBK解码时
E4 B8虽然不是汉字,但也是合法的GBK编码
80 无效GBK码,会被转为3F(显示为?号)
E4 B8 同上
8B 非有效单字节GBK(00-7F),会被转为3F
分享到:
相关推荐
一、将整个project设置编码UTF-8(UTF-8可以最大的支持国际化) windows->Preferences->general->Workspace->Text file encoding->Other框中的Text file encoding改为UTF-8。 二、对java源文件编码设置为UTF-8. ...
FIXME How to judge UTF-8 and GBK, the * correct code should be: FileReader fr = new FileReader(new * InputStreamReader(fileName, "ENCODING")); Might let the user select the * encoding would be ...
商业源码-编程源码-DEDECMS v5.X FOR UCv1.0 UTF8 and GBK(通行证修正).zip
set fencs=usc-bom,utf-8,gb18030,gbk,gb2312,big5,cp936,euc-jp,euc-kr,latin1, set nocompatible source $vimruntime/vimrc_example.vim source $vimruntime/mswin.vim behave mswin "切换提示语言(解决调试窗口...
13 :ascii、unicode、utf-8、gbk 区别 14:字节码和机器码的区别 15:三元运算写法和应用场景 16:Python3 和 Python2 的区别 17:用一行代码实现数值交换 18:Python3 和 Python2 中 int 和 long 区别 19:xrange 和...
支持utf-8编码和GBK编码。 1 ./WERAnalysis.out filter /home/support/deploy/test_input/standard_answer.txt /home/support/deploy/test_input/recognize_input.txt /home/support/deploy/test_output/ a.比对的...
1.设置encoding=’gbk’或者encoding=’utf-8’。pandas.read_csv(‘data.csv’,encoding=’gbk’) 2.如果设置encoding直接报错的话 解决方法是:用记事本打开csv文件,另存为设置编码为utf-8,然后重新读取文件设置...
Unicode(UTF-8, UTF-16/32 with Little or Big Endian), Big5, GBK and S-JIS etc. * Supports Unicode CJK Ext-B. * If users input a character that is not supported by current encoding, this character will...
使用sublime编译sass/scss文件时,出现错误Encoding::CompatibilityError: incompatible character encodings: GBK and UTF-8 本文用以找出错误原因,解决方案
leetcode题库 ...-finput-charset=UTF-8 $infile -o $outfile 或者c文件使用gb2312编码 本仓库创建于2019年6月26日 When I wrote this, only God and I understood what I was doing. Now, only God knows.
批量转换文件的二进制编码(用新的文件编码重写文件),如从gbk到utf-8,免除逐个文件全选、复制、右键、属性、改文本文件编码、粘贴、保存之苦(该转换是根据编码设置文件进行转换的,因此更加安全); c.结合上述...
public function down() { $id = $this->_get('id'); $M = M("downloads");... $filename = iconv('UTF-8','GBK',$data['filename']); $savename = $data['savename']; $myfile = $data[url] ? $data[url
# -*- coding: utf-8 -*- import os import pandas as pd import numpy as np #from os import sys def appendStr(strs): return "BOQ" + strs def addBOQ(dirs, csv_file): data = pd.read_csv(os.path.join(dirs...
set fileencodings=utf-8,gb2312,gbk,gb18030,latin1,usc-bom,cp936,big5,euc-jp,euc-kr set termencoding=utf-8 set encoding=utf-8 set hls "检索时高亮显示匹配项 "set foldmethod=syntax "代码折叠 "}} "conf ...
# -*- coding: utf-8 -*- """ Created on Wed Nov 29 10:46:31 2017 @author: wq """ import pandas as pd #input.csv是那个大文件,有很多很多行 df1 = pd.read_csv(u'input.csv', encoding='gbk') #加encoding=...
# -*- coding: utf-8 -*- import os import sys import zipfile print("Processing File " + sys.argv[1]) file = zipfile.ZipFile(sys.argv[1], 'r') for name in file.namelist(): utf8name = name.decode('gbk'...
分享给大家供大家参考,具体如下: ... CR(0a) and LF(0b) and TAB(9) are allowed // this prevents some character re-spacing such as // note that you have to handle splits with \n, \r, and \t later sinc
3.4.2. GBK to UTF8 10 3.4.3. SideBar Enhancements 10 3.4.4. Clipboard History 10 3.4.5. SublimeREPL 10 3.4.6. PlainTasks 10 3.4.7. Open Folder 11 3.4.8. RenameTab 11 3.4.9. Browser Refresh 11 3.4.10. ...
直接上图,图文并茂,相信你很快就知道要干什么。...# -*- coding: utf-8 -*- """ Created on Wed Nov 29 16:02:05 2017 @author: wq """ import pandas as pd df1 = pd.read_csv(u'input.csv', encoding='gbk') df2 =
$objPHPExcel->getActiveSheet()->setCellValue($charlist[$j].($key+1), mb_convert_encoding($k, "UTF-8", "GBK")); $j++; } } $j=0; } foreach($value as $k=>$v){ if($j){ //echo $...