Miscellaneous notes on encoding issues in Linux, Python, and related tools

wjm251

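Background:

Every Python source file declares coding=cp936 in its header.
The locale is set to zh_CN.GBK, and the samba configuration file contains no charset-related settings.
A user copied a folder with a Chinese name, "中国", onto the machine through a samba share. On the Linux side, os.path.exists("中国") then returned False, even though a manual check showed that the folder really was there.
Calling os.listdir(".") in a Python shell always showed this folder as '\xe4\xb8\xad\xe5\x9b\xbd' (its UTF-8 bytes), whatever the shell's locale was, whereas on Windows the two characters "中国" look like this: '\xd6\xd0\xb9\xfa' (their GBK bytes).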
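
As a quick check of the two byte strings above, here is a minimal sketch. It is written for Python 3, where text is str and encoded data is bytes, while the article itself uses Python 2; the variable names are illustrative only.

```python
# Encode the same folder name under the two encodings discussed above.
name = "中国"

utf8_bytes = name.encode("utf-8")  # the form os.listdir showed on the Linux side
gbk_bytes = name.encode("gbk")     # the form the same characters take on Windows

print(utf8_bytes)
print(gbk_bytes)

# Same text, different byte sequences, so a byte-wise comparison fails.
print(utf8_bytes == gbk_bytes)
```

A lookup that sends the GBK bytes to a file system holding the UTF-8 bytes is therefore asking for a different name.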

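Whatever LANG is set to on the Linux system, the file name has to be passed as a str object encoded in UTF-8 for os.path.exists to work correctly. (When the locale is zh_CN.utf-8, the u"path" form also works, presumably because Python converts it internally.)
The same holds in Java: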
String f = "/home/xxx/模板";
File fl = new File(f);
System.out.println(fl.exists());

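I wrote these few lines of Java and compiled them to a class file.
If the console locale is set to zh_CN.utf-8, the program prints true; with the locale set to zh_CN.GBK it prints false (the folder really does exist).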
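Conclusion:
The encoding of "file names" on Linux (some replies also claimed that the name stored in the inode is UTF-8 encoded)
has nothing to do with sys.getdefaultencoding() or sys.getfilesystemencoding().
Paths passed to the Linux file APIs must be UTF-8 encoded; alternatively, for portability, use the u"path" form, which requires the locale to be zh_CN.utf-8.
----------- Confirmed later: the encoding of a file name on Linux (or rather, of the name as stored somewhere such as the inode) is determined by the locale in effect when the file was created. So the "模板" folder above was created under a UTF-8 locale. When working in a console, also make sure the console encoding matches the locale, otherwise the names will not be recognized correctly.
javaeye thread: http://www.iteye.com/topic/702140%231559919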
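
The confirmed conclusion can be reproduced with a small experiment. The sketch below assumes Python 3 on Linux with a file system that stores names as raw bytes; exists_under_encodings is a hypothetical helper, not something from the original post.

```python
import os
import tempfile

def exists_under_encodings(name, created_with, encodings):
    """Create a directory whose name is `name` encoded with `created_with`,
    then look it up again under each encoding in `encodings`."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        base = os.fsencode(tmp)  # work with bytes paths throughout
        os.mkdir(os.path.join(base, name.encode(created_with)))
        for enc in encodings:
            candidate = os.path.join(base, name.encode(enc))
            results[enc] = os.path.exists(candidate)
    return results

results = exists_under_encodings("中国", "utf-8", ["utf-8", "gbk"])
print(results)
```

Only the lookup whose bytes match the bytes used at creation time succeeds, which is the behavior described for os.path.exists and for the Java File.exists() call.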

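mount options: codepage is the encoding used by the other side (that is, the code page of the machine being mounted), and iocharset is the encoding used locally (the locale setting of the current Linux console). A Windows FAT file system is typically mounted with -o codepage=cp936,iocharset=utf8
-------------- The reason is easy to see once you think about it. In any case, with these options both sides can read and write normally.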
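
Conceptually, the two options describe a conversion between the name as the mounted side encodes it and the name as seen locally. The sketch below only models that idea in Python 3; it is not how the kernel driver is implemented, and transcode_name is a hypothetical helper.

```python
def transcode_name(raw_name, codepage, iocharset):
    """Convert a file name from the remote encoding to the local one."""
    text = raw_name.decode(codepage)  # interpret the bytes using the remote code page
    return text.encode(iocharset)     # re-encode for the local side

remote_name = b'\xd6\xd0\xb9\xfa'     # "中国" as encoded under cp936
local_name = transcode_name(remote_name, "cp936", "utf-8")
print(local_name)
```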

samba
In the global section of the configuration file smb.conf, the following settings are available (from http://yumi.ztu.edu.ua/docs/samba30/unicode.html):
unix charset
    This is the charset used internally by your operating system. The default is UTF-8, which is fine for most systems, which covers all characters in all languages. The default in previous Samba releases was ASCII.

display charset
    This is the charset Samba will use to print messages on your screen. It should generally be the same as the unix charset.

dos charset
    This is the charset Samba uses when communicating with DOS and Windows 9x/Me clients. It will talk unicode to all newer clients. The default depends on the charsets you have installed on your system. Run testparm -v | grep "dos charset" to see what the default is on your system.

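rcsmb restart restarts samba.
It seems that when nothing is set, samba defaults to UTF-8 (possibly also depending on the locale), which explains the problem described at the beginning.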

Ways to query the various encodings from Python
import sys
import locale
import codecs

# Preferred encoding of the current locale, and its normalized codec name.
print('locale.getpreferredencoding(): %s' % locale.getpreferredencoding())
print('codecs.lookup(locale.getpreferredencoding()).name: %s' % codecs.lookup(locale.getpreferredencoding()).name)

print('locale.getdefaultlocale(): %s' % (locale.getdefaultlocale(),))
print('default encoding, sys.getdefaultencoding(): %s' % sys.getdefaultencoding())

# The name of the encoding used to convert Unicode filenames into system file names,
# or None if the system default encoding is used. The result depends on the operating system.
print('file system encoding, sys.getfilesystemencoding(): %s' % sys.getfilesystemencoding())
print('terminal input encoding, sys.stdin.encoding: %s' % sys.stdin.encoding)
print('terminal output encoding, sys.stdout.encoding: %s' % sys.stdout.encoding)
Default encoding of the source code: declared at the top of the file with # -*- coding: utf-8 -*-

Corresponding types in several languages
C++                    Java      Python 2.x
char[] or string       byte[]    str
wchar_t[] or wstring   String    unicode
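
To make the two rows concrete, the following sketch contrasts a text string with its encoded byte string. It uses Python 3 names (str for text, bytes for encoded data), which correspond to unicode and str in the Python 2.x column above; the variable names are illustrative.

```python
text = "中国"             # text string: unicode in Python 2, str in Python 3
raw = text.encode("gbk")  # byte string: str in Python 2, bytes in Python 3

char_count = len(text)    # counts characters
byte_count = len(raw)     # counts bytes

# Decoding with the encoding that produced the bytes restores the text.
round_trip = raw.decode("gbk")

# Decoding with a different encoding fails for these bytes.
try:
    raw.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(char_count, byte_count, round_trip == text, wrong_codec_failed)
```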

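On Windows, MBCS (Multi-Byte Character Set), also labeled ANSI (after the American National Standards Institute), refers to the GBK family on Simplified Chinese systems and to Big5 on Traditional Chinese systems. The term MBCS is best understood as describing a format that mixes characters of different byte widths, rather than prescribing one specific encoding.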

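The locale module is part of Python's internationalization and localization support library. It provides a standard way to handle operations that may depend on the user's language or region, such as formatting currency, comparing strings for sorting, and handling dates and times. It does not cover translation (see the gettext module) or Unicode encoding.

Because the locale setting can be changed for the whole application, libraries are advised to avoid changing it and to let the application set it once.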
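
The division of labor described above might look like the following sketch, assuming Python 3. init_locale and current_encoding are hypothetical names: the first stands for application startup code, the second for a library function that only reads the setting.

```python
import locale

def init_locale():
    """Application entry point: adopt the user's locale once, at startup."""
    try:
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale that is not installed;
        # keep whatever locale is already in effect.
        pass

def current_encoding():
    """Library-style helper: read the locale's encoding without changing it."""
    return locale.getpreferredencoding(False)

init_locale()
encoding = current_encoding()
print(encoding)
```

Passing False to getpreferredencoding keeps the helper from calling setlocale itself, so the library side stays read-only.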

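limodou, author of Ulipad and Uliweb and administrator of CpyUG:
Of course. A program you casually write that processes Chinese may well break when someone abroad runs it, and a program written abroad may break when you use it on Chinese text. To avoid this you have to check Unicode handling and the environment: for example, auto-detect UTF-8 files, get the operating system's current encoding with locale.getdefaultlocale(), get the Python environment's encoding with sys.getdefaultencoding, get the default file system encoding with sys.getfilesystemencoding, set the locale with locale.setlocale (locale.setlocale(locale.LC_ALL,"zh_CN.UTF-8")), and so on.
Localization and internationalization are not simple, and never were. Only an internationally used encoding such as UTF-8 really simplifies things, but you still cannot guarantee that your users' environment matches yours, unless you accept a fairly small user base.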
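
One of the checks mentioned in the quote, auto-detecting UTF-8 data, could be sketched as follows in Python 3. decode_text is a hypothetical helper, and the strategy (try strict UTF-8 first, then fall back to a given or locale-preferred encoding) is an assumption, not limodou's actual implementation.

```python
import locale

def decode_text(data, fallback=None):
    """Decode bytes as UTF-8 when they are valid UTF-8, else use a fallback."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: use the given encoding, or the locale's preferred one.
        encoding = fallback or locale.getpreferredencoding(False)
        return data.decode(encoding, errors="replace")

from_utf8 = decode_text("中国".encode("utf-8"), fallback="gbk")
from_gbk = decode_text("中国".encode("gbk"), fallback="gbk")
print(from_utf8, from_gbk)
```

This is a heuristic: some byte sequences are valid in both encodings, so short inputs can be misidentified.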
