How to read and write files in UTF-8 encoding with C++ fopen and CFile

How to write UTF-8 file with fprintf in C++

http://stackoverflow.com/questions/10028750/how-to-write-utf-8-file-with-fprintf-in-c

 

You shouldn't need to set your locale or set any special modes on the file if you just want to use fprintf. You simply have to use UTF-8 encoded strings.

#include <cstdio>
#include <codecvt>
#include <locale>
#include <string>

int main() {
    // Convert the wide (UTF-16 on Windows) literal to UTF-8 bytes.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>,wchar_t> convert;
    std::string utf8_string = convert.to_bytes(L"кошка 日本国");

    if(FILE *f = fopen("tmp","w")) {
        fprintf(f,"%s\n",utf8_string.c_str());
        fclose(f);
    }
}

 

Save the program as UTF-8 with signature (a BOM) or as UTF-16. Don't use UTF-8 without signature, otherwise VS won't produce the right string literal. The file written by the program will contain the UTF-8 version of that string. Or you can do:

#include <cstdio>

int main() {
    if(FILE *f = fopen("tmp","w")) {
        fprintf(f,"%s\n","кошка 日本国");
        fclose(f);
    }
}

 

In this case you must save the file as UTF-8 without signature, because you want the compiler to think the source encoding is the same as the execution encoding... This is a bit of a hack that relies on the compiler's, IMO, broken behavior.
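If you would rather not depend on how the compiler decodes the source file at all, one option is to spell the UTF-8 bytes out with hex escapes. The following is only a sketch: the names kUtf8Text and write_utf8_sample are made up for illustration, and it opens the file with "wb" (an assumption, not what the examples above do) so the bytes on disk are exactly the bytes in the literal.

```cpp
#include <cstdio>

// UTF-8 bytes for the sample text, written as hex escapes so the literal
// does not depend on the source-file encoding.
static const char kUtf8Text[] =
    "\xD0\xBA\xD0\xBE\xD1\x88\xD0\xBA\xD0\xB0"  // U+043A U+043E U+0448 U+043A U+0430
    " "
    "\xE6\x97\xA5\xE6\x9C\xAC\xE5\x9B\xBD";     // U+65E5 U+672C U+56FD

// Hypothetical helper: writes the text plus a newline to `path`.
// Returns true on success. "wb" avoids newline translation.
bool write_utf8_sample(const char* path) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    bool ok = std::fprintf(f, "%s\n", kUtf8Text) > 0;
    return std::fclose(f) == 0 && ok;
}
```

Calling write_utf8_sample("tmp") should produce the same file as the examples above, whatever encoding the source file is saved in.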

You can do basically the same thing with any of the other APIs for writing narrow characters to a file, but note that none of these methods work for writing UTF-8 to the Windows console. Because the C runtime and/or the console is a bit broken, you can only write UTF-8 directly to the console by calling SetConsoleOutputCP(65001) and then using one of the puts family of functions.
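The console case could be sketched roughly like this. The function name put_utf8_line is hypothetical, and the Windows-specific call is guarded so the same code also builds elsewhere, where the terminal is simply assumed to expect UTF-8.

```cpp
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: prints one UTF-8 encoded line to stdout.
// On Windows the console output code page is switched to UTF-8 (65001)
// first. Elsewhere the bytes are written unchanged.
bool put_utf8_line(const char* utf8) {
#ifdef _WIN32
    // The result is ignored on purpose: the call fails when the process
    // has no console, and the bytes can still be written in that case.
    SetConsoleOutputCP(CP_UTF8);
#endif
    return std::puts(utf8) >= 0;
}
```

Note that the code page change outlives the call. If that matters, the previous value can be read with GetConsoleOutputCP beforehand and restored afterwards.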

If you want to use wide characters instead of narrow characters then locale based methods and setting modes on file descriptors could come into play.

#include <cstdio>
#include <fcntl.h>
#include <io.h>

int main() {
    if(FILE *f = fopen("tmp","w")) {
        _setmode(_fileno(f), _O_U8TEXT);
        // %ls is the standard conversion for a wide string argument.
        fwprintf(f,L"%ls\n",L"кошка 日本国");
        fclose(f);
    }
}

 

#include <fstream>
#include <codecvt>
#include <locale>

int main() {
    if(auto f = std::wofstream("tmp")) {
        f.imbue(std::locale(std::locale(),
                new std::codecvt_utf8_utf16<wchar_t>)); // assumes wchar_t is UTF-16
        f << L"кошка 日本国\n";
    }
}

 

The first example uses wstring_convert from C++11, but any other method of obtaining a UTF-8 encoding works too, e.g. WideCharToMultiByte. The last example uses a C++11 codecvt facet for which there's no built-in, pre-C++11 replacement. The other two examples don't use C++11.
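As an illustration of "any other method", here is a small hand-written encoder. It is a sketch rather than a replacement for a real library: the name to_utf8 is invented for this example, it takes UTF-32 input instead of wchar_t, and it substitutes U+FFFD for invalid code points. It may also be useful because wstring_convert and the <codecvt> header are deprecated since C++17.

```cpp
#include <string>

// Hypothetical helper: encodes UTF-32 code points as UTF-8.
// Surrogates (U+D800..U+DFFF) and values above U+10FFFF become U+FFFD.
std::string to_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
        if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The resulting std::string can be passed to fprintf with %s exactly like utf8_string in the first example.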

 

How to Read/Write UTF8 text files in C?

http://stackoverflow.com/questions/21737906/how-to-read-write-utf8-text-files-in-c

Instead of

fprintf(fout,"%c ",character);

 

use

fprintf(fout,"%c",character);

 

The second fprintf() has no space after %c, and that space is what was causing out.txt to display weird characters. The reason is that fgetc() retrieves a single byte (the same size as an ASCII character), not a whole UTF-8 character, so the space ended up between the bytes of every multi-byte character. Since UTF-8 is ASCII compatible, English characters were still written to the file just fine.
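The difference between bytes and characters can be checked with a short sketch like the one below. The function name count_code_points is made up for illustration, and it assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 characters by skipping continuation
// bytes, which always have the bit pattern 10xxxxxx.
// Assumes the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For plain ASCII text the result equals size(). For text with Cyrillic or CJK characters it is smaller, which is why a byte-by-byte loop with fgetc() sees more "characters" than there really are.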

putchar(character) outputs the bytes sequentially without the extra space between every byte, so the original UTF-8 sequence remained intact. To see what I'm talking about, try

while((character=fgetc(fin))!=EOF){
    putchar(character);
    printf(" "); // This mimics what you are doing when you write to out.txt
    fprintf(fout,"%c ",character);
}

 

If you want to write UTF-8 characters with the space between them to out.txt, you would need to handle the variable length encoding of a UTF-8 character.

#include <stdio.h>
#include <stdlib.h>

/* The first byte of a UTF-8 character
 * indicates how many bytes are in
 * the character, so only check that
 */
int numberOfBytesInChar(unsigned char val) {
    if (val < 128) {
        return 1;
    } else if (val < 224) {
        return 2;
    } else if (val < 240) {
        return 3;
    } else {
        return 4;
    }
}

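int main(){
    FILE *fin;
    FILE *fout;
    int character;
    fin = fopen("in.txt", "r");
    fout = fopen("out.txt","w");
    if (fin == NULL || fout == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    while( (character = fgetc(fin)) != EOF) {
        /* Take the length from the lead byte once; the continuation bytes
         * read inside the loop must not change the loop bound. */
        int length = numberOfBytesInChar((unsigned char)character);
        for (int i = 0; i < length - 1 && character != EOF; i++) {
            putchar(character);
            fprintf(fout, "%c", character);
            character = fgetc(fin);
        }
        if (character == EOF) {
            break; /* input ended in the middle of a character */
        }
        putchar(character);
        printf(" ");
        fprintf(fout, "%c ", character);
    }
    fclose(fin);
    fclose(fout);
    printf("\nFile has been created...\n");
    return 0;
}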
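The same idea can be tried on an in-memory string, which makes it easier to test than the file version. This is a sketch with invented names (utf8_sequence_length, space_out). Unlike numberOfBytesInChar above, it reports continuation bytes and impossible lead bytes as invalid instead of guessing a length.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length of the UTF-8 sequence introduced by `lead`,
// or 0 if `lead` is a continuation byte or cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;             // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;   // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;   // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;   // 11110xxx
    return 0;
}

// Hypothetical helper: returns a copy of `utf8` with a space after every
// character. Invalid lead bytes are copied through as single bytes, and
// a sequence cut off by the end of the input is copied as far as it goes.
std::string space_out(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(utf8[i]));
        if (len == 0) len = 1;
        out.append(utf8, i, len);  // append() clamps at the end of utf8
        out += ' ';
        i += len;
    }
    return out;
}
```

Reading the whole file into a std::string and writing space_out(contents) to out.txt would give the same result as the loop above for well-formed input.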

 

UTF-8, CString and CFile? (C++, MFC)

http://stackoverflow.com/questions/2318481/utf-8-cstring-and-cfile-c-mfc

When you output data you need to do (this assumes you are compiling in Unicode mode, which is highly recommended):

CString russianText = L"Привет мир";

CFile yourFile(_T("yourfile.txt"), CFile::modeWrite | CFile::modeCreate);

CT2CA outputString(russianText, CP_UTF8);
yourFile.Write(outputString, ::strlen(outputString));

If _UNICODE is not defined (you are working in multi-byte mode instead), you need to know what code page your input text is in and convert it to something you can use. This example shows working with Russian text that is in UTF-16 format, saving it to UTF-8:

// Example 1: convert from Russian text in UTF-16 (note the "L"
// in front of the string), into UTF-8.
CW2A russianTextAsUtf8(L"Привет мир", CP_UTF8);
yourFile.Write(russianTextAsUtf8, ::strlen(russianTextAsUtf8));

More likely, your Russian text is in some other code page, such as KOI8-R. In that case, you need to convert from the other code page into UTF-16. Then convert the UTF-16 into UTF-8. You cannot convert directly from KOI8-R to UTF-8 using the conversion macros because they always try to convert narrow text to the system code page. So the easy way is to do this:

// Example 2: convert from Russian text in KOI8-R (code page 20866)
// to UTF-16, and then to UTF-8. Conversions between UTFs are
// lossless.
CA2W russianTextAsUtf16("\xf0\xd2\xc9\xd7\xc5\xd4 \xcd\xc9\xd2", 20866);
CW2A russianTextAsUtf8(russianTextAsUtf16, CP_UTF8);
yourFile.Write(russianTextAsUtf8, ::strlen(russianTextAsUtf8));

You don't need a BOM (it's optional; I wouldn't use it unless there was a specific reason to do so).
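Since the BOM is optional, code that reads UTF-8 files back may meet files with or without one. A minimal way to cope, assuming the file contents have already been read into a std::string, might look like this (strip_utf8_bom is a name invented for the example):

```cpp
#include <string>

// Hypothetical helper: removes a leading UTF-8 byte order mark
// (the bytes EF BB BF) if present. Returns true when one was removed.
bool strip_utf8_bom(std::string& text) {
    static const char kBom[] = "\xEF\xBB\xBF";
    if (text.compare(0, 3, kBom) == 0) {
        text.erase(0, 3);
        return true;
    }
    return false;
}
```

The remaining bytes can then be converted as usual, for instance with the CA2CT conversion when working with CString.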

Make sure you read this: http://msdn.microsoft.com/en-us/library/87zae4a3(VS.80).aspx. If you incorrectly use CT2CA (for example, using the assignment operator) you will run into trouble. The linked documentation page shows examples of how to use and how not to use it.

Further information:

  • The C in CT2CA indicates const. I use it when possible, but some conversions only support the non-const version (e.g. CW2A).
  • The T in CT2CA indicates that you are converting from an LPCTSTR. Thus it will work whether your code is compiled with the _UNICODE flag or not. You could also use CW2A (where W indicates wide characters).
  • The A in CT2CA indicates that you are converting to an "ANSI" (8-bit char) string.
  • Finally, the second parameter to CT2CA indicates the code page you are converting to.

To do the reverse conversion (from UTF-8 to LPCTSTR), where russianText is now a narrow string holding UTF-8 text, you could do:

CString myString(CA2CT(russianText, CP_UTF8));

In this case, we are converting from an "ANSI" string in UTF-8 format, to an LPCTSTR. The LPCTSTR is always assumed to be UTF-16 (if _UNICODE is defined) or the current system code page (if _UNICODE is not defined).

 

 
