【Q&A】stl容器去除重复的元素

cloudtech

浏览: 4616214 次
性别:
来自: 武汉

最近访客更多访客>>

u012363178

devcang

robinjim

JasonWo

博主相关

博客

微博

相册

留言

关于我

文章分类

全部博客 (6378)

社区版块

存档分类

2013-03 ( 15)
2013-02 ( 44)
2013-01 ( 243)
更多存档...

经常会有这种情况，例如有几个不同来源的词表，先后放入到一个容器中（如vector中），要求去除容器中重复的词条。通常就是两步：1. 排序；2. 去除相邻的重复节点。对于2，我从前都是用文本编辑器（ultraedit）去重，没有用程序做过。现在写了下，主要是理解unique函数。关键代码如下：

// 1. sort the items
sort (ItemVec.begin(), ItemVec.end());
// 2. remove the duplicated items adjacent
ItemVec.erase (unique (ItemVec.begin(), ItemVec.end()), ItemVec.end());

unique函数干了这样一件事情：它将容器中的元素忽略掉重复的部分，将后面的元素copy到前面（就地操作），并返回修改后序列的Last位置。时间复杂度是O（n）——只遍历一遍；而且就地操作，没有用额外存储空间。看了下它的实现源代码，如下：

template<class _FwdIt> inline
_FwdIt _Unique(_FwdIt _First, _FwdIt _Last)
{ // remove each matching previous
_DEBUG_RANGE(_First, _Last);
for (_FwdIt _Firstb; (_Firstb = _First) != _Last && ++_First != _Last; )
if (*_Firstb == *_First)
{ // copy down
for (; ++_First != _Last; )
if (!(*_Firstb == *_First))
*++_Firstb = *_First;
return (++_Firstb);
}
return (_Last);
}

上面代码中_First是当前的指针，_Firstb是指前一个指针，b表示before。有意思的是，上面有两个嵌套的for循环，不过细细分析，应该是只遍历一次。还有比较费心思的地方是各种“++”的使用，即在什么时候移动指针。真是写库函数啊，能省一行是一行；不过牺牲了点可读性。

像以往一样，还是要实地测试一下。测试函数读取的文件如下：

abc
aa
ab
sa
aa
fe
asf
ab
az
de
sf

其中有两处重复，分别是“aa”和“ab”。测试代码如下：

bool RmDupItemTxt (const char* sFileIn, const char* sFileOut)
{
ifstream in (sFileIn);
ofstream out (sFileOut);
if (!in || !out)
{
cerr << "Can not open the file" << endl;
return false;
}

// load the content
vector<string> ItemVec;
string sItem;
while (in >> sItem)
ItemVec.push_back (sItem);

// sort and remove the duplicated items
sort (ItemVec.begin(), ItemVec.end());
ItemVec.erase (unique (ItemVec.begin(), ItemVec.end()), ItemVec.end());

// dump it into txt file
copy (ItemVec.begin(), ItemVec.end(), ostream_iterator<string> (out, "\n"));

return true;
}

最后产生的结果文件如下：

aa
ab
abc
asf
az
de
fe
sa
sf

可以看到结果正确。

分享到：

【重新上本科】再写main函数 | [Hadoop源码解读]（四）MapReduce篇之Co ...

2012-08-24 22:54
浏览 767
评论(0)
查看更多

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论