
Sorting, Merging, and Splitting Files in the Shell

 
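Using sort
The sort command has a long list of options; only the commonly used ones are covered here.
Options
The general form of the sort command is:

sort -cmu -o output_file [other options] +pos1 +pos2 input_files

The main options are:
-c Test whether the file is already sorted.
-m Merge two sorted files.
-u Remove all duplicate lines.
-o Name of the output file that stores the sort result.
Other options are:
-b Ignore leading blanks when sorting on fields.
-n Sort the field numerically.
-t Field separator; use it when fields are separated by a character other than a space or tab.
-r Reverse the sort order or comparison.
+n n is a field number. Sorting starts at this field. (This zero-based +pos form is obsolete; the -k option described later replaces it.)
-n n is a field number. The sort key stops before this field, so the field is ignored in the comparison; generally used together with +n.
pos1 Has the form m.n, where m is the field number and n is the number of characters to skip before the key starts; for example, 4.6 means sort on the fifth field, starting at its seventh character.
Saving the output
The -o option saves the sort result, but redirection works as well. The following example saves the result to results.out:

$ sort video.txt > results.out

How sort starts
By default, sort treats a space or a run of spaces as the field separator. To use a different separator, give the -t option. When sort runs, it first checks whether -t was set; if so, it uses that character to split each record into field 0, field 1, and so on; if not, whitespace is used. By default sort orders lines by the whole line, unless field numbers are specified.
Below is a listing of the file video.txt, which holds last quarter's rental figures for a rental store. The fields are: (1) title, (2) supplier area code, (3) rentals this quarter, (4) rentals this year. The field separator is a colon, so this example needs the -t option. The file looks like this:
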
[sam@chenwy sam]$ cat video.txt
Boys in Company C:HK:192:2192
Alien:HK:119:1982
The Hill:KL:63:2972
Aliens:HK:532:4892
Star Wars:HK:301:4102
A Few Good Men:KL:445:5851
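Toy Story:HK:239:3972

How sort numbers fields
An important fact about sort is that, with the +pos notation, it refers to the first field as field 0, the second field as field 1, and so on. sort can also use the whole line as the sort key.
Is the file already sorted?
How can you tell whether a file is already sorted? With 30 lines you can simply look, but what about 400 lines? Use sort -c to have sort report whether the file is in sorted order.
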
[sam@Linux_chenwy sam]$ sort -c video.txt
sort: video.txt:2: disorder: Alien:HK:119:1982
The result shows that the file is not sorted. Sort it and test again:

[sam@Linux_chenwy sam]$ sort -t:  video.txt >video2.txt
[sam@Linux_chenwy sam]$ sort -c video2.txt
[sam@Linux_chenwy sam]$
Returning straight to the prompt means the file is sorted. It would be nicer, though, if a successful test printed a message.
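One way to get such a message is to test the exit status of sort -c, which is zero when the input is already in order. The sketch below is only an illustration: check_sorted is a hypothetical helper name, the three sample lines are taken from video.txt, and LC_ALL=C is set as an assumption so that the ordering does not depend on the local collation rules.

```shell
#!/bin/sh
# Sketch: wrap `sort -c` so that a successful check also prints a message.
# check_sorted is a hypothetical helper, not part of sort itself.
export LC_ALL=C                      # byte-order collation, for predictable results
workdir=$(mktemp -d)
printf '%s\n' 'Boys in Company C:HK:192:2192' \
              'Alien:HK:119:1982' \
              'The Hill:KL:63:2972' > "$workdir/video.txt"

check_sorted() {
    # sort -c exits 0 when the input is already sorted, non-zero otherwise
    if sort -c "$1" 2>/dev/null; then
        echo "sorted"
    else
        echo "not sorted"
    fi
}

before=$(check_sorted "$workdir/video.txt")
sort "$workdir/video.txt" > "$workdir/video2.txt"
after=$(check_sorted "$workdir/video2.txt")
echo "video.txt: $before; video2.txt: $after"
rm -rf "$workdir"
```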
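Basic sort
The most basic form is sort filename, which sorts on the first field (sort key 0). In fact sort compares the fields of each line as it reads the file; the result returned here is ordered by the first field.
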
[sam@Linux_chenwy sam]$ sort -t: video.txt
A Few Good Men:KL:445:5851
Alien:HK:119:1982
Aliens:HK:532:4892
Boys in Company C:HK:192:2192
Star Wars:HK:301:4102
The Hill:KL:63:2972
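Toy Story:HK:239:3972
Reversing the sort
To reverse the sort result, use the -r option. A reverse sort is handy when reading through large log files. Below is the reversed result of sorting on field 0.
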
[sam@Linux_chenwy sam]$ sort -t: -r video.txt
Toy Story:HK:239:3972
The Hill:KL:63:2972
Star Wars:HK:301:4102
Boys in Company C:HK:192:2192
Aliens:HK:532:4892
Alien:HK:119:1982
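A Few Good Men:KL:445:5851
Sorting on a specified field
Sometimes you need to sort on the second field only (sort key 1). To reorder the listing by supplier area code, use +1, which means sort on sort key 1. In the example below all the area codes are sorted on sort key 1; note that the fields for sort keys 2 and 3 take part in the ordering as well.
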
[sam@Linux_chenwy sam]$ sort -t: +1 video.txt
Alien:HK:119:1982
Boys in Company C:HK:192:2192
Toy Story:HK:239:3972
Star Wars:HK:301:4102
Aliens:HK:532:4892
A Few Good Men:KL:445:5851
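The Hill:KL:63:2972
The leading lines all have HK in the second field, so they are ordered by the third field: 119, 192, 239, 301, 532, and then 445 and 63 for KL. That field is compared character by character rather than as a number, which is why 445 comes before 63. To get a true numeric order, the key has to be declared numeric.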
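The difference between text and numeric comparison is easy to see in isolation. The following sketch feeds the third-field values of video.txt to sort with and without -n; LC_ALL=C and the use of paste to print each result on one line are choices made for this illustration only.

```shell
#!/bin/sh
# Sketch: the same values ordered as text and as numbers.
export LC_ALL=C
values='192
119
63
532
301
445
239'

# Text comparison: characters are compared left to right, so 63 sorts last
text_order=$(printf '%s\n' "$values" | sort | paste -s -d ' ' -)
# Numeric comparison: values are compared by magnitude
numeric_order=$(printf '%s\n' "$values" | sort -n | paste -s -d ' ' -)

echo "text:    $text_order"
echo "numeric: $numeric_order"
```

In text order 63 lands after 532 because the character 6 is greater than 5; with -n it moves to the front.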
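Numeric field sorting
Likewise, to sort on sort key 3 (the fourth field), use +3. Because this is a numeric field, the -n option can be applied. The following example shows the command and result for sorting by yearly rentals:
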
[sam@Linux_chenwy sam]$ sort -t: +3n video.txt
Alien:HK:119:1982
Boys in Company C:HK:192:2192
The Hill:KL:63:2972
Toy Story:HK:239:3972
Star Wars:HK:301:4102
Aliens:HK:532:4892
A Few Good Men:KL:445:5851
Without n, sorting on the third field gives the following:

[sam@Linux_chenwy sam]$ sort -t: +2 video.txt
Alien:HK:119:1982
Boys in Company C:HK:192:2192
Toy Story:HK:239:3972
Star Wars:HK:301:4102
A Few Good Men:KL:445:5851
Aliens:HK:532:4892
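The Hill:KL:63:2972
Here sort looks only at the first digit of each value in the third field, then at the second digit, and so on. With n the values are compared as numbers:
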
[sam@Linux_chenwy sam]$ sort -t: +2n video.txt
The Hill:KL:63:2972
Alien:HK:119:1982
Boys in Company C:HK:192:2192
Toy Story:HK:239:3972
Star Wars:HK:301:4102
A Few Good Men:KL:445:5851
Aliens:HK:532:4892
Numeric field in reverse order:

[sam@Linux_chenwy sam]$ sort -t: +2nr video.txt
Aliens:HK:532:4892
A Few Good Men:KL:445:5851
Star Wars:HK:301:4102
Toy Story:HK:239:3972
Boys in Company C:HK:192:2192
Alien:HK:119:1982
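The Hill:KL:63:2972
Unique sorting
Sometimes the original file contains duplicate lines. The -u option performs a unique sort that removes them. In this example the Aliens line occurs twice; the file with the duplicate appended looks like this:
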
[sam@Linux_chenwy sam]$ echo "Aliens:HK:532:4892" >> video.txt
[sam@Linux_chenwy sam]$ cat video.txt
Boys in Company C:HK:192:2192
Alien:HK:119:1982
The Hill:KL:63:2972
Aliens:HK:532:4892
Star Wars:HK:301:4102
A Few Good Men:KL:445:5851
Toy Story:HK:239:3972
Aliens:HK:532:4892
Use the -u option to remove the duplicate line; no other option is needed, sort takes care of it.

[sam@Linux_chenwy sam]$ sort -u video.txt
A Few Good Men:KL:445:5851
Alien:HK:119:1982
Aliens:HK:532:4892
Boys in Company C:HK:192:2192
Star Wars:HK:301:4102
The Hill:KL:63:2972
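Toy Story:HK:239:3972
Other ways to sort with -k
sort offers another way to specify sort keys. With the -k option the first field (sort key) is numbered 1, not 0; do not confuse this with the numbering used above. -k can be combined with other options and can also give the character position at which the key starts within the field.
Use -k4 to sort by yearly rentals:
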
[sam@Linux_chenwy sam]$ sort -t: -k4 video.txt
A alien:HK:119:1982
Alien:HK:119:1982
Boys in Company C:HK:192:2192
A the Hill:KL:63:2972
The Hill:KL:63:2972
Toy Story:HK:239:3972
Star Wars:HK:301:4102
Aliens:HK:532:4892
Aliens:HK:532:4892
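A Few Good Men:KL:445:5851
Ordering sort keys with -k
The order of the sort keys can be specified. To sort first on field 4 and then on field 1, use -k4 -k1. The result can also be reversed so that the highest yearly rental appears on the first line:
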
[sam@Linux_chenwy sam]$ sort -t: -k4 -k1  video.txt
AAlien:HK:119:1982
Alien:HK:119:1982
Boys in Company C:HK:192:2192
The Hill:KL:63:2972
Toy Story:HK:239:3972
Star Wars:HK:301:4102
Aliens:HK:532:4892
A Few Good Men:KL:445:5851
                                                                                                                                                                                                                                                                          
[sam@Linux_chenwy sam]$ sort -t: -k4 -k1 -r  video.txt
A Few Good Men:KL:445:5851
Aliens:HK:532:4892
Star Wars:HK:301:4102
Toy Story:HK:239:3972
The Hill:KL:63:2972
Boys in Company C:HK:192:2192
Alien:HK:119:1982
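AAlien:HK:119:1982
Does -r reverse only the fourth field here? No: given as a separate option, -r applies to every key that has no ordering flag of its own, so the whole listing comes out reversed.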
                                                                                                                                                                                                                                                                          
[sam@Linux_chenwy sam]$ sort -t: -k1  video.txt
AAlien:HK:119:1982
A Few Good Men:KL:445:5851
Alien:HK:119:1982
Aliens:HK:532:4892
Boys in Company C:HK:192:2192
Star Wars:HK:301:4102
The Hill:KL:63:2972
Toy Story:HK:239:3972
                                                                                                                                                                                                                                                                          
[sam@Linux_chenwy sam]$ sort -t: -k1 -k4  video.txt
AAlien:HK:119:1982
A Few Good Men:KL:445:5851
Alien:HK:119:1982
Aliens:HK:532:4892
Boys in Company C:HK:192:2192
Star Wars:HK:301:4102
The Hill:KL:63:2972
Toy Story:HK:239:3972
                                                                                                                                                                                                                                                                          
[sam@Linux_chenwy sam]$ sort -t: -k1 -k4 -r  video.txt
Toy Story:HK:239:3972
The Hill:KL:63:2972
Star Wars:HK:301:4102
Boys in Company C:HK:192:2192
Aliens:HK:532:4892
Alien:HK:119:1982
A Few Good Men:KL:445:5851
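AAlien:HK:119:1982
Is only the first field reversed here? Again, the separate -r reverses the complete ordering.
Now with the third field:
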
[sam@Linux_chenwy sam]$ sort -t:  +2nr -k1 -r  video.txt
Aliens:HK:532:4892
A Few Good Men:KL:445:5851
Star Wars:HK:301:4102
Toy Story:HK:239:3972
Boys in Company C:HK:192:2192
Alien:HK:119:1982
AAlien:HK:119:1982
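The Hill:KL:63:2972
Here the third field is sorted in reverse numeric order by its own nr flags, while the separate -r applies only to the -k1 key and reverses the tie-break between the two 119 lines. Running sort -t: +2nr -k1 without -r leaves that tie-break ascending.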
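The interaction between a separate -r and flags attached to a key can be checked with a small experiment. This sketch uses the -k notation with explicit end fields (-k3,3 rather than +2), four sample lines modelled on the listing above, and LC_ALL=C; all of these are assumptions of the illustration, not commands from the original session.

```shell
#!/bin/sh
# Sketch: per-key flags versus a global -r.
export LC_ALL=C
workdir=$(mktemp -d)
printf '%s\n' 'Alien:HK:119:1982' 'AAlien:HK:119:1982' \
              'The Hill:KL:63:2972' 'Aliens:HK:532:4892' > "$workdir/video.txt"

# Field 3 numeric descending; ties broken by field 1 ascending
asc_ties=$(sort -t: -k3,3nr -k1,1 "$workdir/video.txt" | cut -d: -f1 | paste -s -d, -)
# Same primary key; the tie-break key carries its own r flag
desc_ties=$(sort -t: -k3,3nr -k1,1r "$workdir/video.txt" | cut -d: -f1 | paste -s -d, -)
# A global -r is inherited only by keys without flags of their own (here -k1,1)
global_r=$(sort -t: -r -k3,3nr -k1,1 "$workdir/video.txt" | cut -d: -f1 | paste -s -d, -)

echo "$asc_ties"
echo "$desc_ties"
echo "$global_r"
rm -rf "$workdir"
```

The second and third commands give the same order, which shows that the global -r leaves the nr key untouched and only flips the tie-break on the first field.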
                                                                                                                                                                                                                                                                          
[sam@Linux_chenwy sam]$ sort -t:  +2nr -k1  video.txt
Aliens:HK:532:4892
A Few Good Men:KL:445:5851
Star Wars:HK:301:4102
Toy Story:HK:239:3972
Boys in Company C:HK:192:2192
AAlien:HK:119:1982
Alien:HK:119:1982
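The Hill:KL:63:2972
Specifying a sort sequence
You can specify the order of the sort keys, and you can use -n (n being a field number) to say which field a key should leave out. Look at the following sort command:

[sam@Linux_chenwy sam]$ sort +0 -2 +3 video.txt
This command starts sorting at field 0, stops before field 2 so that field 2 is ignored, and then sorts on field 3.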
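In -k notation the same key sequence would be written -k1,2 -k4: the first key covers fields 1 to 2 and the second key starts at field 4, so field 3 never takes part in the comparison. The sketch below assumes colon-separated data and therefore adds -t:, which the command above does not have; the three sample lines are invented so that only the skipped field and the last field differ.

```shell
#!/bin/sh
# Sketch: -k1,2 -k4 as the counterpart of the obsolete +0 -2 +3.
export LC_ALL=C
workdir=$(mktemp -d)
# Hypothetical data: fields 1 and 2 are identical on every line
printf '%s\n' 'Alien:HK:900:2000' \
              'Alien:HK:100:3000' \
              'Alien:HK:500:1000' > "$workdir/sample.txt"

# Key 1: fields 1-2 (all equal here). Key 2: field 4 to end of line.
# Field 3 is skipped, so it has no influence on the result.
third_fields=$(sort -t: -k1,2 -k4 "$workdir/sample.txt" | cut -d: -f3 | paste -s -d, -)
echo "$third_fields"
rm -rf "$workdir"
```

The lines come out ordered by the fourth field (1000, 2000, 3000), which leaves the third field in the order 500, 900, 100.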
Using pos
Another way to specify where sorting starts is the following format:

sort +field_number.characters_in
This means the key starts at field field_number, after skipping characters_in characters of that field.
For example:

[sam@chenwy sam]$ cat video.txt
Boys in Company C:HK48:192:2192
Alien:HK57:119:1982
The Hill:KL223:63:2972
Aliens:HK11:532:4892
Star Wars:HK38:301:4102
A Few Good Men:KL87:445:5851
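Toy Story:HK65:239:3972
To sort the file on just the numeric suffix of the supplier area code, use +1.2, which starts the key at the third character from the left of field 1:
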
[sam@chenwy sam]$ sort -t: +1.2 video.txt
Aliens:HK11:532:4892
The Hill:KL223:63:2972
Star Wars:HK38:301:4102
Boys in Company C:HK48:192:2192
Alien:HK57:119:1982
Toy Story:HK65:239:3972
A Few Good Men:KL87:445:5851
Compare the result with n added (strictly speaking, the area codes do not need n):

[sam@chenwy sam]$ sort -t: +1.2n video.txt
Aliens:HK11:532:4892
Star Wars:HK38:301:4102
Boys in Company C:HK48:192:2192
Alien:HK57:119:1982
Toy Story:HK65:239:3972
A Few Good Men:KL87:445:5851
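With the -k notation, where both fields and characters are counted from 1, +1.2 becomes -k2.3. The sketch below repeats the two sorts on the same seven lines; the explicit end of key (,2), LC_ALL=C, and the cut/paste step that prints only the area codes are additions made for this illustration.

```shell
#!/bin/sh
# Sketch: character offsets in -k notation (field 2, starting at character 3).
export LC_ALL=C
workdir=$(mktemp -d)
printf '%s\n' 'Boys in Company C:HK48:192:2192' 'Alien:HK57:119:1982' \
              'The Hill:KL223:63:2972' 'Aliens:HK11:532:4892' \
              'Star Wars:HK38:301:4102' 'A Few Good Men:KL87:445:5851' \
              'Toy Story:HK65:239:3972' > "$workdir/video.txt"

# Text comparison of the suffix: 223 sorts between 11 and 38
text_codes=$(sort -t: -k2.3,2 "$workdir/video.txt" | cut -d: -f2 | paste -s -d, -)
# Numeric comparison of the suffix: 223 sorts last
numeric_codes=$(sort -t: -k2.3,2n "$workdir/video.txt" | cut -d: -f2 | paste -s -d, -)

echo "text:    $text_codes"
echo "numeric: $numeric_codes"
rm -rf "$workdir"
```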
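By using sort and tsort instead of more complex Perl or Awk solutions, you can save time and avoid headaches. Jacek Artymiak shows you how.

Although you can write advanced sorting applications in Perl or Awk, it is not always necessary, and it is often a headache. The sort command gives you most of what you need with less effort: it can sort lines from multiple files, merge files, and even check whether sorting is needed at all. You can specify a sort key (the part of the line used for comparison) or leave it out, in which case sort compares whole lines.
So, if you want to sort the password file, you can use the commands below. Note that you cannot send the output straight to the input file, because that would destroy it; that is why the output goes to a temporary file, which is then renamed to /etc/passwd.
1. Listing 1. Simple sort
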
$ su -
# sort /etc/passwd > /etc/passwd-new
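# mv /etc/passwd-new /etc/passwd
2. More about sort and tsort
Read the GNU manual pages for these commands, or type man sort or man tsort at the command line in a new terminal window, to look up the options in the man or info pages.
To reverse the sort order, use the -r option. You can also use -u to suppress the printing of identical lines.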
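As an alternative to the temporary file in Listing 1, the -o option may name the input file itself, because sort reads all of its input before it opens the output. The sketch below combines that with -r and -u on a small stand-in file; passwd.sample and its contents are invented for the illustration, so nothing here touches the real /etc/passwd.

```shell
#!/bin/sh
# Sketch: sort a file in place with -o, reversed and without duplicate lines.
export LC_ALL=C
workdir=$(mktemp -d)
# Hypothetical sample data with one duplicated line
printf '%s\n' 'root:x:0:0' 'daemon:x:1:1' 'bin:x:2:2' 'daemon:x:1:1' \
    > "$workdir/passwd.sample"

# Unlike `sort file > file`, -o can safely write back to the input file
sort -r -u -o "$workdir/passwd.sample" "$workdir/passwd.sample"

result=$(paste -s -d, "$workdir/passwd.sample")
echo "$result"
rm -rf "$workdir"
```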
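3. A very useful feature of sort is its ability to sort by field keys. A field is a text string separated from other fields by some character. The fields in /etc/passwd, for example, are separated by colons (:), so you can sort /etc/passwd by user ID, group ID, comment field, home directory, or shell. To do so, use the -t option followed by the separator character, then -k with the number of the field where the key starts, followed by the number of the last field of the key.
For example,
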
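sort -t : -k 5,5 /etc/passwd
sorts the password file by the comment field, which holds the user's full name (such as "John Smith").

sort -t : -k 3,4 /etc/passwd
sorts the same file using the user ID and the group ID together. If you omit the second number, sort assumes the key runs from the given field to the end of each line. Try it and watch the difference (add the -g option if the numeric ordering looks wrong).
Note also that the transition from whitespace to non-whitespace is the default separator, so if the fields are already separated by whitespace you can leave out -t and use only -k (fields are numbered starting from 1).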
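The field-key commands above can be tried safely on a small file in passwd format. In this sketch the three accounts are invented, and -n is used in place of -g for the numeric key, which is enough for plain integer IDs; the cut/paste step only prints the login names in their sorted order.

```shell
#!/bin/sh
# Sketch: sorting passwd-style records by comment field and by numeric user ID.
export LC_ALL=C
workdir=$(mktemp -d)
# Hypothetical accounts: login:password:uid:gid:comment:home:shell
printf '%s\n' 'carol:x:1002:100:Alice Zhang:/home/carol:/bin/sh' \
              'alice:x:1000:100:Bob Young:/home/alice:/bin/sh' \
              'bob:x:1001:100:Carol Xu:/home/bob:/bin/sh' > "$workdir/passwd.sample"

# Key limited to field 5 (the comment field)
by_comment=$(sort -t: -k5,5 "$workdir/passwd.sample" | cut -d: -f1 | paste -s -d, -)
# Key limited to field 3 (the user ID), compared numerically
by_uid=$(sort -t: -k3,3n "$workdir/passwd.sample" | cut -d: -f1 | paste -s -d, -)

echo "by comment: $by_comment"
echo "by uid:     $by_uid"
rm -rf "$workdir"
```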
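5. For finer control, you can use keys with offsets. The offset is separated from the key by a dot, as in -k 1.3,5.7, which means the sort key starts at the 3rd character of the 1st field and ends at the 7th character of the 5th field (offsets are also numbered from 1). When are offsets useful? I often use them to sort Apache logs; the key-and-offset notation lets me skip the date field.
6. Another option to watch is -b, which tells sort to ignore leading blanks (spaces, tabs, and so on) and to treat the first non-blank character as the start of the sort key. With this option, offsets are also counted from the first non-blank character (this is very useful when the field separator is not whitespace and fields may contain strings that begin with blanks).
The following options modify the ordering further:
-d (use only letters, digits, and blanks in the sort key),
-f (fold case, treating lowercase and uppercase characters as the same),
-i (ignore non-printing ASCII characters),
-M (sort lines by three-letter month abbreviations: JAN, FEB, MAR, ...), and
-n (sort lines numerically, using only digits, the minus sign, and commas or another thousands separator).
These options, along with -b and -r, can be attached to a key number, in which case they apply to that key only instead of globally, with the same effect they have outside a key definition. As an example of this use, consider:
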
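sort -t: -k 4g,4 -k 3gr,3 /etc/passwd
This command sorts the passwd file by group ID and, within each group, by user ID in reverse order.
7. If the key you use cannot decide which line comes first, a further key can break the tie. To add a tie-breaker, append another -k option with a field and (optional) offset, using the same notation as for the first key;
for example,
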
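sort -k 3.4,4.5 -k 7.3,9.4 /etc/passwd
sorts lines using a key that runs from the 4th character of the 3rd field to the 5th character of the 4th field, and then breaks ties with a key that runs from the 3rd character of the 7th field to the 4th character of the 9th field.
8. The last group of options deals with input, output, and temporary files. For example, the -c option, when used as sort -c ...
cat file1 file2 file3 | sort > outfile
Alternatively, you can use this command:
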
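sort -m file1 file2 file3 > outfile
The second form has one condition: every input file must already be sorted before they are combined with sort -m. That may look like an unnecessary burden, but it actually speeds up the work and saves valuable system resources. Do not forget the -m option, and note that you can use -u here as well to suppress identical lines.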
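A minimal sketch of such a merge, with two small files that are already sorted. The file names follow the command above, while their contents are invented for the illustration; paste is used only to print each result on a single line.

```shell
#!/bin/sh
# Sketch: merging pre-sorted files with sort -m, with and without -u.
export LC_ALL=C
workdir=$(mktemp -d)
printf '%s\n' apple cherry        > "$workdir/file1"   # already sorted
printf '%s\n' banana cherry date  > "$workdir/file2"   # already sorted

# -m interleaves the sorted inputs without sorting them again
merged=$(sort -m "$workdir/file1" "$workdir/file2" | paste -s -d, -)
# -u additionally drops the line that occurs in both files
merged_unique=$(sort -m -u "$workdir/file1" "$workdir/file2" | paste -s -d, -)

echo "$merged"
echo "$merged_unique"
rm -rf "$workdir"
```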
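11. If you need a more esoteric kind of sorting, take a look at the tsort command, which performs a topological sort on a file. Listing 2 shows the difference between a topological sort and a standard sort (happybirthday.txt can be downloaded from the references).
Listing 2. Difference between topological sort and standard sort
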
$ cat happybirthday.txt
Happy Birthday to You!
Happy Birthday to You!
Happy Birthday Dear Tux!
Happy Birthday to You!
                                                                                                                                                                                                                                                                          
$ sort happybirthday.txt
Happy Birthday Dear Tux!
Happy Birthday to You!
Happy Birthday to You!
Happy Birthday to You!
                                                                                                                                                                                                                                                                          
$ tsort happybirthday.txt
Dear
Happy
to
Tux!
Birthday
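You!
Of course, this is not a very useful demonstration of tsort; it only illustrates how the output of the two commands differs.
tsort is typically used to solve a logic problem in which a total ordering has to be predicted from observed partial orderings; for example (from the tsort info page):

tsort produces output like this:
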
      a
      b
      c
      d
      e
      f
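tsort reads its input as pairs of tokens, where each pair states that the first item must come before the second. The sketch below shows one set of pairs that leads to the ordering a to f; the pairs are an assumption chosen for this illustration, picked so that only one total order is possible.

```shell
#!/bin/sh
# Sketch: deriving a total order from partial orderings with tsort.
# Each input pair "x y" means: x comes before y.
order=$(printf '%s\n' 'a b' 'c d' 'e f' 'b c' 'd e' | tsort | paste -s -d ' ' -)
echo "$order"
```

Because the pairs chain together (a before b, b before c, and so on up to f), no other ordering satisfies all of them.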

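Using head and tail with sort output

When sorting, you do not have to display the whole file or a full page just to see the first or last line of the sort result. To show only the highest yearly rental, sort on the fourth field with -k4 in reverse order, then pipe the output through head, which lets you choose how many lines to show. For the first line only, that is head -1:
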
[sam@chenwy sam]$ sort -t: -k4r video.txt
A Few Good Men:KL87:445:5851
Aliens:HK11:532:4892
Star Wars:HK38:301:4102
Toy Story:HK65:239:3972
The Hill:KL223:63:2972
Boys in Company C:HK48:192:2192
Alien:HK57:119:1982
                                                                                                                                                                                                                                                                          
[sam@chenwy sam]$ sort -t: -k4r video.txt | head -1
A Few Good Men:KL87:445:5851
                                                                                                                                                                                                                                                                          
[sam@chenwy sam]$ sort -t: -k4r video.txt | head -2
A Few Good Men:KL87:445:5851
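Aliens:HK11:532:4892
To see the lowest yearly rental, use the tail command, which works the opposite way from head: it shows the last lines of a file. 1 means the last line, 2 the last two lines, and so on, so tail -1 shows the last line. Combining the sort command above with tail gives the lowest yearly rental:
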
[sam@chenwy sam]$ sort -t: -k4r video.txt | tail -1
Alien:HK57:119:1982
                                                                                                                                                                                                                                                                          
[sam@chenwy sam]$ sort -t: -k4r video.txt | tail -2
Boys in Company C:HK48:192:2192
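Alien:HK57:119:1982
head and tail can be used to look at any large text file. head shows the start of a file; its basic form is:

head [how_many_lines_to_display] file_name
tail shows the end of a file; its basic form is:

tail [how_many_lines_to_display] file_name
If you leave out the line count, head and tail show 10 lines by default.
To see the first 20 lines of a file:

[sam@chenwy sam]$ head -20 passwd
To see the last 10 lines of a file:
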
[sam@chenwy sam]$ tail -10 passwd
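The highest and lowest values can also be picked out in a script. This sketch uses the head -n 1 and tail -n 1 spellings, which are the forms specified by POSIX, together with a numeric key on the fourth field; the three sample lines and LC_ALL=C are assumptions of the illustration.

```shell
#!/bin/sh
# Sketch: first and last line of a sorted stream with head -n and tail -n.
export LC_ALL=C
workdir=$(mktemp -d)
printf '%s\n' 'Alien:HK:119:1982' 'The Hill:KL:63:2972' \
              'A Few Good Men:KL:445:5851' > "$workdir/video.txt"

# Sort by yearly rentals (field 4), numerically, highest first
highest=$(sort -t: -k4,4nr "$workdir/video.txt" | head -n 1)
lowest=$(sort -t: -k4,4nr "$workdir/video.txt" | tail -n 1)

echo "highest: $highest"
echo "lowest:  $lowest"
rm -rf "$workdir"
```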
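Using sort output in awk
When sorting data it is often worth adding a little explanatory text to the sort result, especially for other users. awk makes this easy. Taking the lowest-rental example above, pipe the sort result into awk, remembering to use a colon as the field separator, and print a message together with the actual data.
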
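[sam@chenwy sam]$ sort -t: -r -k4 video.txt | tail -1 | awk -F: '{print "Worst rental", $1, "has been rented", $3}'
Worst rental Alien has been rented 119
Merging two sorted files
Before files are merged they must already be sorted. Merging is useful for transaction processing and for any kind of update operation.
In the following example two titles were forgotten and ended up in a separate file; they are now merged into a single file. The merge format is sort -m sorted_file1 sorted_file2. Below is the file with the two new titles, which is already sorted:
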
[sam@chenwy sam]$ cat video2.txt
Crimson Tide:134:2031
Die Hard:152:2981
Use -m +0 to merge this file into the existing sorted file video.sort, sorting on the title field. Strictly speaking +0 is not necessary, but it is added to be on the safe side.

[sam@chenwy sam]$ sort -t: -m +0 video2.txt video.txt
Boys in Company C:HK48:192:2192
Alien:HK57:119:1982
Crimson Tide:134:2031
Die Hard:152:2981
The Hill:KL223:63:2972
Aliens:HK11:532:4892
Star Wars:HK38:301:4102
A Few Good Men:KL87:445:5851
Toy Story:HK65:239:3972
Sorting system files
sort can be used to order the user names in the /etc/passwd file. Sort on the first field, the login name, then pipe the result to awk, which prints the first field.

[sam@chenwy sam]$ cat passwd | sort -t: +0 | awk -F: '{print $1}'
adm
apache
bin
chenwy
daemon
desktop
.......
sort can also be applied to the df command to print the usage column in descending order. Here is ordinary df output.

[sam@chenwy sam]$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda2              5162828   2289460   2611108  47% /
/dev/sda1               497829     13538    458589   3% /boot
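none                     99352         0     99352   0% /dev/shm
Use the -b option to ignore blanks in front of the sort field, and sort in reverse on field 4 (+4), the capacity column, to get a clear picture of how much space is left on each file system.
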
[sam@chenwy sam]$ df | sort -b -r +4
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda2              5162828   2289460   2611108  47% /
/dev/sda1               497829     13538    458589   3% /boot
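none                     99352         0     99352   0% /dev/shm
Keeping a copy of all IP addresses in a text file makes it easier to look up local addresses. As an administrator you sometimes need to sort this file. To sort IP addresses in numeric order, set the field separator to a period. Here only the last part of the address matters, so sorting starts at that field, field 3. The unsorted file looks like this:
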
[sam@chenwy sam]$ vi iplist
[sam@chenwy sam]$ cat iplist
193.132.80.123 dave tansley
193.132.80.23 HP printer 2nd floor
193.132.80.198 JJ. Peter's scanner
193.132.80.38 SPARE
193.132.80.78 P.Edron
The sorted result is:

[sam@chenwy sam]$ sort -t. +3n iplist
193.132.80.23 HP printer 2nd floor
193.132.80.38 SPARE
193.132.80.78 P.Edron
193.132.80.123 dave tansley
193.132.80.198 JJ. Peter's scanner
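When the addresses do not all share the same first three parts, each part needs its own numeric key. The sketch below is one way to write that in -k notation; the four addresses and host names are invented for the illustration, and the cut/paste step only prints the addresses in their sorted order.

```shell
#!/bin/sh
# Sketch: sorting IPv4 addresses numerically on all four parts.
export LC_ALL=C
workdir=$(mktemp -d)
# Hypothetical address list: address, then a free-text description
printf '%s\n' '193.132.80.123 dave' '193.132.80.23 printer' \
              '10.0.0.5 router' '193.132.9.200 spare' > "$workdir/iplist"

# One numeric key per part; -n reads only the leading digits of field 4,
# so the description after the address does not disturb the comparison
sorted_ips=$(sort -t. -k1,1n -k2,2n -k3,3n -k4,4n "$workdir/iplist" \
             | cut -d' ' -f1 | paste -s -d, -)
echo "$sorted_ips"
rm -rf "$workdir"
```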
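Using uniq
uniq removes or suppresses duplicate lines in a text file. It generally assumes that the file has been sorted, which is what makes the result complete.
This is not enforced, however; you can run it on any unsorted text, even on arbitrary lines.

You could think of uniq as being like the unique option of sort. To some extent that is true, but there is one important difference: the unique option of sort removes all duplicate lines, whereas uniq does not. What counts as a duplicate line? For uniq it means a line repeated in direct succession, with no other text in between. For example:
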
[sam@chenwy sam]$ cat myfile.txt
May Day
May Day
May Day
Going DOwn
May Day
May Day.
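May Day
uniq treats the first three May Day lines as duplicates, but because line 4 contains different text, it does not regard the May Day on line 5 as a duplicate of them, so that line is kept.
General format of the command:

$ uniq -u -d -c -f n input-file out-file
The options mean: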
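-u Show only lines that are not repeated.
-d Show only lines that are repeated, one copy of each.
-c Print the number of occurrences in front of each line.
-f n, where n is a number: the first n fields are ignored.
Some systems do not recognize the -f option; use -n instead. Create the file myfile.txt and run uniq on it.
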
[sam@chenwy sam]$ uniq myfile.txt
May Day
Going DOwn
May Day
May Day.
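May Day
Note that the May Day on line 5 is kept, as is the one on the last line. Running sort -u instead would return only three lines: Going DOwn, May Day, and May Day. (with the trailing period).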
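The difference between uniq and sort -u can be verified on the same seven lines. In this sketch the lines are fed from a shell variable rather than from myfile.txt, and LC_ALL=C is set so that the order of the sort -u result is predictable; both are choices made for the illustration.

```shell
#!/bin/sh
# Sketch: uniq removes only adjacent duplicates, sort -u removes all of them.
export LC_ALL=C
lines='May Day
May Day
May Day
Going DOwn
May Day
May Day.
May Day'

# uniq alone: only the run of three at the top collapses, leaving 5 lines
uniq_count=$(printf '%s\n' "$lines" | uniq | wc -l | tr -d ' ')
# sort -u: every distinct line appears exactly once
sort_u=$(printf '%s\n' "$lines" | sort -u | paste -s -d, -)
# sort first, then uniq -d: lines that occur more than once anywhere in the input
repeated=$(printf '%s\n' "$lines" | sort | uniq -d)

echo "uniq leaves $uniq_count lines"
echo "sort -u:  $sort_u"
echo "repeated: $repeated"
```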
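Consecutive duplicates
Use the -c option to show a count in front of each line, that is, how many times it occurs in a row. In this example the first May Day line occurs three times.
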
[sam@chenwy sam]$ uniq -c myfile.txt
      3 May Day
      1 Going DOwn
      1 May Day
      1 May Day.
      1 May Day
1. Non-unique lines
Use -d to show only the lines that are repeated:

[sam@chenwy sam]$ uniq -d myfile.txt
May Day
                                                                                                                                                                                                                                                                          
[sam@chenwy sam]$ uniq -u myfile.txt
Going DOwn
May Day
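May Day.
2. Testing specific fields
The -n form tests only part of a line for uniqueness. For example, -5 skips the first five fields and tests the fields after them. Fields are counted from 1.
To ignore the first field and test from the second field on, use -1 (or -f1). The following file holds a set of data in which the second field is a group code.
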
[sam@chenwy sam]$ cat parts.txt
AK123 OPP Y13
DK122 OPP Y24
EK999 OPP M2
                                                                                                                                                                                                                                                                          
[sam@chenwy sam]$ cat parts.txt
AK123 33 46 6u OPP ty yu
DK122 5h 67 y8 OPP ty yu
EK999 56 56 78 IIY ty yu
Running plain uniq on this file returns every line, because each line in the file is different.

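[sam@chenwy sam]$ uniq -c parts.txt
      1 AK123 33 46 6u OPP ty yu
      1 DK122 5h 67 y8 OPP ty yu
      1 EK999 56 56 78 IIY ty yu
If the comparison starts after field 4, the result is different. The first two lines are identical from OPP onward, so uniq collapses them into one line, while the IIY line stays separate.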
                                                                                                                                                                                                                                                                          
[sam@chenwy sam]$ uniq -f4 -c parts.txt
      2 AK123 33 46 6u OPP ty yu
      1 EK999 56 56 78 IIY ty yu
Skip 5 fields, that is, start comparing at field 6:
                                                                                                                                                                                                                                                                          
[sam@chenwy sam]$ uniq -f5 -c parts.txt
      3 AK123 33 46 6u OPP ty yu
If -f returns an error, use the older -n form instead, for example -5 in place of -f5.
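Because uniq compares only adjacent lines, -f is usually combined with a sort on the same field. The following is a minimal sketch, assuming a hypothetical variant of parts.txt in which the two OPP lines are not next to each other; the temporary directory and the line order are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical variant of parts.txt: the IIY line sits between the two OPP lines.
tmp=$(mktemp -d)
cat > "$tmp/parts.txt" <<'EOF'
AK123 33 46 6u OPP ty yu
EK999 56 56 78 IIY ty yu
DK122 5h 67 y8 OPP ty yu
EOF

# Skipping 4 fields without sorting: no two adjacent lines match, so all lines remain.
unsorted_count=$(uniq -f 4 "$tmp/parts.txt" | wc -l | tr -d ' ')

# Sort on field 5 first so equal group codes become adjacent, then skip 4 fields.
grouped=$(LC_ALL=C sort -k 5,5 "$tmp/parts.txt" | uniq -f 4 -c)
grouped_count=$(printf '%s\n' "$grouped" | wc -l | tr -d ' ')

echo "lines kept without sorting: $unsorted_count"
echo "lines kept after sorting on field 5: $grouped_count"
printf '%s\n' "$grouped"

rm -rf "$tmp"
```

LC_ALL=C is set only to make the sort order predictable across systems.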
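uniq:
After sorting, you may find that some lines are duplicated. Sometimes the duplicate information is not needed and can be removed to save disk space. The lines do not have to be sorted, but remember that uniq compares lines as it reads them and removes a duplicate only when it immediately follows an identical line. The following example shows how this works in practice:
Listing 1. Removing duplicate lines with uniq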
                                                                                                                                                                                                                                                                          
$ cat happybirthday.txt
Happy Birthday to You!
Happy Birthday to You!
Happy Birthday Dear Tux!
Happy Birthday to You!
                                                                                                                                                                                                                                                                          
$ sort happybirthday.txt
Happy Birthday Dear Tux!
Happy Birthday to You!
Happy Birthday to You!
Happy Birthday to You!
                                                                                                                                                                                                                                                                          
$ sort happybirthday.txt | uniq
Happy Birthday Dear Tux!
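Happy Birthday to You!
Warning: do not use uniq or any other tool to remove duplicate lines from files that contain financial or other important data. In such files a duplicate line almost always represents another transaction for the same amount, and removing it would cause the accounting department a lot of trouble. Don't do it!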
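The adjacency rule from Listing 1 can be checked directly. This is a small sketch that recreates happybirthday.txt in a temporary directory (the path is an assumption) and compares plain uniq, sort piped into uniq, and sort -u, which does the same job in one command.

```shell
#!/bin/sh
# Recreate the sample file from Listing 1 in a temporary directory.
tmp=$(mktemp -d)
cat > "$tmp/happybirthday.txt" <<'EOF'
Happy Birthday to You!
Happy Birthday to You!
Happy Birthday Dear Tux!
Happy Birthday to You!
EOF

# uniq alone: only the first two (adjacent) copies collapse into one.
raw_count=$(uniq "$tmp/happybirthday.txt" | wc -l | tr -d ' ')

# Sorting first makes all copies adjacent, so uniq can remove every duplicate.
sorted_count=$(sort "$tmp/happybirthday.txt" | uniq | wc -l | tr -d ' ')

# sort -u sorts and removes duplicates in a single step.
sort_u_count=$(sort -u "$tmp/happybirthday.txt" | wc -l | tr -d ' ')

echo "uniq only:   $raw_count lines"
echo "sort | uniq: $sorted_count lines"
echo "sort -u:     $sort_u_count lines"

rm -rf "$tmp"
```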
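More about uniq
This series of articles introduces the text utilities and complements the information found in the man pages and info pages. To learn more, open a new terminal window and type man uniq or info uniq, or open a new browser window and view the uniq manual page at gnu.org.
What if you want to make your work easier, for example by showing only the unique lines or only the duplicated lines? You can do that with the -u (unique) and -d (duplicate) options, for example:
Listing 2. Using the -u and -d options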
                                                                                                                                                                                                                                                                          
$ sort happybirthday.txt | uniq -u
Happy Birthday Dear Tux!
                                                                                                                                                                                                                                                                          
$ sort happybirthday.txt | uniq -d
Happy Birthday to You!
You can also get some statistics from uniq with the -c option:
Listing 3. Using the -c option
                                                                                                                                                                                                                                                                          
$ sort happybirthday.txt | uniq -uc
      1 Happy Birthday Dear Tux!
                                                                                                                                                                                                                                                                          
$ sort happybirthday.txt | uniq -dc
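      3 Happy Birthday to You!
Even if uniq compared only complete lines it would still be useful, but that is not all the command can do. Especially handy is the -f option, which is followed by the number of fields to skip. This is very useful when you look at system logs. Often some entries are repeated many times, which makes the log hard to read. Plain uniq cannot help, because every entry begins with a different timestamp. But if you tell it to skip all the time fields, your log suddenly becomes much more manageable. Try uniq -f 3 /var/log/messages and see for yourself.
There is another option, -s, which works like -f but skips the given number of characters. You can use -f and -s together; uniq skips fields first, then characters. And what if you want to compare only a preset number of characters? Try the -w option.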
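To make the effect of -f and -s concrete, here is a minimal sketch on three made-up syslog-style lines. The host names, messages and temporary path are assumptions, not taken from a real log.

```shell
#!/bin/sh
# Hypothetical log lines: same message, different timestamps, two different hosts.
tmp=$(mktemp -d)
cat > "$tmp/messages" <<'EOF'
Dec  4 11:13:01 web1 sshd: connection closed
Dec  4 11:13:05 web1 sshd: connection closed
Dec  4 11:13:09 web2 sshd: connection closed
EOF

# Plain uniq: every line differs because of the timestamp.
plain_count=$(uniq "$tmp/messages" | wc -l | tr -d ' ')

# -f 3 skips month, day and time; the two web1 lines now compare equal.
fields_count=$(uniq -f 3 "$tmp/messages" | wc -l | tr -d ' ')

# -f 3 -s 5 additionally skips the 5 characters " web1" / " web2",
# so only the message text is compared and all three lines collapse.
both_count=$(uniq -f 3 -s 5 "$tmp/messages" | wc -l | tr -d ' ')

echo "plain uniq:       $plain_count lines"
echo "uniq -f 3:        $fields_count lines"
echo "uniq -f 3 -s 5:   $both_count line"

rm -rf "$tmp"
```

The -w option is left out of the sketch because it is not part of POSIX uniq and may be missing on some systems.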
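Using join
join combines lines from two sorted text files.
Here is how join works. Take two files, file1 and file2, both already sorted. Each file contains some elements that are related to elements in the other file. Based on that relationship, join merges the two files, which is a little like updating a master file so that it contains the elements common to both files.
Fields in a text file are usually separated by spaces or tabs, but you can specify another field separator if you wish. Some systems require files to have fewer than 20 fields when used with join; to be fair, if you have more than 20 fields you should probably be using a DBMS.
To use join effectively, each input file must first be sorted.
The general format is: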
                                                                                                                                                                                                                                                                          
join [options] input-file1 input-file2
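Options:
-an   n is a number; when joining, also show the unmatched lines from file n. For example, -a1 shows unmatched lines from the first file and -a2 shows unmatched lines from the second file.
-o n.m   n is the file number and m is the field number. 1.3 means show only field 3 of file 1. Separate multiple n.m entries with commas, as in 1.3,2.1.
-j n m   n is the file number and m is the field number. Use a different field as the join field. (Current POSIX and GNU versions of join spell this -1 m and -2 m.)
-t   Field separator. Sets a separator other than space or tab. For example, -t: makes the colon the field separator.
Suppose there are two text files: one, names.txt, contains names and street addresses; the other, town.txt, contains names and towns.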
                                                                                                                                                                                                                                                                          
[sam@chenwy sam]$ cat names.txt
M.Golls 12 Hidd Rd
P.Heller The Acre
P.Willey 132 The Grove
T.Norms 84 Connaught Rd
K.Fletch 12 Woodlea
                                                                                                                                                                                                                                                                          
[sam@chenwy sam]$ cat town.txt
M.Golls Norwich NRD
P.Willey Galashiels GDD
T.Norms Brandon BSL
K.Fletch Mildenhall MAF
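K.Firt Mitryl Mdt
Joining two files
Join the two files so that each name is matched with its full address details. For example, the M.Golls record gives the address 12 Hidd Rd. The join field is the first field, the name. By default join uses the first field of each file as the join field, and here that is the name in both files: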
                                                                                                                                                                                                                                                                          
[sam@chenwy sam]$ join names.txt town.txt
M.Golls 12 Hidd Rd Norwich NRD
P.Willey 132 The Grove Galashiels GDD
T.Norms 84 Connaught Rd Brandon BSL
K.Fletch 12 Woodlea Mildenhall MAF
By default join drops the second occurrence of the join key, which here is the name field.
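As listed, names.txt and town.txt are not in sorted order (K.Fletch comes last), and some join implementations may warn about this or skip lines. A minimal sketch of the safer routine, sorting both inputs on the join field before joining, might look like this; the temporary directory and the .sorted file names are assumptions.

```shell
#!/bin/sh
# Recreate the two sample files in a temporary directory.
tmp=$(mktemp -d)
cat > "$tmp/names.txt" <<'EOF'
M.Golls 12 Hidd Rd
P.Heller The Acre
P.Willey 132 The Grove
T.Norms 84 Connaught Rd
K.Fletch 12 Woodlea
EOF
cat > "$tmp/town.txt" <<'EOF'
M.Golls Norwich NRD
P.Willey Galashiels GDD
T.Norms Brandon BSL
K.Fletch Mildenhall MAF
K.Firt Mitryl Mdt
EOF

# Sort both inputs on field 1, using the same locale that join will use.
LC_ALL=C sort -k 1,1 "$tmp/names.txt" > "$tmp/names.sorted"
LC_ALL=C sort -k 1,1 "$tmp/town.txt" > "$tmp/town.sorted"

# Join on the first field of each file (the default).
joined=$(LC_ALL=C join "$tmp/names.sorted" "$tmp/town.sorted")
joined_count=$(printf '%s\n' "$joined" | wc -l | tr -d ' ')
first_line=$(printf '%s\n' "$joined" | head -n 1)

printf '%s\n' "$joined"

rm -rf "$tmp"
```

The output contains the same four matched records as above, but in sorted order, so K.Fletch comes first.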
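1. Unmatched joins
What if a line in one file has no match in the other? In that case join needs an option, and usually -a is given for both files. The following example shows both matched and unmatched lines.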
                                                                                                                                                                                                                                                                          
[sam@chenwy sam]$ join -a1 -a2 names.txt town.txt
M.Golls 12 Hidd Rd Norwich NRD
P.Heller The Acre
P.Willey 132 The Grove Galashiels GDD
T.Norms 84 Connaught Rd Brandon BSL
K.Fletch 12 Woodlea Mildenhall MAF
K.Firt Mitryl Mdt
                                                                                                                                                                                                                                                                          
To show unmatched lines from the first file only, give -a1 alone:
[sam@chenwy sam]$ join -a1 names.txt town.txt
M.Golls 12 Hidd Rd Norwich NRD
P.Heller The Acre
P.Willey 132 The Grove Galashiels GDD
T.Norms 84 Connaught Rd Brandon BSL
K.Fletch 12 Woodlea Mildenhall MAF
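The -t and -o options are described above but not demonstrated. The sketch below uses two hypothetical colon-separated files (addr.txt and city.txt, invented for illustration and already sorted on field 1) to show how -t keeps an address containing spaces as a single field and how -o selects the output fields.

```shell
#!/bin/sh
# Hypothetical colon-separated inputs, already sorted on the name field.
tmp=$(mktemp -d)
cat > "$tmp/addr.txt" <<'EOF'
K.Fletch:12 Woodlea
M.Golls:12 Hidd Rd
P.Heller:The Acre
EOF
cat > "$tmp/city.txt" <<'EOF'
K.Fletch:Mildenhall:MAF
M.Golls:Norwich:NRD
EOF

# -t : sets the field separator.
# -o 1.1,1.2,2.2 prints file 1 field 1, file 1 field 2 and file 2 field 2.
picked=$(LC_ALL=C join -t : -o 1.1,1.2,2.2 "$tmp/addr.txt" "$tmp/city.txt")
picked_count=$(printf '%s\n' "$picked" | wc -l | tr -d ' ')
picked_first=$(printf '%s\n' "$picked" | head -n 1)

printf '%s\n' "$picked"

rm -rf "$tmp"
```

P.Heller has no entry in city.txt and no -a option is given, so that line is left out of the result.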
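Using split
split divides a large file into smaller files. Files sometimes grow very large, and splitting them first can make them easier to transfer. Tools such as vi or sort can also run into problems when a file is too large for their working buffer.
So sometimes there is no choice but to split a file into smaller pieces.
The general format of the split command is: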
                                                                                                                                                                                                                                                                          
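split -output_file-size input-filename output-filename
Here output-file-size is the number of lines to put in each output file.
If output-file-size is not given, split uses a default of 1000 lines per file. With a size of 10, a 38-line file is split into 4 files of 10, 10, 10 and 8 lines. Output files are named xaa through xzz: x is the default prefix when no output-filename is given, and aa through zz are the suffixes, assigned in order. The following example illustrates this.
For example, passwd has 38 lines: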
                                                                                                                                                                                                                                                                          
[sam@chenwy split]$ ls -l
total 8
-rw-r--r--    1 sam      sam          1649 Dec  4 11:13 passwd
-rw-rw-r--    1 sam      sam            84 Dec  4 11:19 split1
                                                                                                                                                                                                                                                                          
[sam@chenwy split]$ split -10 passwd
[sam@chenwy split]$ ls -l
total 24
-rw-r--r--    1 sam      sam          1649 Dec  4 11:13 passwd
-rw-rw-r--    1 sam      sam            84 Dec  4 11:19 split1
-rw-rw-r--    1 sam      sam           368 Dec  4 11:24 xaa
-rw-rw-r--    1 sam      sam           474 Dec  4 11:24 xab
-rw-rw-r--    1 sam      sam           495 Dec  4 11:24 xac
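-rw-rw-r--    1 sam      sam           312 Dec  4 11:24 xad
Four files were produced: the first three have 10 lines each and the last has 8. The output file names are generated automatically, in the form x[a-z][a-z].
As another example, split1 has 6 lines: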
                                                                                                                                                                                                                                                                          
[sam@chenwy split]$ cat split1
this is line1
this is line2
this is line3
this is line4
this is line5
this is line6
To split it into files of 1 line each, the command is:
                                                                                                                                                                                                                                                                          
[sam@chenwy split]$ split -1 split1
[sam@chenwy split]$ ls -l
total 32
-rw-r--r--    1 sam      sam          1649 Dec  4 11:13 passwd
-rw-rw-r--    1 sam      sam            84 Dec  4 11:19 split1
-rw-rw-r--    1 sam      sam            14 Dec  4 11:25 xaa
-rw-rw-r--    1 sam      sam            14 Dec  4 11:25 xab
-rw-rw-r--    1 sam      sam            14 Dec  4 11:25 xac
-rw-rw-r--    1 sam      sam            14 Dec  4 11:25 xad
-rw-rw-r--    1 sam      sam            14 Dec  4 11:25 xae
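-rw-rw-r--    1 sam      sam            14 Dec  4 11:25 xaf
The file has 6 lines; split put 1 line in each output file and named the files in alphabetical order. To confirm that the operation worked, look at the contents of the new files: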
                                                                                                                                                                                                                                                                          
[sam@chenwy split]$ cat xaa
this is line1
[sam@chenwy split]$ cat xaf
this is line6
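A quick way to check that nothing was lost is to concatenate the pieces in name order and compare the result with the original. The sketch below assumes a generated 38-line stand-in for passwd and a hypothetical output prefix "part."; it uses -l 10, the POSIX spelling of -10.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Build a 38-line sample file (a stand-in for passwd).
i=1
while [ "$i" -le 38 ]; do
    echo "this is line$i"
    i=$((i + 1))
done > "$tmp/data.txt"

# Split into pieces of at most 10 lines; output files are part.aa, part.ab, ...
split -l 10 "$tmp/data.txt" "$tmp/part."

piece_count=$(ls "$tmp"/part.* | wc -l | tr -d ' ')
last_lines=$(wc -l < "$tmp/part.ad" | tr -d ' ')

# The shell expands part.* in alphabetical order, which matches the split order.
if cat "$tmp"/part.* | cmp -s - "$tmp/data.txt"; then
    roundtrip=ok
else
    roundtrip=differs
fi

echo "pieces: $piece_count, lines in last piece: $last_lines, round trip: $roundtrip"

rm -rf "$tmp"
```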