
CentOS 7 (3.10.0-123.el7.x86_64) reboot problem

Related posts:
CentOS 7 (3.10.0-123.el7.x86_64) reboot problem  http://aperise.iteye.com/blog/2326082
CentOS 7 (3.10.0-327.el7.x86_64) reboot problem  http://aperise.iteye.com/blog/2425717
 


1. The problem

       A batch of newly purchased servers (2U, 2 CPUs, 6 cores per CPU, 8 × 16 GB RAM, 5 × 2 TB disks) were installed with CentOS 7, and Hadoop and Spark clusters were set up on them. Recently, while running Spark jobs, I found that after a job had run a certain number of times, one of the servers would reboot itself, at random and for no apparent reason.

 

2. Initial troubleshooting approach

       1) Since the reboot happened at the operating-system level, the first step should be to look for the cause in the OS crash logs;

       2) In any case, a Spark program has no right to reboot a server on its own; if resources were exhausted, the OS should at the very least have rejected Spark's resource requests.

 

       I contacted the system operations engineer and asked him to check whether there were any crash logs on the servers. The answer came back that there were none at all, which naively led me to wonder whether Spark had drained the servers' resources so completely that the OS was no longer even able to write a crash log.

       Since system operations is not my area of expertise, I took the engineer's reply at face value and, at least initially, did not question it (although this later proved to be a fatal misjudgment). I therefore spent a great deal of time examining the resource consumption (CPU, memory, I/O) of the Hadoop and Spark clusters. During this period I mainly used the nmon tool to capture detailed metrics from every server, ran Spark jobs aggressively to reproduce the problem, and then analyzed the nmon logs.

 

3. Capturing server performance metrics with nmon

      nmon is a free tool for analyzing AIX and Linux performance. A brief introduction to the tool follows. The version I downloaded consists of the following two files:

  • nmon_x86_64_centos6.centos6 — the nmon tool itself; it captures server resource metrics and saves them to a log named <hostname>_YYMMDD_HHMM.nmon
  • nmon analyser v40.xlsm — converts the <hostname>_YYMMDD_HHMM.nmon log above into readable Excel charts

    3.1 nmon command-line options

Enter the following commands to display nmon's usage information:
cd /home/hadoop/nmon
./nmon_x86_64_centos6 -h
The following help text is printed:
Hint: nmon_x86_64_centos6 [-h] [-s <seconds>] [-c <count>] [-f -d <disks> -t -r <name>] [-x]

-h FULL help information
Interactive-Mode:
read startup banner and type: "h" once it is running
For Data-Collect-Mode (-f)
-f spreadsheet output format [note: default -s300 -c288]
optional
-s <seconds> between refreshing the screen [default 2]
-c <number> of refreshes [default millions]
-d <disks> to increase the number of disks [default 256]
-t spreadsheet includes top processes
-x capacity planning (15 min for 1 day = -fdt -s 900 -c 96)

Version - nmon 14i

For Interactive-Mode
-s <seconds> time between refreshing the screen [default 2]
-c <number> of refreshes [default millions]
-g <filename> User Defined Disk Groups [hit g to show them]
- file = on each line: group_name <disks list> space separated
- like: database sdb sdc sdd sde
- upto 64 disk groups, 512 disks per line
- disks can appear more than once and in many groups
-b black and white [default is colour]
example: nmon_x86_64_centos6 -s 1 -c 100

For Data-Collect-Mode = spreadsheet format (comma separated values)
Note: use only one of f,F,z,x or X and make it the first argument
-f spreadsheet output format [note: default -s300 -c288]
output file is <hostname>_YYYYMMDD_HHMM.nmon
-F <filename> same as -f but user supplied filename
-r <runname> used in the spreadsheet file [default hostname]
-t include top processes in the output
-T as -t plus saves command line arguments in UARG section
-s <seconds> between snap shots
-c <number> of snapshots before nmon stops
-d <disks> to increase the number of disks [default 256]
-l <dpl> disks/line default 150 to avoid spreadsheet issues. EMC=64.
-g <filename> User Defined Disk Groups (see above) - see BBBG & DG lines
-N include NFS Network File System
-I <percent> Include process & disks busy threshold (default 0.1)
don't save or show proc/disk using less than this percent
-m <directory> nmon changes to this directory before saving to file
example: collect for 1 hour at 30 second intervals with top procs
nmon_x86_64_centos6 -f -t -r Test1 -s30 -c120

To load into a spreadsheet:
sort -A *nmon >stats.csv
transfer the stats.csv file to your PC
Start spreadsheet & then Open type=comma-separated-value ASCII file
The nmon analyser or consolidator does not need the file sorted.

Capacity planning mode - use cron to run each day
-x sensible spreadsheet output for CP = one day
every 15 mins for 1 day ( i.e. -ft -s 900 -c 96)
-X sensible spreadsheet output for CP = busy hour
every 30 secs for 1 hour ( i.e. -ft -s 30 -c 120)

Interactive Mode Commands
key --- Toggles to control what is displayed ---
h = Online help information
r = Machine type, machine name, cache details and OS version + LPAR
c = CPU by processor stats with bar graphs
l = long term CPU (over 75 snapshots) with bar graphs
m = Memory stats
L = Huge memory page stats
V = Virtual Memory and Swap stats
k = Kernel Internal stats
n = Network stats and errors
N = NFS Network File System
d = Disk I/O Graphs
D = Disk I/O Stats
o = Disk I/O Map (one character per disk showing how busy it is)
o = User Defined Disk Groups
j = File Systems
t = Top Process stats use 1,3,4,5 to select the data & order
u = Top Process full command details
v = Verbose mode - tries to make recommendations
b = black and white mode (or use -b option)
. = minimum mode i.e. only busy disks and processes

key --- Other Controls ---
+ = double the screen refresh time
- = halves the screen refresh time
q = quit (also x, e or control-C)
0 = reset peak counts to zero (peak = ">")
space = refresh screen now

Startup Control
If you find you always type the same toggles every time you start
then place them in the NMON shell variable. For example:
export NMON=cmdrvtan

Others:
a) To you want to stop nmon - kill -USR2 <nmon-pid>
b) Use -p and nmon outputs the background process pid
c) To limit the processes nmon lists (online and to a file)
Either set NMONCMD0 to NMONCMD63 to the program names
or use -C cmd:cmd:cmd etc. example: -C ksh:vi:syncd
d) If you want to pipe nmon output to other commands use a FIFO:
mkfifo /tmp/mypipe
nmon -F /tmp/mypipe &
grep /tmp/mypipe
e) If nmon fails please report it with:
1) nmon version like: 14i
2) the output of cat /proc/cpuinfo
3) some clue of what you were doing
4) I may ask you to run the debug version

Developer Nigel Griffiths
Feedback welcome - on the current release only and state exactly the problem
No warranty given or implied.

 

 

    3.2 Installing nmon on the servers

          Copy nmon_x86_64_centos6.centos6 to a directory on the server, for example /home/hadoop/nmon:
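
A rough sketch of this step (assuming the downloaded binary sits in the current directory; it is renamed here so that it matches the ./nmon_x86_64_centos6 commands used later in this post):

mkdir -p /home/hadoop/nmon
cp nmon_x86_64_centos6.centos6 /home/hadoop/nmon/nmon_x86_64_centos6
chmod +x /home/hadoop/nmon/nmon_x86_64_centos6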


 

    3.3 Capturing metrics on the servers

Run the following commands to capture the server metrics:
cd /home/hadoop/nmon
./nmon_x86_64_centos6 -f -t -r name_view_in_excel_sheet -s 15 -c 960
ls

    The command above means: take a snapshot every 15 seconds, 960 snapshots in total; "name_view_in_excel_sheet" is the chart name that will later appear in the Excel sheets and is usually set to the server's hostname.
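
For unattended collection, the help text above also suggests driving nmon from cron in capacity-planning mode. A hedged sketch only; the schedule and paths are assumptions, not something taken from the original setup:

# crontab entry: start one capacity-planning run at midnight every day
# (-x is equivalent to -ft -s 900 -c 96, i.e. one snapshot every 15 minutes for a day)
0 0 * * * /home/hadoop/nmon/nmon_x86_64_centos6 -x -m /home/hadoop/nmon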

 

    3.4 Converting the log file into Excel charts

     Step 1: open the file "nmon analyser v40.xlsm", click the "Analyze nmon data" button, and select the performance log file captured above, "hadoop31_160921_2357.nmon", as shown below:



     This step reads the contents of "hadoop31_160921_2357.nmon", renders the data as Excel charts, and finally produces an Excel file, as shown below:



 

       Through nmon I found that the resources consumed by the Hadoop and Spark clusters were in fact normal. The only abnormal thing was that after every Spark run, the memory used for cache on each server climbed to an astonishing 80 GB or more, and even with every Hadoop and Spark service shut down, the cache was still not released automatically after several days. I also ran an experiment: if the cache was dropped manually after each run, like this:

Manually drop the cache:
free -m
sync
echo 1 > /proc/sys/vm/drop_caches
clear
free -m

     After dropping the cache this way and then running the Spark jobs, the servers never once rebooted on their own.
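
For reference, the value written to /proc/sys/vm/drop_caches controls what gets dropped; this is standard Linux kernel behaviour rather than anything specific to this setup:

sync                               # flush dirty pages to disk first
echo 1 > /proc/sys/vm/drop_caches  # 1 = free page cache only
echo 2 > /proc/sys/vm/drop_caches  # 2 = free reclaimable slab objects (dentries and inodes)
echo 3 > /proc/sys/vm/drop_caches  # 3 = free both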

 

    All in all, this episode was at least good practice in using nmon to analyze system performance. The nmon data showed that the cache was not being released, which pointed the root cause back at the operating system: normally the OS should gradually release cache on its own.
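
To watch how much memory is sitting in cache without nmon, the standard tools are enough; a simple sketch (output values will of course differ per server):

free -m                                  # the buff/cache column shows buffers plus page cache in MB
grep -E 'MemFree|^Cached' /proc/meminfo  # the same information straight from the kernel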

 

4. Back to square one: solving the problem

    1) I began to doubt the system operations engineer's judgment, because a system rebooting on its own, without any warning whatsoever and leaving no log behind, made no sense;

    2) Since no logs could be found, the next step was to get the system operations engineer to capture them actively: first, the OS-level crash logs, and second, the core dump of the Java processes;

    3) I contacted the system operations engineer again and made a point of convincing him to go and obtain the two kinds of logs above;

    4) Having been convinced, he set about helping to collect them;

    5) Then came the good news: the system operations engineer obtained the system crash log, in which the following fatal error appeared:



    Examining the file vmcore-dmesg.txt revealed the following error (an alarming kernel BUG):


     This is a kernel-level bug in CentOS 7; my Linux kernel version is as follows:
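
The kernel version can be confirmed directly on the machine; on the affected servers this reports the 3.10.0-123.el7.x86_64 kernel named in the title of this post:

uname -r                 # running kernel version
cat /etc/redhat-release  # CentOS release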



     The CentOS site describes the bug as follows:



     Contention on a page table entry in memory triggered the kernel crash.
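
For reference, the vmcore and vmcore-dmesg.txt files mentioned above are produced by kdump. A minimal sketch of enabling it on CentOS 7, assuming the default crashkernel reservation is in place:

yum install -y kexec-tools   # provides the kdump service
systemctl enable kdump
systemctl start kdump
# after a crash, the dump is written under /var/crash/<timestamp>/ by default,
# including vmcore and vmcore-dmesg.txt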

 

     6) A quick search online turned up others who had hit similar problems on CentOS 7 (Linux version 3.10.0-123.el7.x86_64), for example "RH5885 V3 CentOS7.0 (Redhat7.0) kernel problem causes the system to reboot automatically".

      At this point, the problem that had plagued us for so long was finally solved; the fix, without question, was to upgrade the CentOS 7 kernel to a newer version.
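
A hedged sketch of the upgrade itself, assuming the standard CentOS repositories; the exact target kernel version is whatever the repositories provide at the time:

yum update -y kernel   # pull in the latest kernel package for CentOS 7
reboot                 # boot into the new kernel
uname -r               # confirm the running kernel after the reboot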

 

      

 

Comments
#3 zilongzilong 2017-12-27
furyamber wrote:

Hi, we ran into exactly the same error. We had previously read
https://bugs.centos.org/view.php?id=7474
and so roughly knew it was a kernel problem; our kernel version is the same as yours.
In the end, is upgrading the kernel enough to fix it?


Yes, upgrading the kernel is enough; this is a kernel-level bug.
#1 furyamber 2017-12-25

Hi, we ran into exactly the same error. We had previously read
https://bugs.centos.org/view.php?id=7474
and so roughly knew it was a kernel problem; our kernel version is the same as yours.
In the end, is upgrading the kernel enough to fix it?
