Solaris is one of the most stable operating systems available... but
lets face it, stuff happens. Solaris does panic, but I want everyone to
be clear, a "panic", despite the seemingly contradictory name, is by
its nature a controlled event. When the kernel encounters behavior that
is uncorrectable and will cause irreparable harm to the running system
or, even worse, corrupt data, the system will voluntarily tap out using
the panic system call to get the system down quickly, hopefully leaving
a core dump in its wake for post-mortem analysis.
In this blog entry we'll discuss core dumps and panic's in general.
In part 2 we'll discuss a tool to make life just a little easier, the
Solaris Crash Analysis Tool, or "Solaris CAT".
I want to point out that post-mortem core analysis is really
the task of a kernel engineer. The fact is, way less than 1% of us who
ever engage in core analysis are actually going to have any real idea
of what the hell we're doing. And thats ok! I guarantee that you'll
post something from an analysis to a mailing list and you'll get some
asshole who forgets that he's been paid to work on the Solaris kernel
for the last 20 years while you work a job which is now on hold because
of said core dump, with replies like "We can clearly see that due to
the memory address in this register that you are a moron...." The point
here is, if you don't know what your doing, don't be discouraged. What
we, mere mortals, are trying to do is not necessarily solve the problem
but provide clues which will help us guide our search, either by
posting a stack trace to a mailing list, or send the dump to Sun
Support, or to take a panic string and search the bug database or google for. The cuddletech rule of crashes is:
An unexpected crash is unacceptable; An unexplained crash is inexcusable.
If you're reading this you've probably lived through a panic before,
but lets recap. The best explanation of a "crash" event and resulting
dump can be found in the dumpadm(1M) man page:
A crash dump is a disk copy of the physical memory
of the computer at the time of a fatal system error. When a
fatal operating system error occurs, a message describing
the error is printed to the console. The operating system
then generates a crash dump by writing the contents of phy-
sical memory to a predetermined dump device, which is typi-
cally a local disk partition. The dump device can be config-
ured by way of dumpadm. Once the crash dump has been writ-
ten to the dump device, the system will reboot.
Fatal operating system errors can be caused by bugs in the
operating system, its associated device drivers and loadable
modules, or by faulty hardware. Whatever the cause, the
crash dump itself provides invaluable information to your
support engineer to aid in diagnosing the problem. As such,
it is vital that the crash dump be retrieved and given to
your support provider. Following an operating system crash,
the savecore(1M) utility is executed automatically during
boot to retrieve the crash dump from the dump device, and
write it to a pair of files in your file system named unix.X
and vmcore.X, where X is an integer identifying the dump.
Together, these data files form the saved crash dump. The
directory in which the crash dump is saved on reboot can
also be configured using dumpadm.
I encourage you to read both the savecore(1M) and dumpadm(1M) man pages. You'll find that with savecore -L
you can create a dump of a live system, so if you don't have a crashed
system around to play with, use that. Alternatively, you can use reboot -d to dump a core and reboot.
At this point we'll assume you have a dump available. By default you'll find them in /var/crash/hostname/,
you'll have dumps in pairs: vmcore.0 and unix.0. We feed these two
files to mdb, the (-k, kernel) Modular DeBugger, to preform our
analysis like so:
# mdb -k unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace cpu.AuthenticAMD.15 uppc pcplusmp ufs ip sctp usba lofs zfs random ipc md fcip fctl fcp crypto logindmux ptm nfs ]
>
You are now free to move about the dump. mdb commands are strange
and unusual at first, it takes a lot of time to get comfortable with
it, but there are a couple of debugger commands that can give us the
essence of what we need. Lets walk through them.
The ::status command will display high level information
regarding this debugging session. Of usefulness here is the dumps
"panic message" and OS release.
> ::status
debugging crash dump vmcore.0 (64-bit) from hostname
operating system: 5.11 snv_43 (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffe80000ad3d0 addr=0 occurred in module "unix" due to a NULL pointer dereference
dump content: kernel pages only
The ::stack command will prove you with a stack trace, this is the same thing trace you would have seen in syslog or the console.
> ::stack
atomic_add_32()
nfs_async_inactive+0x55(fffffe820d128b80, 0, ffffffffeff0ebcb)
nfs3_inactive+0x38b(fffffe820d128b80, 0)
fop_inactive+0x93(fffffe820d128b80, 0)
vn_rele+0x66(fffffe820d128b80)
snf_smap_desbfree+0x78(fffffe8185e2ff60)
dblk_lastfree_desb+0x25(fffffe817a30f8c0, ffffffffac1d7cc0)
dblk_decref+0x6b(fffffe817a30f8c0, ffffffffac1d7cc0)
freeb+0x89(fffffe817a30f8c0)
tcp_rput_data+0x215f(ffffffffb4af7140, fffffe812085d780, ffffffff993c3c00)
squeue_enter_chain+0x129(ffffffff993c3c00, fffffe812085d780, fffffe812085d780, 1, 1)
ip_input+0x810(ffffffffa23eec68, ffffffffaeab8040, fffffe812085d780, e)
i_dls_link_ether_rx_promisc+0x266(ffffffff9a4c35f8, ffffffffaeab8040, fffffe812085d780)
mac_rx+0x7a(ffffffffa2345c40, ffffffffaeab8040, fffffe812085d780)
e1000g_intr+0xf6(ffffffff9a4b2000)
av_dispatch_autovect+0x83(1a)
intr_thread+0x50()
The ::msgbuf command will output the message buffer at the
time of crash; the message buffer is most commonly used by sysadmins
through the "dmesg" command.
> ::msgbuf
MESSAGE
....
WARNING: IP: Hardware address '00:14:4f:xxxxxxx' trying to be our address xxxx
WARNING: IP: Hardware address '00:14:4f:xxxx' trying to be our address xxxx
panic[cpu0]/thread=fffffe80000adc80:
BAD TRAP: type=e (#pf Page fault) rp=fffffe80000ad3d0 addr=0 occurred in module "unix" due to a NULL pointer dereference
sched:
#pf Page fault
Bad kernel fault at addr=0x0
.... blah blah, snipped for brevity.
The ::panicinfo command will give you lots of fun cryptic
counter information, of most interest is the first 3 lines, which
contain the CPU on which the panic occured, the running thread, and the
panic message. You'll notice these are commonly repeated and the most
useful pieces of information.
> ::panicinfo
cpu 0
thread fffffe80000adc80
message BAD TRAP: type=e (#pf Page fault) rp=fffffe80000ad3d0 addr=0 occurred in module "unix" due to a NULL pointer dereference
rdi 0
rsi 1
rdx fffffe80000adc80
rcx 0
r8 0
r9 fffffe80dba125c0
rax 0
rbx fffffe8153a36040
rbp fffffe80000ad4e0
r10 3e0
r10 3e0
r11 ffffffffaeab8040
r12 ffffffffb7b4cac0
r13 0
r14 fffffe820d128b80
r15 ffffffffeff0ebcb
fsbase ffffffff80000000
gsbase fffffffffbc27850
ds 43
es 43
fs 0
gs 1c3
trapno e
err 2
rip fffffffffb838680
cs 28
rflags 10246
rsp fffffe80000ad4c8
ss 0
gdt_hi 0
gdt_lo defacedd
idt_hi 0
idt_lo 80300fff
ldt 0
task 60
cr0 80050033
cr2 0
cr3 10821b000
In my opinion, the koolest command is ::cpuinfo -v. Truth
be told, if you run multiple applications on a server the most common
question people (especially managers) want answered is "which
application did it?", being translated into geek-esse "who do I blame?"
This command will help you determine that by displaying, complete with
beautiful ASCII art, the threads and process names running on each CPU
(NRUN). In the following example, we know the event occured on CPU 0,
thus thats the one we want to look at. Note that the "sched" process
should be interpreted as "kernel".
> ::cpuinfo -v
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 fffffffffbc2f370 1b 1 0 165 no no t-1 fffffe80000adc80 sched
| | |
RUNNING <--+ | +--> PIL THREAD
READY | 6 fffffe80000adc80
EXISTS | - fffffe80daab6a20 ruby
ENABLE |
+--> PRI THREAD PROC
99 fffffe8000b88c80 sched
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
1 ffffffff983b3800 1f 1 0 59 yes no t-0 fffffe80daac2f20 smtpd
| |
RUNNING <--+ +--> PRI THREAD PROC
READY 99 fffffe8000bacc80 sched
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
2 ffffffff9967a800 1f 2 0 -1 no no t-0 fffffe8000443c80
(idle)
| |
RUNNING <--+ +--> PRI THREAD PROC
READY 99 fffffe8000b82c80 sched
QUIESCED 60 fffffe80018f8c80 sched
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
3 ffffffff9967a000 1f 1 0 -1 no no t-0 fffffe8000535c80
(idle)
| |
RUNNING <--+ +--> PRI THREAD PROC
READY 60 fffffe8000335c80 zsched
QUIESCED
EXISTS
ENABLE
The ::ps command allows us to see all running processes. Several flags are supported, including -z to display Zone ID's.
> ::ps -z
S PID PPID PGID SID ZONE UID FLAGS ADDR NAME
R 0 0 0 0 0 0 0x00000001 fffffffffbc25900 sched
R 3 0 0 0 0 0 0x00020001 ffffffff9970d928 fsflush
R 2 0 0 0 0 0 0x00020001 ffffffff9970e558 pageout
R 1 0 0 0 0 0 0x42004000 ffffffff9970f188 init
R 20534 1 20533 20533 24 1006 0x42010400 ffffffffb246f9b8 ruby
R 20532 1 20531 20531 24 1006 0x42010400 fffffe8109674308 ruby
R 20529 1 20528 20528 24 1006 0x42010400 fffffe80dc5602f0 ruby
...
We can use ::pgrep to search for processes and use the
appropriate address for further digging. In the following example I'll
find a Java process and then determine which zone that process was
running in:
> ::pgrep java
S PID PPID PGID SID UID FLAGS ADDR NAME
R 3628 1 3620 3574 0 0x42004400 fffffe80deeb3240 java
> fffffe80deeb3240::print proc_t p_zone->zone_name
p_zone->zone_name = 0xffffffffae0cef00 "testzone03"
There are many more tools and way to dig into your dumps using mdb.
It can be confusing because you need to reference things by address,
but you get more comfortable with it as you play around. If you are
interested in learning more I highly recommend reading Eric Lowe's "Examining the Anatomy of a Process", which digs into the topic of process examination via mdb.
One thing you'll notice in all this is that the messages at the time of
crash on the console or in syslog contain almost everything you need to
know without digging too deeply. Therefore, assuming you have those
messages, the most useful thing most people will extract from the core
files is the output of the ::cpuinfo
command to see what process was on the offending CPU at the time of the
crash. Knowing what processes, zones, etc, were running at the time of
crash are interesting but rarely mean much if they weren't directly
involved in the panic.
As I said, once you start getting into referencing memory addresses
to deepen your analysis things get sticky and tricky very very
quickly... thats where Solaris CAT comes in, which we'll talk about in
part 2.
分享到:
相关推荐
"The Solaris™Internals volumes are simply the best and most comprehensive treatment of the Solaris (and OpenSolaris) Operating Environment. Any person using Solaris--in any capacity--would be remiss ...
Eplan 备件库
误删了eplan部件库的兄嘚,可以自取,我也经历过这事,同理心免费拿走。 覆盖位置:C:\Users\Public\EPLAN\Data\部件\China
MDB Viewer Plus是一款免费的MDB数据库浏览编辑器软件!它不需要安装运行微软提供的数据访问组件( MDAC ),就可以查看,编辑,查询和搜索Access MDB文件! MDB Viewer Plus/MDB数据库浏览编辑软件已经更新到最新版本...
最近刚好有个项目要连接ACCESS的MDB数据并导入到ORACLE中,使用 Java代码 Class.forName("sun.jdbc.odbc.JdbcOdbcDriver"); String strurl="jdbc:odbc:driver={Microsoft Access Driver(*.mdb)};DBQ=E:\\\\db.mdb...
1.视频讲解MDB个人地理数据库批量合并案例,解决FME中数据合并类功能需求。 2.本视频讲解中使用的FME版本为2020.2中文版。 3.本资源仅为视频讲解,用户需自行安装相应版本的FME。
mdb 使用手册 mdb 使用手册 mdb 使用手册 mdb 使用手册 mdb 使用手册
MDB CLIβ 创建和托管MDB项目的最快方法。 特征 :light_bulb: 用一个命令启动您的项目⠀-减少您的工作时间!⠀⠀ :check_mark: :blue_book: 使用单个命令发布项目-只需发布并获取链接即可共享 :check_mark:...
Solaris™ Performance and Tools: DTrace and MDB Techniques for Solaris 10 and OpenSolaris 针对Solaris和opensolaris程序开发的指导
欢迎来到令人激动的MDB工具世界! MDB工具是一组程序,可帮助您从各种设置的Microsoft Access文件中提取数据。 有关最新版本的信息,请参阅文件。组件MDB工具的主要部分包括:的libmdb 允许以编程方式访问MDB文件的...
MDB 数据库 修改工具 MDB数据库修改工具
mdb-sqlite.NET mdb-sqlite.NET 最初基于 [mdb-sqlite] 项目( ),由 Landon Fuller( )创建。 我做了一些修改并将其翻译成 .NET C# 版本。 要使用这个工具,你需要通过 NuGet 安装 [System.Data.SQLite] 包( )...
详细介绍了如何在ArcGIS Engine开发中加载MDB中的数据, 带有C#开发的代码
无法打开数据库C:\Users\Public\EPLAN\data\部件\Microsoft\ESS_part001.mdb。针对64位版本的EPLAN平台需要使用64位版本的Microsoft Office。
MDB 协议MDB 协议MDB 协议
模块调试器(ModularDebugger,MDB)是用于Solaris的通用调试工具,其主要特性是可扩展 性。本书介绍如何使用MDB调试复杂的软件系统,尤其着重介绍了可用于调试Solaris内 核以及关联的设备驱动程序和模块的工具。本书还...
MDB-RS232专业版测试工具,可以通过威佛MDB-RS232适配器对MDB纸币器(08H),硬币器(30H),非现金支付1(10H),非现金支付2(60H),USB设备(40H),年龄验证设备(68H)进行综合测试。当然下载测试工具之前需要...
省市县邮编 db1.mdbdb1.mdb
Northwind-1.mdb