`
piperzero
  • 浏览: 3482196 次
  • 性别: Icon_minigender_1
  • 来自: 杭州
文章分类
社区版块
存档分类
最新评论

HttpException "Request timed out"的发生原因及调试方法

 
阅读更多

问题简介

ASP.NET程序中我们经常遇到错误消息为"Request Timed Out"HttpException,浏览器中的出错信息如下图所示,

同时应用程序日志中也会出现一条警告消息如下



如何重现这种异常?

线程在处理逻辑的过程中是如何被打断而抛出超时异常?

如何调试这种异常?

这篇文章我们从CLR源代码级别来分析一下该问题发生的原因以及相应的调试方法。

重现问题

首先我们先尝试重现这个问题,在ASP.NET程序web.config中有一个executionTimeout的设置,默认值为110秒,也就是说一个请求的执行时间超过110秒就会timeout从而产生HttpException "Request Timed Out"。这个设置可以通过IIS Configuration Editor或者web.config进行更改。

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <appSettings />
    <connectionStrings />
    <system.web>
        <compilation>
        </compilation>        
        <authentication mode="Windows" />
        <httpRuntime executionTimeout="10" />        
    </system.web>

</configuration>

为了快速重现这个问题我们将executionTimeout默认值改为10秒。然后建立一个让请求超时的asp.net页面如下。

<%@ Page Language="C#" AutoEventWireup="true" Inherits="System.Web.UI.Page" %>
<script runat="server">
    protected void Page_Load(object sender, EventArgs e)
    {
        System.Threading.Thread.Sleep(120000);
    }

</script>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title></title>
</head>
<body>
    <form id="form1" runat="server">
    <div>
    
    </div>
    </form>
</body>
</html>

将页面部署在IIS上,然后请求页面,等待10秒到25秒左右问题重现。

产生原理

这里有一个值得注意的地方,为什么我们页面逻辑在执行Sleep却抛出了HttpException,执行线程被谁打断了,是怎么打断的?

研究这个问题我们首先要用windbg来察看一下这个异常发生的调用栈信息。

通过windbg attachw3wp进程,然后输入命令'sxe clr'捕获clrfirst chance exception

CLR exception - code e0434f4d (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
KERNELBASE!RaiseException+0x68:
000007fb`bbb789cc 488b8c24c0000000 mov rcx,qword ptr [rsp+0C0h] ss:000000da`6bf0da20=38a1bff3b7640000

0:023> KL
Child-SP RetAddr Call Site
000000da`6bf0d960 000007fb`a0da98ef KERNELBASE!RaiseException+0x68
000000da`6bf0da40 000007fb`a0d03e07 mscorwks!RaiseTheExceptionInternalOnly+0x2ff
000000da`6bf0db30 000007fb`a0cc58cf mscorwks!Thread::HandleThreadAbort+0x19b
000000da`6bf0dbd0 000007fb`a0cbe815 mscorwks!Thread::HandleThreadInterrupt+0x7b
000000da`6bf0dc30 000007fb`a1280769 mscorwks!Thread::UserSleep+0x89
000000da`6bf0dc90 000007fb`415d09ee mscorwks!ThreadNative::Sleep+0xf9

000000da`6bf0de40 000007fb`41630f39 App_Web_kng1ridv!ASP.sleep120_aspx.Page_Load(System.Object, System.EventArgs)+0xe

000000da`6bf0fae0 000007fb`beb53501 KERNEL32!BaseThreadInitThunk+0x1a
000000da`6bf0fb10 00000000`00000000 ntdll!RtlUserThreadStart+0x21


栈上面的first chanceexceptionThreadAbortException

0:023> !dso
OS Thread Id: 0x12e0 (23)
RSP/REG Object Name
000000da6bf0d968 000000da0aac3ea0 System.Threading.ThreadAbortException
000000da6bf0da30 000000da0aac3ea0 System.Threading.ThreadAbortException
000000da6bf0da48 000000da0aac3ea0 System.Threading.ThreadAbortException
000000da6bf0db30 000000da0aac3ea0 System.Threading.ThreadAbortException
000000da6bf0dc98 000000da0aa93c90 System.Web.HttpContext
000000da6bf0dca0 000000da0aac3498 System.ComponentModel.EventHandlerList
000000da6bf0dd70 000000da0aac3498 System.ComponentModel.EventHandlerList
000000da6bf0dd90 000000da0aa93c90 System.Web.HttpContext

从这个调用栈上我们可以看出来这个异常来自于Thread::UserSleep -> Thread::HandleThreadInterrupt ->Thread::HandleThreadAbort

由于这个调用栈是CLR非托管代码,我们需要一份.NETFrameworkCLR的源代码作为参考,可以到微软下载一份Share Source CLI。找到\sscli20_20060311\sscli20\clr\src\vm\threads.cpp这个源文件,这个类封装的一系列CLR线程相关的属性和方法。

因为我们调用了Sleep方法,首先看一下Sleep函数的实现。原来线程调用Sleep方法之后进入系统内核,内核将其置于等待状态,一旦CLRSleepEx返回WAIT_IO_COMPLETION并且通过m_State & TS_Interrupted两个标志位来判断是interrupt的话,系统将开始执行HandleThreadInterrupt方法,进而在HandleThreadInterrupt方法中抛出ThreadAbortException

// Implementation of Thread.Sleep().
void Thread::UserSleep(INT32 time)
{
    CONTRACTL {
        THROWS;
        GC_TRIGGERS;
    }
    CONTRACTL_END;

    INCONTRACT(_ASSERTE(!GetThread()->GCNoTrigger()));

    DWORD   res;

    GCX_PREEMP();

    // A word about ordering for Interrupt.  If someone tries to interrupt a thread
    // that's in the interruptible state, we queue an APC.  But if they try to interrupt
    // a thread that's not in the interruptible state, we just record that fact.  So
    // we have to set TS_Interruptible before we test to see whether someone wants to
    // interrupt us or else we have a race condition that causes us to skip the APC.
    FastInterlockOr((ULONG *) &m_State, TS_Interruptible);

    // If someone has interrupted us, we should not enter the wait.
    if (IsUserInterrupted())
    {
        HandleThreadInterrupt(FALSE);
    }

    ThreadStateHolder tsh(TRUE, TS_Interruptible | TS_Interrupted);

    FastInterlockAnd((ULONG *) &m_State, ~TS_Interrupted);

retry:

    res = ClrSleepEx (time, TRUE);

    if (res == WAIT_IO_COMPLETION)
    {
        // We could be woken by some spurious APC or an EE APC queued to
        // interrupt us. In the latter case the TS_Interrupted bit will be set
        // in the thread state bits. Otherwise we just go back to sleep again.
        if ((m_State & TS_Interrupted))
        {
            HandleThreadInterrupt(FALSE);
        }

        // Don't bother with accurate accounting here.  Just ensure we make progress.
        // Note that this is not us doing the APC.
        if (time == 0)
        {
            res = WAIT_TIMEOUT;
        }
        else
        {
            if (time != (INT32) INFINITE)
            {
                time--;
            }

            goto retry;
        }
    }
    _ASSERTE(res == WAIT_TIMEOUT || res == WAIT_OBJECT_0);
}

接下来一个问题是到底是什么使得ClrSleepEx方法返回了呢?根据注释我们可以看到APC很可能是打断等待线程的原因。接下来我们继续通过debug验证一下。根据msdn用户模式可以调用kernel32!QueueUserAPC方法发送一个APC到指定线程。我们在这个方法上设置一个断点然后重现这个问题。结果抓到了这个调用栈。

0:022> kL
Child-SP RetAddr Call Site
000000da`6bbfe808 000007fb`a10ce75b KERNEL32!QueueUserAPCStub
000000da`6bbfe810 000007fb`a0e37e88 mscorwks!Thread::Alert+0xdb
000000da`6bbfe880 000007fb`a103d0d4 mscorwks!Thread::UserInterrupt+0x30
000000da`6bbfe8b0 000007fb`a11f5699 mscorwks!Thread::UserAbort+0x3984d0
000000da`6bbfe9d0 000007fb`997c4519 mscorwks!ThreadNative::Abort+0x169
000000da`6bbfebd0 000007fb`988f835f mscorlib_ni!System.Threading.Thread.Abort(System.Object)+0x39
000000da`6bbfec20 000007fb`97e94547 System_Web_ni!System.Web.RequestTimeoutManager+RequestTimeoutEntry.TimeoutIfNeeded(System.DateTime)+0xa63d4f
000000da`6bbfec60 000007fb`99772bdb System_Web_ni!System.Web.RequestTimeoutManager.CancelTimedOutRequests(System.DateTime)+0x167

000000da`6bbff8a0 00000000`00000000 ntdll!RtlUserThreadStart+0x21

看一下栈上的信息,究竟是哪个线程正在被abort

0:022> !dso
OS Thread Id: 0x14e4 (22)
RSP/REG Object Name
000000da6bbfe9c0 000000da0aa84400 System.Threading.Thread
000000da6bbfea28 000000d9cab1efe8 System.Web.HttpApplication+CancelModuleException
000000da6bbfea38 000000da0aa84400 System.Threading.Thread
000000da6bbfea88 000000d9caa4c810 System.String

0:022> !do 000000da0aa84400
Name: System.Threading.Thread
MethodTable: 000007fb998b86c0
EEClass: 000007fb994bea18
Size: 104(0x68) bytes
(C:\windows\assembly\GAC_64\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)
Fields:
MT Field Offset Type VT Attr Value Name
000007fb998f01b0 4000690 8 ....Contexts.Context 0 instance 000000d9caa285c0 m_Context
000007fb998b2bd0 4000691 10 ....ExecutionContext 0 instance 000000da0aace570 m_ExecutionContext
000007fb998b7c20 4000692 18 System.String 0 instance 0000000000000000 m_Name
000007fb998b8438 4000693 20 System.Delegate 0 instance 0000000000000000 m_Delegate
000007fb998aefe0 4000694 28 System.Object[][] 0 instance 0000000000000000 m_ThreadStaticsBuckets
000007fb998bed40 4000695 30 System.Int32[] 0 instance 0000000000000000 m_ThreadStaticsBits
000007fb998c05a0 4000696 38 ...ation.CultureInfo 0 instance 0000000000000000 m_CurrentCulture
000007fb998c05a0 4000697 40 ...ation.CultureInfo 0 instance 0000000000000000 m_CurrentUICulture
000007fb998b7510 4000698 48 System.Object 0 instance 0000000000000000 m_ThreadStartArg
000007fb998ba628 4000699 50 System.IntPtr 1 instance da6b789d60 DONT_USE_InternalThread
000007fb998bee90 400069a 58 System.Int32 1 instance 2 m_Priority
000007fb998bee90 400069b 5c System.Int32 1 instance 7 m_ManagedThreadId
000007fb998ac8d8 400069c 2d8 ...LocalDataStoreMgr 0 shared static s_LocalDataStoreMgr
>> Domain:Value 000000d9c90512d0:0000000000000000 000000d9ca176610:0000000000000000 <<
000007fb998b7510 400069d 2e0 System.Object 0 shared static s_SyncObject
>> Domain:Value 000000d9c90512d0:000000d9caa23bf0 000000d9ca176610:000000da0aa2d410 <<

0:022> !threads
ThreadCount: 10
UnstartedThread: 0
BackgroundThread: 9
PendingThread: 0
DeadThread: 1
Hosted Runtime: no
PreEmptive Lock
ID OSID ThreadOBJ State GC GC Alloc Context Domain Count APT Exception
3 1 150c 000000d9ca125a90 8220 Enabled 0000000000000000:0000000000000000 000000d9c90512d0 0 Ukn
19 2 23c8 000000d9ca12f510 b220 Enabled 000000da0aa8a238:000000da0aa8c200 000000d9c90512d0 0 MTA (Finalizer)
20 3 1a98 000000d9ca174ab0 80a220 Enabled 0000000000000000:0000000000000000 000000d9c90512d0 0 MTA (Threadpool Completion Port)
21 4 23f4 000000d9ca175b60 1220 Enabled 0000000000000000:0000000000000000 000000d9c90512d0 0 Ukn
22 5 14e4 000000da6b779c50 180b220 Disabled 000000d9cab1f000:000000d9cab20b18 000000d9ca176610 1 MTA (Threadpool Worker)
XXXX 6 0 000000da6b77d9d0 10820 Enabled 0000000000000000:0000000000000000 000000d9c90512d0 0 Ukn
23 7 12e0 000000da6b789d60 3a0b221 Enabled 000000da0aafd860:000000da0aafe2a8 000000d9ca176610 1 MTA (Threadpool Worker)
15 8 2180 000000da6b78a330 880a220 Enabled 0000000000000000:0000000000000000 000000d9c90512d0 0 MTA (Threadpool Completion Port)
24 9 13a4 000000da6b7903a0 220 Enabled 000000d9caae2e48:000000d9caae4aa8 000000d9c90512d0 0 Ukn
25 a e5c 000000d9ca107180 220 Enabled 000000d9cab20c68:000000d9cab22b18 000000d9c90512d0 0 Ukn


结果我们看到是23号线程是被abort的目标。23号线程正在做什么?查看23号线程调用栈正好是我们的Sleep方法线程。

0:023> kL
Child-SP RetAddr Call Site
000000da`6bf0db08 000007fb`bbb611f2 ntdll!NtDelayExecution+0xa
000000da`6bf0db10 000007fb`a10c369d KERNELBASE!SleepEx+0xaa
000000da`6bf0dbb0 000007fb`a0cbe7fd mscorwks!EESleepEx+0x2d
000000da`6bf0dc30 000007fb`a1280769 mscorwks!Thread::UserSleep+0x71
000000da`6bf0dc90 000007fb`415d09ee mscorwks!ThreadNative::Sleep+0xf9
000000da`6bf0de40 000007fb`41630f39 App_Web_kng1ridv!ASP.sleep120_aspx.Page_Load(System.Object, System.EventArgs)+0xe

000000da`6bf0f750 000007fb`a0d1970a mscorwks!UnManagedPerAppDomainTPCount::DispatchWorkItem+0x157
000000da`6bf0f7f0 000007fb`a0dcf3a4 mscorwks!ThreadpoolMgr::WorkerThreadStart+0x1ba
000000da`6bf0f890 000007fb`bc53167e mscorwks!Thread::intermediateThreadProc+0x78
000000da`6bf0fae0 000007fb`beb53501 KERNEL32!BaseThreadInitThunk+0x1a
000000da`6bf0fb10 00000000`00000000 ntdll!RtlUserThreadStart+0x21

到此我们可以证明我们的想法了,原来ASP.NET启动了一个RequestTimeoutManager来负责查看超时线程,然后通过发送APC的方式打断超时线程使之抛出ThreadAbortException结束等待,最后ThreadAbortException被包装成HttpException抛出,记录在日志以及显示在客户端。

调试方法

现实过程中Timeout不一定是出在线程Sleep上,也不是所有的ThreadAbortException都代表Timeout的问题发生。我们还是要总结一下类似问题的调试方法。

第一,观察timeout的时间,是否是100秒左右,与配置中的数值是否相符(注意这里比一定是一秒不差,因为查看RequestTimeoutManager逻辑可以发现他是每隔15秒查看一下超时线程,所有存在15秒左右的误差。

第二,如果不是超时时间设置过小的原因,那就要通过抓dump的方式来查看具体的超时请求超时在什么操作上。因为我们最终得到的ThreadAbortException中没有调用栈信息。

如何抓到对的dump是个值得研究的问题,我们可以把断点设在以下调用栈上的某个点上,注意断点越接近栈顶,这个方法可能被调用的次数就越多,抓到没有用的dump的可能性越大。

0:022> kL
Child-SP RetAddr Call Site
000000da`6bbfe808 000007fb`a10ce75b KERNEL32!QueueUserAPCStub
000000da`6bbfe810 000007fb`a0e37e88 mscorwks!Thread::Alert+0xdb
000000da`6bbfe880 000007fb`a103d0d4 mscorwks!Thread::UserInterrupt+0x30
000000da`6bbfe8b0 000007fb`a11f5699 mscorwks!Thread::UserAbort+0x3984d0
000000da`6bbfe9d0 000007fb`997c4519 mscorwks!ThreadNative::Abort+0x169
000000da`6bbfebd0 000007fb`988f835f mscorlib_ni!System.Threading.Thread.Abort(System.Object)+0x39
000000da`6bbfec20 000007fb`97e94547 System_Web_ni!System.Web.RequestTimeoutManager+RequestTimeoutEntry.TimeoutIfNeeded(System.DateTime)+0xa63d4f
000000da`6bbfec60 000007fb`99772bdb System_Web_ni!System.Web.RequestTimeoutManager.CancelTimedOutRequests(System.DateTime)+0x167

000000da`6bbff8a0 00000000`00000000 ntdll!RtlUserThreadStart+0x21

通过ILSpy查看CancelTimedOutRequestsTimeoutIfNeeded源代码,可以看到进入这个方法并不一定代表有线程会被abort,所以我们要从下面一层抓。最终结论是如果我们想要研究TimeoutThreadAbortException,我们可以mscorlib_ni!System.Threading.Thread.Abort上面设置断点抓dump,然后查看堆栈上面的线程,最终找到正在被abort的线程以及相应执行的调用栈信息。

// System.Web.RequestTimeoutManager
privatevoidCancelTimedOutRequests(DateTime now)
{
if(Interlocked.CompareExchange(refthis._inProgressLock,1,0)!=0)
{
return;
}
ArrayList arrayList =newArrayList(this._requestCount);
for(inti =0;i<this._lists.Length;i++)
{
DoubleLinkList obj;
Monitor.Enter(obj =this._lists[i]);
try
{
DoubleLinkListEnumerator enumerator =this._lists[i].GetEnumerator();
while(enumerator.MoveNext())
{
arrayList.Add(enumerator.GetDoubleLink());
}
}
finally
{
Monitor.Exit(obj);
}
}
intcount = arrayList.Count;
for(intj =0;j<count;j++)
{
((RequestTimeoutManager.RequestTimeoutEntry)arrayList[j]).TimeoutIfNeeded(now);
}
Interlocked.Exchange(refthis._inProgressLock,0);
}

// System.Web.RequestTimeoutManager.RequestTimeoutEntry
internalvoidTimeoutIfNeeded(DateTime now)
{
Thread thread =this._context.MustTimeout(now);
if(thread!=null)
{
this.RemoveFromList();
thread.Abort(newHttpApplication.CancelModuleException(true));
}
}



分享到:
评论

相关推荐

Global site tag (gtag.js) - Google Analytics