Java Performance Tuning on Linux Servers


by Andrew C. Oliver

18-Sep-2006

At JBoss, I was asked to help write some training materials on "performance tuning JBoss/Java on RHEL/Linux". It wasn't an easy task: I knew the audience would be composed primarily of administrators who might not be interested in the whole system, compounded by the fact that most people mean both performance and scalability when they say "performance". You see, what I would do to make one single client connect and perform its operations as quickly as possible on a single server is inherently very different from what I'd do for 10,000 users connecting to a cluster. Moreover, the type of performance tuning I do for an application with no users and all messaging is very different from what I do for a standard web application-type system.

However, the methods I'd use when designing a high-performance system, determining which subsystems need tuning, and selecting options are fairly universal. Moreover, there are some common areas that can be attended to and some very nasty common snags that can be avoided.

Let me preface this article with some assumptions. I assume that you're fairly close to the beginning of your journey and that your system looks more like the second scenario (a high-scale web application) than the first. Moreover, your system is more of a transaction processing system than an OLAP or messaging system, though some of this is fairly universal. Lastly, while most JDKs' command-line settings derive more or less from the Sun JDK's, they do differ (sometimes merely by a letter). When I list settings, I use Sun's because it is the most widely deployed JDK and all others are at least influenced by it.

Setting Goals

A good way to determine when a journey is over is to pick a destination before you begin. In other words, the only way to meet your performance goals is to actually state specific performance goals, whether that be "under 3-second page loads on the LAN" or specifics for particular subsystems. Ideally you have fairly granular performance goals for pieces of your software (for instance, where query/data retrieval performance is key) that can be tested on a target system (i.e., RHEL 4.0, Itanium II/3 GHz with 8 GB memory, SATA2 10,000 RPM -- getFooBarDataFromDatabase() returns in 3 ms).

However, having JUnit tests that measure method execution time isn't enough. Java/Linux systems have non-deterministic performance. For one, Linux is (generally) not a real-time operating system, and Java runtimes are generally not real-time either. Additionally, concurrency causes contention for resources: threads compete for processor time, for synchronized locks on resources, for database locks, and for disk utilization. In order to really set goals we have to set concurrency goals, such as the maximum and average number of logged-in users and performance expectations for the critical components under that load.

Load test tools

Having goals is a fine thing, but we must have ways to measure adherence to them. Moreover, we should have ways to measure the performance of different aspects of our system. It's easy to get an idea of raw page delivery with ab, the Apache HTTPD Server benchmarking tool, but that doesn't tell us enough about whether 300 users can log in at once while 1,000 users are logged in (one of our goals). For that you need more sophisticated tools such as the proprietary tools from Mercury, the more affordable Web Performance Suite, the open source (but Windows-only) OpenSTA, or the ever-popular (albeit primitive) and multiplatform (Java-based) Grinder. You can find other open source alternatives at http://opensourcetesting.org.
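
For a quick smoke test, ab takes a total request count and a concurrency level; here is a sketch (the URL is hypothetical, and the numbers should mirror your own goals, e.g. 300 concurrent logins):

ab -n 10000 -c 300 http://yourserver/login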

Generally you want tools that can both record and execute load scripts, preferably with parameters (cutting and pasting scripts for 10,000 unique login names would be tedious at best) and ideally text-based (so that you have the option to cut and paste, or generate with a Perl script, 10,000 unique logins). You also want unit test tools; there are others, but JUnit is the basic tool against which all the rest tend to be measured. Your unit tests can include temporal expectations, but you may need to code in some tolerance for when you run them on a system other than the target system (my laptop might not perform the same as a high-end server).
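
Generating that login data really is a one-liner. A sketch (the username/password format and file name are made up for illustration):

perl -e 'printf("user%05d,secret%05d\n", $_, $_) for 1..10000' > logins.csv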

Profiling tools

Using load test and performance tools is a fine thing, but if you can't figure out exactly where the problem is, they aren't much use. There is always the venerable System.out.println() method, but more sophisticated users use profiling tools and network analyzers. I'm sure other network analyzers exist, but the only ones I'm very familiar with are the (command-line and opaque) tcpdump and the ubiquitous (and very friendly, if GUI-only) Ethereal. I suggest having a UNIX or GNU/Linux-based system nearby even if you develop on Windows, as there is no easy/reliable way to spy on the loopback adapter on Windows.
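
On such a system you can capture loopback traffic with tcpdump and open the resulting file in Ethereal afterward; a sketch (8080 is a hypothetical application server port):

# tcpdump -i lo -w loopback.pcap tcp port 8080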

On GNU/Linux this is easy with Ethereal (see Figure 1). Ethereal helps you see WHAT exactly, and HOW MUCH, is being passed on your network. It's also great for laughs in airport terminals on the Wi-Fi, to find out just how much pookie loves schnookums and why (ironically unveiled by switching to promiscuous mode).

Figure 1: Ethereal in action

To analyze your code a little more closely there is always the JVM's -Xaprof option, but for a more sophisticated view you can use the (closed-source and proprietary) JProbe, the very sophisticated (but expensive, difficult to use, and closed-source) Wily Introscope, or any number of other Java profilers, including the as-yet-unmentioned JBoss Profiler.
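
On Sun JDKs of this vintage, -Xaprof prints an allocation profile when the VM exits; a sketch (the class name is a placeholder for your own entry point):

java -Xaprof com.example.MyApp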

Topology

Obviously you need to make sensible choices in the way you write your code, and use profiling and testing tools to make sure your code performs to your expectations and to find potential bottlenecks, but what about your physical system? The first aspect that needs to be addressed is your network topology. A very common topology is to physically separate each aspect of the system into separate processes on separate physical machines (see Figure 2). Thus every request is processed by a load balancer, a web server, a servlet container, an appserver/EJB tier, and a database server. Often this is coupled with some network rules (a "demilitarized zone") and rationalized as a security decision.

Figure 2: The simple layered architecture

The problem with this topology is that, with the intent of preventing less common types of attacks (deliberate crackers), it exacerbates the most common type (denial of service) and does nothing for some of the simplest (SQL injection). Generally speaking, on a web-based system the first line of defense is your load balancer and its inability to pass anything more than HTTP; any cracker who can get past it can likely compromise your HTTP server anyway. Moreover, the more layers of network I/O you add, the greater the performance cost of processing any given request. Thus, using the above-mentioned load test tools, you and 10 of your friends could probably take down many sites that practice this form of "security" (please don't) simply by exercising the login process, which is often the most laborious operation and the most susceptible to denial of service attacks.

An ideal topology would be one in which there are redundant load balancers (for high availability) and one layer of identical nodes, each including the web, servlet, EJB, and database tiers all in-process using operating system threads (this breaks down at some scale, where more advanced scheduling is needed, but that is a much longer article). Additional scalability would be achieved by adding one more of the identical systems. This isn't completely practical with today's technology, and with many datasets it is simply impossible. Therefore our closest approximation of the ideal is a set of redundant load balancers, multiple web/appserver boxes (each running the web, servlet, and EJB tiers in the same process with operating system threads), and some sort of HA database solution (from Oracle's RAC to MySQL's clustering). Figure 3 shows the architecture of this setup.

Figure 3: The more efficient layered architecture

For clustered systems we want to separate our network traffic onto separate backbones. Ideally there should be a separate network interface card in each system for incoming client traffic, cluster replication data, and backend (database) communication. The communication for each should be bound strictly to its card (trivial to do with most open source application servers and Linux) and should travel on a separate backbone; see the sketch below. These days gigabit Ethernet is fairly cheap and commonplace and should be the default for new systems; older systems may benefit from an upgrade. The rationale for this setup is not only performance but also ease of problem determination and security (dedicated network backbones are easier to firewall).
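
As a sketch, each interface gets its own subnet -- eth0 for client traffic, eth1 for cluster replication, eth2 for the database backbone -- and the application server and database are then bound to the appropriate addresses (all addresses here are hypothetical):

# ifconfig eth0 192.168.1.10 netmask 255.255.255.0
# ifconfig eth1 10.0.1.10 netmask 255.255.255.0
# ifconfig eth2 10.0.2.10 netmask 255.255.255.0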

Edge Caching

If you stick to this topology advice (in a nutshell: less I/O means more performance and scalability), you're going to have a question at this point about caching, and more particularly edge caching (see Figure 4).

Figure 4: Architecture with edge cache

There is no clear-cut advice here. Theoretically, given equal network performance, an edge caching service like those provided by Akamai Technologies should slow your overall client-side throughput down. Obviously this is often not the case. Why? Because as of HTTP 1.1 it is no longer necessary for each page, image, or object to require a separate network connection request/response/close cycle. A client can request several objects over one connection; moreover, it's common for load balancers to use persistent connections to the web server tier. By using edge caching you embed images from a different domain in your HTML output. This means that the client must request those images over a separate connection, thus driving up the real system cost of delivering the page.

However, all things often are not equal, and thus you may find that edge caching does indeed perform better. The only way to know for sure is to load test from an offsite location (preferably one network-geographically near your largest concentration of users). If you go with an edge caching solution, make sure its benefit is not merely a matter of increased CPU capacity, and that you're not merely sidestepping a slow load balancer.

Network Issues

Linux has a virtual cornucopia of network tuning options, from the hardware level on up the stack. For many medium-grade systems the default options are just fine. However, higher-end systems may benefit from a larger network buffer size and quite possibly a larger maximum transmission unit (MTU) on the database and cluster backbones. The MTU is the maximum size a packet can be without being split. The MTU on the Internet is more or less fixed at about 1500 bytes, due to defective specs for MTU discovery and to firewall rules; that was fine for 10BASE-T but is dated for gigabit Ethernet. An MTU of about 9000 is probably a good tradeoff between safe and optimal. You need to be careful when setting this and ensure that your routers and other networking equipment are configured to handle larger transmission units, or you may take yourself off the network. The change can be made permanent, but the method for doing so differs among Linux distributions.

The following settings, added to the /etc/sysctl.conf file, raise the maximum TCP buffer sizes.

# TCP max buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
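
Settings in /etc/sysctl.conf are applied at boot; to apply them immediately, reload the file:

# sysctl -p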

You can also change the MTU at the command line (this does not survive a reboot).

# ifconfig eth0 mtu 9000
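
On Red Hat-style distributions, for example, the MTU change can be made permanent by adding a line to the interface's configuration file; a sketch (file locations and syntax differ on other distributions):

# /etc/sysconfig/network-scripts/ifcfg-eth0 (excerpt)
MTU=9000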

Threading Issues

Linux was not always a very good operating system for running Java software. Early Java implementations on Linux used "green threads", which cannot scale across multiple processors. Later implementations used lightweight processes (which is what Linux provided in the way of "thread support"), and these performed poorly compared to thread-based systems. Today's Linux distributions (2.6 kernel and later) support NPTL threads, which are lighter weight and can scale across processors. Red Hat and other distributions previously offered a backport of NPTL threads to 2.4 kernels, but those backports often caused instability. Ensure that you are running a 2.6 kernel (uname -a is usually sufficient to check), preferably with a JDK 5 distribution (most large-scale Lintel deployments run the Sun JDK or BEA's JRockit or close derivatives), but at least a 1.4.x distribution (we'll discuss why 5 makes a big difference later).
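
A quick way to verify both the kernel and the threading library on a glibc-based system (the output shown is only an example; your version strings will differ):

$ uname -r
2.6.9-42.EL
$ getconf GNU_LIBPTHREAD_VERSION
NPTL 2.3.4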

Memory In Java

The Java language and runtime are heavily premised on the concept of garbage collection. Unlike typical C and C++ applications, in Java you do not allocate and free memory; the system does this for you. There are multiple types of memory in Java, two of which you can tune/control. The first is "the heap", or the Java heap: this is where your objects (those that are not primitive stack variables) are allocated. The heap is divided into segments, which we'll discuss momentarily. The second type is "the stack", which is where the aforementioned primitives and the call stack are allocated. The heap is garbage collected; the stack is allocated on a per-thread basis.

You can set the maximum heap size on the command line when starting Java (or, often, in the shell script which starts your application server) by passing -Xmx1g (1 GB, for example). If your software requires more than this (with space for garbage collection), you'll experience an OutOfMemoryError. There is no good way to determine the exact amount of memory a Java program requires, therefore testing is essential. It is suggested that on larger production systems you also set the minimum heap size (-Xms) equal to your maximum: if your heap has to grow during peak utilization, a performance spike may occur, and moreover some of the nastiest intermittent stability bugs have been in the interaction between heap resizing and garbage collection.
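
Most application server start scripts (JBoss's run.sh among them) pick up JVM options from the JAVA_OPTS environment variable; a sketch (the 1 GB size is illustrative, not a recommendation):

export JAVA_OPTS="-Xms1g -Xmx1g"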

The heap is divided into "generations", which are cleaned differently (see Figure 5), and there are different garbage collection algorithms for different situations. For most large systems either parallel or concurrent garbage collection is optimal; generally parallel is suggested, as concurrent is more susceptible to issues of memory fragmentation and thread contention. The first generation is often called the "new generation" or "eden". Objects start here. When eden fills, it is cleaned in a "minor collection", in which live objects are moved to the "survivor spaces", and objects which have survived a few iterations are moved to the "tenured generation". After the heap is (by default) 68% full, a major collection takes place and the tenured generation is cleaned. If insufficient memory is freed, a "full collection" occurs. Additionally there is a "permanent generation", which is allocated in addition to the heap specified via the -Xmx option. This is where class definitions go (thus, if you get an OutOfMemoryError on redeployment, it is most likely here); it is sized using the -XX:PermSize and -XX:MaxPermSize options (e.g., -XX:MaxPermSize=128m).

Figure 5: The JVM heap structure

JDK 1.4.2 introduced parallel and concurrent collection at beta quality, disabled by default. Unfortunately, it frequently core dumped or experienced other stability issues. JRockit offered more stable support for parallel and concurrent collection and often performed better. As of JDK 5, JRockit and the Sun JDK are close contenders on performance, and in my experience which performs better depends heavily on the underlying application. Some anecdotal evidence suggests that systems with low contention and large object sizes may perform better with the Sun JDK, while systems with high contention and many small objects may perform better with JRockit. The garbage collection methods in each differ somewhat, but understanding the Sun model makes it easier to understand other GC models. In JDK 5 much less heap tuning is necessary, and you're probably okay just using the -Xmx and -Xms options to size the heap. However, sometimes the JDK guesses "wrong" with its autotuning and you must fix things. It can also be helpful to fix the size of the new and permanent generations, or to explicitly state a preference for parallel garbage collection. A smart thing to do is to log GC activity to a file and try different options during your load test. A complete set of options for Sun's JDK can be found here. You should purchase machines with parallel garbage collection in mind (i.e., at least 4 total cores; the larger the heap, the more cores you need).
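
A sketch of a load-test configuration with fixed generation sizes, explicit parallel collection, and GC logging (the flags are Sun JDK 5 options; the sizes and log path are illustrative only and must be validated against your own load tests):

export JAVA_OPTS="-Xms2g -Xmx2g -XX:NewSize=512m -XX:MaxNewSize=512m \
  -XX:+UseParallelGC -verbose:gc -Xloggc:/tmp/gc.log -XX:+PrintGCDetails"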

Next, on 32-bit Lintel machines the maximum heap size that you can reliably set is close to 1 GB with default settings. This is because the maximum RSS is 2 GB, and the perm space, JDK overhead, and thread stack size multiplied by the number of threads all come in addition to your -Xmx setting. On 64-bit systems (using the -d64 flag) you theoretically have many exabytes of address space (giga, tera, peta, exa). You can up the 32-bit RSS to 4 GB by using large memory pages (which are available with the 2.6 kernel); you can find instructions on this here. For new hardware purchases, ensure the machine at least supports EM64T (Intel's implementation of AMD's x86-64) or x86-64 (AMD).
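
A 64-bit sketch (requires a 64-bit JDK; the heap size is illustrative, com.example.MyApp is a placeholder, and -XX:+UseLargePages additionally requires the kernel's large pages to be configured):

java -d64 -Xms4g -Xmx4g -XX:+UseLargePages com.example.MyApp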

Finally, be mindful of the thread stack size of the JDK on your platform. The default on 32-bit Lintel is 512 KB per thread; the default on a 64-bit system is 1 MB per thread. Those may sound like relatively small numbers, but you can easily have 1,000 threads on a busy application server: that is 512 MB. If this is a 32-bit JVM and you've sized the heap relatively large, you may get a "cannot create native thread" error or an out of memory error that is not connected to actual heap memory. Given that not too many years ago the stack size was around 16 KB, you can probably afford to lower it. 256 KB is usually the "safe" number I give to JBoss customers, but when tuning onsite I often use 128 KB (which I validate). Be aware that the -Xss option can only make the stack size bigger; you will have to use the -XX:ThreadStackSize option to make it smaller. It is of course essential to load test this change. You may find that your code uses more stack than you thought!
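
A sketch of shrinking the stacks (the value is in kilobytes; com.example.MyApp is a placeholder, and any such change must be load tested):

java -XX:ThreadStackSize=128 com.example.MyApp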

Database Issues

The most obvious database issue to attend to is your connection pool size; setting this is application server dependent. You really WANT a connection pool for any serious system, but be warned that there are several buggy ones out there, in open source as well as closed source. Something else you may want to attend to is your database selection, its locking strategy, and its isolation level (usually configurable in the appserver). Also affecting this is whether or not your developers truly understand transactions and have utilized proper transaction demarcation techniques (chances are they don't), as well as "flush strategies" -- but that would make for a much longer article. Oracle and PostgreSQL offer Multi-Version Concurrency Control (MVCC), which, when used properly (in Hibernate and EJB3 this is done with "versioning"), can greatly increase concurrency through optimistic locking. In fact, to get a pessimistic lock when MVCC is used you typically have to issue the equivalent of a SQL SELECT ... FOR UPDATE statement. MySQL 4.x (InnoDB) offers MVCC for reads only (writes still acquire a lock). DB2 and Informix, among others, offer pessimistic row locks or page locks (depending on the platform); this can very seriously affect how your code is written, and you will likely have to do some really tricky things to deal with concurrency. Theoretically MVCC is less efficient because it requires more copying, but in practice it is more efficient under concurrency. Consider how frequent contention will be when selecting a database, and consider its locking strategy.

Conclusion

This is the tip of a much larger iceberg. I'd have loved to talk more about threading, contention, and clustering strategies. I'd have loved to talk about the most common concurrency and performance/scalability horror stories and more on how to diagnose these things, but "the man" said that this article could only be so long. So in conclusion: design for performance and concurrency, write tests, load test, select a sensible topology, test and tune your JVM's GC, run on a 2.6 kernel, and buy more memory and a 64-bit machine with at least 4 cores (fine if that's in 2 processors or 4). Hope I haven't bored you. If you would like to chat more on this you can reach me at acoliver ot jboss dat org and often find me in the #jboss channel on irc.freenode.net.

About the author

Andrew C. Oliver is a professional cat herder who moonlights as a Java developer. He leads the JBoss Collaboration Server project and does various troubleshooting and performance tuning for JBoss customers. He has a wife and 3 kids, lives in Durham, NC, and loves to read hex dumps. You can find out more about him on his blog.
