Reposted from: http://www.xatlantis.ch/education/multi_computer.html
Index of Multiple Processor
1. Multiple processor systems
Computer systems must become faster and smaller to meet the requirements of industry. Unfortunately, we are beginning to hit fundamental physical limits on clock speed. No electrical signal can travel faster than the speed of light, which is about 30 cm/nsec in vacuum and about 20 cm/nsec in copper wire or optical fiber, so the size of a processor is limited:
- for a 10 GHz processor the signal cannot travel more than 2 cm per clock cycle
- for a 100 GHz processor the signal cannot travel more than 2 mm per clock cycle
- a 1 THz processor must be smaller than 100 microns, so that a signal can cross it and return within one clock cycle
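The limits in this list follow from simple arithmetic: distance per cycle is propagation speed times clock period. The sketch below reproduces the figures, assuming the 20 cm/nsec speed quoted above; the function names are illustrative and not part of the original article.

```python
SPEED_CM_PER_NS = 20.0  # signal speed in copper wire or fiber, as quoted in the text


def path_per_cycle_cm(clock_hz: float) -> float:
    """Distance in centimetres a signal can cover during one clock period."""
    period_ns = 1e9 / clock_hz
    return SPEED_CM_PER_NS * period_ns


def max_round_trip_size_cm(clock_hz: float) -> float:
    """Largest size that still allows a signal to cross and return in one cycle."""
    return path_per_cycle_cm(clock_hz) / 2
```

At 1 THz the path per cycle is 0.02 cm (200 microns), so a round trip fits only if the processor is smaller than 100 microns.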
We face another major problem: heat dissipation. One way to get around these problems and still achieve a large improvement in speed is to build parallel computers or biological computers.
1.1 Parallel computers
Parallel computers are built from conventional hardware (processors) connected by a bus system. All communication between the cores (processors or computers) is realized using messages.
There are 3 general types of multi processor systems:
- Shared-memory multiprocessor. The message passing is invisible to the programmer; it is done under the covers. The implementation is not easy, and there are some limits as well. Accessing a memory word takes about 0.002-0.01 µsec.
- Message-passing multicomputer. Processor-memory pairs are connected by a high-speed interconnect. Implementation is much simpler than for the shared-memory multiprocessor. Accessing a memory word takes about 10-50 µsec.
- Wide area distributed system. This system connects complete computer systems over a wide area network such as the Internet, forming a distributed system. Since access is rather slow, the systems are loosely coupled. Accessing a memory word takes about 10'000-100'000 µsec.
1.2 Multiprocessors
A computer system which contains more than one processor accessing a common RAM is called a multiprocessor.
Regular operating systems can be used, with some extended features such as
- process synchronization
- resource management
- scheduling
There are two major groups of multiprocessors:
- UMA (Uniform Memory Access). Every memory word can be read as fast as any other memory word.
- NUMA (Nonuniform Memory Access). These systems do not have this property.
1.2.1 Bus based architecture for UMA Multiprocessors
This is the simplest approach. All processors and one or more memory modules use the same bus for communication.
1. The CPU waits until the bus is free.
2. It puts the address of the word on the bus and waits until the memory puts the requested word on the bus.
3. If the bus is busy, the CPU has to wait.
The major problem is step 3. The system is limited entirely by the bandwidth of the bus. To get better performance, each processor gets its own cache. Some of the reads can then be satisfied from the cache, which reduces the usage of the bus.
The cache handling must be extended.
- The cache block is marked as read-only if it is present in multiple caches
- The cache block is marked as read-write if it is not present in any other cache
If a CPU wants to write a word which is in more than one cache (at least one remote cache), the bus hardware detects the write command and puts a signal on the bus informing all caches of the write. If one of the caches has a "dirty" entry (already modified), that cache must write the modified copy back to memory before the current cache can read and modify the word.
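The write handling described above can be sketched as a small simulation. This is a simplified write-invalidate scheme under stated assumptions: remote copies are discarded on a write, and dirty copies are written back first. The `Bus` and `Cache` classes and their methods are hypothetical names, not taken from the original article or from any real hardware interface.

```python
class Bus:
    """Shared bus with main memory and the caches that snoop on it."""

    def __init__(self):
        self.memory = {}   # address -> value
        self.caches = []   # every cache attached to this bus

    def write_back_dirty(self, addr, requester):
        # A remote cache holding a modified ("dirty") copy writes it to memory.
        for cache in self.caches:
            line = cache.lines.get(addr)
            if cache is not requester and line is not None and line["dirty"]:
                self.memory[addr] = line["value"]
                line["dirty"] = False

    def invalidate_others(self, addr, writer):
        # Signal the write on the bus: dirty copies are saved, then dropped.
        self.write_back_dirty(addr, writer)
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)


class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}    # address -> {"value": ..., "dirty": bool}
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:  # miss: fetch the word over the bus
            self.bus.write_back_dirty(addr, self)
            value = self.bus.memory.get(addr, 0)
            self.lines[addr] = {"value": value, "dirty": False}
        return self.lines[addr]["value"]

    def write(self, addr, value):
        self.bus.invalidate_others(addr, self)
        self.lines[addr] = {"value": value, "dirty": True}
```

With two caches `a` and `b` on one bus, `a.write(0x10, 5)` followed by `b.read(0x10)` forces `a` to write its dirty copy back before `b` loads the word.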
Even with the best caching, the number of CPUs is limited to at most 32.
1.2.2 Using Crossbar switches
Using crossbar switches, up to 100 CPUs can be connected. A crossbar switch connects n CPUs to k memories. For more CPUs the network becomes too complicated and very expensive, since n*k crosspoint switches are needed.
A crosspoint can be opened or closed. If the crosspoint (i,j) is closed the i-th CPU is connected to the j-th memory block.
Using this architecture, many CPUs can access memory at the same time (in parallel). The crossbar switch is a non-blocking network: no CPU is denied a connection because a crosspoint or line is already in use. Only if CPU A tries to access a memory block already being accessed by another CPU B does CPU A have to wait until CPU B finishes its access.
One of the biggest disadvantages is that the network grows as O(n^2).
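A minimal sketch of the crossbar, assuming each memory block serves one CPU at a time and ignoring everything else about the hardware. The `Crossbar` class and its method names are illustrative only.

```python
class Crossbar:
    """n CPUs connected to k memory blocks through n*k crosspoints."""

    def __init__(self, n_cpus: int, k_memories: int):
        self.n = n_cpus
        self.k = k_memories
        self.closed = {}  # memory index -> CPU index whose crosspoint is closed

    def crosspoints(self) -> int:
        return self.n * self.k

    def connect(self, cpu: int, memory: int) -> bool:
        """Close crosspoint (cpu, memory); False means the CPU has to wait."""
        if memory in self.closed:
            return False  # another CPU is accessing this memory block
        self.closed[memory] = cpu
        return True

    def release(self, memory: int) -> None:
        """Open the crosspoint again when the access has finished."""
        self.closed.pop(memory, None)
```

Two CPUs that address different memory blocks both connect immediately; a second request for the same block is refused until `release` is called.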
1.2.3 Using Multistage Switching Networks
A different multiprocessor architecture can be achieved using the humble 2x2 switch.
The message arrives on either input A or B and is routed, according to its header information, to output X or Y. The message header consists of 4 fields:
- Module: Tells which memory to use
- Address: Specifies an address within the module
- OpCode: Operation (read or write)
- Value: Optional operand, such as the word to be written
The switch looks at the Module-field and decides if the message should be sent to output X or Y.
Larger networks can be built from the humble 2x2 switch. One possibility is the omega network (its wiring pattern is also called the perfect shuffle). An omega network connecting n CPUs to n memories needs log2(n) stages with n/2 switches per stage. Instead of the n^2 switches of a crossbar network, only (n/2)*log2(n) switches are needed.
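To make the saving concrete, the two switch counts can be compared directly. This is a small sketch that assumes n is a power of two; the function names are illustrative.

```python
def crossbar_switches(n: int) -> int:
    """Crosspoints needed to connect n CPUs to n memories with a crossbar."""
    return n * n


def omega_switches(n: int) -> int:
    """2x2 switches in an omega network: log2(n) stages of n/2 switches."""
    stages = n.bit_length() - 1  # equals log2(n) when n is a power of two
    if n < 2 or (1 << stages) != n:
        raise ValueError("n must be a power of two and at least 2")
    return (n // 2) * stages
```

For n = 1024 the crossbar needs 1,048,576 crosspoints, while the omega network needs 5,120 switches.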
The image illustrates how two CPUs, A (001) and B (011), access two different memory blocks, M (001) and N (110).
CPU A puts 001, the number of memory module 001, into the Module field of the message. The first stage looks at the first bit, which is 0, and activates output line 1B.X. The second-stage switch 2C analyzes the second bit, which is also 0, and activates output line 2C.X. The last bit is analyzed by the last stage: switch 3A activates output 3A.Y, since the last bit is 1.
If another CPU wants to access memory 001 at the same time, it has to wait. The omega network is therefore a blocking network.
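The routing rule can be sketched as follows. The code assumes the usual omega wiring, with a perfect shuffle (a one-bit left rotation of the line number) in front of every stage, and it identifies positions by output line number rather than by the switch labels of the figure. `omega_route` and `conflicts` are hypothetical helper names.

```python
def omega_route(source: int, module: int, bits: int = 3) -> list[int]:
    """Return the output line a message occupies after each stage."""
    mask = (1 << bits) - 1
    line = source
    path = []
    for stage in range(bits):
        # Perfect shuffle: rotate the line number left by one bit.
        line = ((line << 1) | (line >> (bits - 1))) & mask
        # The switch reads one bit of the Module field, most significant first:
        # 0 selects the upper output (X), 1 selects the lower output (Y).
        bit = (module >> (bits - 1 - stage)) & 1
        line = (line & ~1) | bit
        path.append(line)
    return path


def conflicts(path_a: list[int], path_b: list[int]) -> bool:
    """Two messages block each other if they need the same line after a stage."""
    return any(a == b for a, b in zip(path_a, path_b))
```

Under these assumptions the two accesses from the example, 001 to 001 and 011 to 110, use disjoint lines, while a second request for memory 001 collides with the first one and has to wait.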
1.2.4 NUMA Multiprocessors
To connect more than 100 CPUs, another approach is needed. Like UMA architectures, NUMA uses a single address space across all CPUs, but the memory is divided into local and remote memory. Access to local memory is much faster than access to remote memory.
There are two types of NUMA architectures: NC-NUMA (no-cache NUMA) and CC-NUMA (cache-coherent NUMA). The most popular is CC-NUMA with a directory-based architecture. The idea is to maintain a database that tells where each cache line is and what its status is.
1.3 Operating System Types
There are different ways for an operating system to handle multiprocessors.
- Each CPU has its own OS. This is the simplest way.
  - A system call of an application is automatically handled by the OS of the CPU it runs on.
  - Scheduling is done by each processor itself.
  - Each CPU has its own statically assigned memory.
  - If the OS maintains buffer caches (of recently used disk blocks), the buffers might become inconsistent, since other CPUs might change content on the disk. Buffer caches should not be used.
- Master-Slave Multiprocessors. The OS runs on the master, and all client processes run on slave processors. If a processor is idle, it can ask the master for a new process to run.
  - It cannot happen that one slave is overloaded while another is idle.
  - The buffer cache is maintained only by the master processor (OS).
  - All system calls are handled by the master processor. The master processor becomes a bottleneck for more than 10 CPUs.
- Symmetric Multiprocessors. The OS is in memory and every processor can run it.
  - The OS must be redesigned: it is split into independent sections, each protected as a critical section by its own lock.
1.4 Multiprocessor synchronization
On a single-CPU machine, a system call may have to access a table containing the critical section locks. There it was sufficient to disable interrupts before accessing the table.
On a multiprocessor system this simple mechanism does not work, since other processors can still access the table. Therefore some other synchronization is needed. A proper mutex protocol must be used and respected by all CPUs to guarantee that mutual exclusion works.
- Synchronization using TSL (Test and Set Lock) without a bus lock. The instruction needs two bus cycles (one read, one write) and is therefore not an atomic operation. In the worst case more than one CPU can read the lock as free and set it, and mutual exclusion fails.
- Synchronization using TSL with a bus lock. The TSL instruction first locks the bus, reads the word, compares it, and may write back a nonzero value. At the end the bus is unlocked. This prevents other CPUs from using the bus at the same time. Since each waiting CPU loops on this lock (a spin lock), a massive load is put on the CPU and the bus.
- To reduce bus traffic, a preread can be done: the CPU first checks with an ordinary read whether the lock is free before using TSL.
- A backoff algorithm (like the one Ethernet uses) is also possible. A CPU that fails to get the lock first waits one instruction, then two, then four, and so on, up to a maximum.
- A FIFO list of waiting CPUs, each looping on a private lock. The first CPU in the list is the lock holder. When it finishes, it releases the private lock of the next CPU. This scheme is efficient and starvation free.
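The preread and backoff ideas from the list above can be combined in a rough sketch. Python has no TSL instruction, so the code assumes a stand-in: the `TSLWord` class uses an internal `threading.Lock` to play the role of the bus lock that makes test-and-set indivisible. All class names, function names, and delay values are illustrative, and threads stand in for CPUs.

```python
import threading
import time


class TSLWord:
    """A memory word with an indivisible test-and-set (simulated)."""

    def __init__(self):
        self.value = 0
        self._bus = threading.Lock()  # stand-in for locking the bus

    def test_and_set(self) -> int:
        with self._bus:
            old = self.value
            self.value = 1
            return old

    def clear(self) -> None:
        self.value = 0


class SpinLock:
    def __init__(self, max_delay: float = 0.001):
        self._word = TSLWord()
        self._max_delay = max_delay

    def acquire(self) -> None:
        delay = 0.000001
        while True:
            # Preread: use TSL only when an ordinary read shows the lock free.
            if self._word.value == 0 and self._word.test_and_set() == 0:
                return
            # Backoff: wait, then double the waiting time up to a maximum.
            time.sleep(delay)
            delay = min(delay * 2, self._max_delay)

    def release(self) -> None:
        self._word.clear()


def count_with_lock(n_threads: int = 4, increments: int = 500) -> int:
    """Increment a shared counter from several threads under the spin lock."""
    lock = SpinLock()
    counter = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            try:
                counter[0] = counter[0] + 1  # critical section
            finally:
                lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Because every increment happens inside the critical section, `count_with_lock()` returns exactly `n_threads * increments`.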
Note: TSL Instruction
The TSL (Test and Set Lock) instruction reads the content of a memory word into a register and then stores a nonzero value at that memory address. The operation is indivisible: it locks the bus until it finishes, so no other CPU can access memory while it executes.
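enter: TSL R0, lock    ; Copy lock to R0 and set lock to a nonzero value
       CMP R0, #0      ; Was the lock free (zero)?
       JNE enter       ; No: another CPU holds it, so try again
       ...             ; Critical section: do some work here
       MOVE lock, #0   ; Release lock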