Wednesday, November 30, 2011

Moore's Law and Computer Simulation in Microchip Design

Many people have heard of Moore's Law, the observation that "the number of transistors that can be placed inexpensively on a microchip will double every two years," and assume that it drove the performance improvements in microchips over the past forty years.
Only recently did I learn the role that computer simulation and queueing theory played in those improvements. It turns out that early microchips were remarkably inefficient. For example:
  • The microchip runs faster than the hard drive and non-cache memory can deliver inputs... even today!
  • Tasks often get stuck behind other tasks, so a 'clog' can drastically impact performance.
  • Very often the microchip has nothing to do, so that time is simply wasted (the sketch after this list shows how quickly those idle cycles add up).
  • You can run the exact same routine a dozen times, and the processor doesn't realize it could reuse the answer it already calculated.
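To make the first and third problems concrete, here is a minimal sketch (my own illustration, not from Meyers, with made-up cycle counts) of a CPU that retires one instruction per cycle but sits in wait states whenever an instruction has to go out to slow memory:

    import random

    # Toy model with assumed numbers: one instruction per cycle, but any
    # instruction that misses the cache waits on main memory.
    INSTRUCTIONS = 100_000
    MEMORY_LATENCY_CYCLES = 20     # assumed penalty for reaching main memory
    MISS_RATE = 0.05               # assumed fraction of instructions that miss

    random.seed(1)
    busy, stalled = 0, 0
    for _ in range(INSTRUCTIONS):
        busy += 1                  # one cycle of useful work
        if random.random() < MISS_RATE:
            stalled += MEMORY_LATENCY_CYCLES   # CPU idles in wait states

    print(f"Useful cycles: {busy}, stalled cycles: {stalled}")
    print(f"Utilization: {busy / (busy + stalled):.0%}")

Even with 95 percent of instructions served quickly, the assumed 20-cycle memory penalty leaves this toy CPU idle roughly half the time, which is exactly the kind of waste the improvements below attack.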
Next I'll describe some of the efforts made to correct these issues.

Scheduling Improvements

  • Pipelining: CPUs spend much of their time doing arithmetic, so Intel created one pipeline for arithmetic computations and another pipeline for all non-mathematical operations. This proved so successful that most modern microchips have eight different pipelines. (Meyers 141)
  • SRAM/L1 Caching (SRAM is 10x-50x faster than regular RAM): The CPU is so fast that regular RAM can't feed it information fast enough, which forces the CPU into wait states. To avoid wait states, Intel put SRAM in Pentium chips to feed the CPU the most commonly calculated pieces of data. "This SRAM would preload as many instructions as possible and would keep copies of already run instructions and data in the hope that the CPU would need to work on them again. The SRAM cache inside the CPU was tiny, only about 16KB, but it improved performance tremendously." (142) 
  • L2 Caching: Caching improved performance so much "that motherboard makers began adding a cache directly to the Pentium motherboards." (143) This L2 cache eventually migrated onto the chip from the motherboard. "It's tempting to ask why processor manufacturers didn't just include bigger L1 caches instead of making onboard L1 and L2 caches. The answer is that a very small L1 and a larger L2 are much more efficient than a single fast L1." (143)
  • Branch prediction: "a process whereby the CPU attempted to anticipate program branches before they got to the CPU" (143). "The [Pentium Pro] improved on the Pentium's branch prediction by adding a far more complex counter that would predict branches with a better than 90-percent success rate." (146) (A toy counter of this kind is sketched just after this list.)
  • Superscalar Execution: "The ability to run more than one instruction in any one clock cycle." (145)
  • Out-of-order processing: "When the [Pentium Pro] was forced into wait states, it took advantage of the wait to look at the code in the pipeline to see if it could run any commands while the wait states were active. If it found commands it could process that were not dependent on the data being fetched from DRAM, it ran these commands out of order, a feature called out-of-order processing." 
  • Clock Speed Multiplying (up to 30x improvement): The CPU can run between 2x and 30x faster than it can communicate with the rest of the computer.
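To give a flavor of how such mechanisms get evaluated before any silicon is built, here is a small simulation of a two-bit saturating-counter branch predictor, one common member of the "more complex counter" family the Pentium Pro quote alludes to. The predictor and the loop-heavy branch trace are my own assumptions, not Meyers':

    import random

    def two_bit_predictor(trace):
        """Accuracy of a 2-bit saturating counter (states 0-3).
        States 0-1 predict 'not taken'; states 2-3 predict 'taken'."""
        state, correct = 2, 0
        for taken in trace:
            if (state >= 2) == taken:
                correct += 1
            # Nudge the counter toward the observed outcome, saturating at 0 and 3.
            state = min(state + 1, 3) if taken else max(state - 1, 0)
        return correct / len(trace)

    # Synthetic workload: loops whose backward branch is taken many times,
    # then falls through once when the loop exits (assumed trace).
    random.seed(1)
    trace = []
    for _ in range(2_000):
        trace.extend([True] * random.randint(5, 50) + [False])

    print(f"Prediction accuracy: {two_bit_predictor(trace):.1%}")

With these assumed loop lengths the counter mispredicts only the exit branch of each loop, so its accuracy comfortably clears the 90 percent figure quoted above.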

What is simulation and how can it create advantage?


People have used mathematical models of complex systems for decades. The vast majority of these models rely on assumptions such as memorylessness and the independence of arrival times. These models are fantastic at representing human behavior, financial markets, physics, and natural phenomena, but they are poor at representing the traffic inside a computer, where arrivals come in bursts and one task's timing depends heavily on another's. So how do we address this problem? With simulation.
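To make "memorylessness" concrete: the classic formulas assume exponential interarrival times, for which the chance of waiting another t seconds does not depend on how long you have already waited. The quick check below is my own illustration (the rate and thresholds are arbitrary); it is precisely this property that bursty, dependent traffic inside a computer violates:

    import random

    # Memorylessness of exponential interarrival times:
    # P(X > s + t | X > s) should equal P(X > t).
    random.seed(1)
    RATE = 1.0           # assumed arrival rate (arrivals per time unit)
    s, t = 2.0, 1.0      # arbitrary thresholds for the comparison
    samples = [random.expovariate(RATE) for _ in range(1_000_000)]

    p_uncond = sum(x > t for x in samples) / len(samples)
    survivors = [x for x in samples if x > s]
    p_cond = sum(x > s + t for x in survivors) / len(survivors)

    print(f"P(X > t)           = {p_uncond:.3f}")
    print(f"P(X > s+t | X > s) = {p_cond:.3f}  # the same: the past is forgotten")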



Computer simulation is the use of mathematical models to imitate real-world phenomena. It is much cheaper to model a system digitally than to physically perform experiments, and there are many occasions where the mathematical models cannot be solved by hand. The solution, therefore, is to model the system, run the simulation a million times, and examine the distribution of outcomes. The better the model, the more faithfully that distribution reflects the real system, and vice versa.
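Here is a minimal sketch of what that looks like in practice. The post doesn't name a model, so I'm using the textbook single-server M/M/1 queue, whose memorylessness assumptions give a closed-form average time in system of W = 1/(mu - lambda). A few hundred runs stand in for the "million," and the arrival and service rates are assumptions:

    import random
    import statistics

    def simulate_mm1(arrival_rate, service_rate, n_customers, rng):
        """Average time in system for one run of an M/M/1 queue,
        using the Lindley recursion for successive waiting times."""
        wait, prev_service, total_time = 0.0, 0.0, 0.0
        for _ in range(n_customers):
            interarrival = rng.expovariate(arrival_rate)
            service = rng.expovariate(service_rate)
            # This customer waits for whatever work was left when it arrived.
            wait = max(0.0, wait + prev_service - interarrival)
            total_time += wait + service
            prev_service = service
        return total_time / n_customers

    LAMBDA, MU = 0.8, 1.0          # assumed arrival and service rates
    rng = random.Random(1)
    runs = [simulate_mm1(LAMBDA, MU, 10_000, rng) for _ in range(200)]

    print(f"Analytical W = 1/(mu - lambda) = {1 / (MU - LAMBDA):.2f}")
    print(f"Simulated W: mean {statistics.mean(runs):.2f}, "
          f"stdev {statistics.stdev(runs):.2f} over {len(runs)} runs")

The spread across runs is the "distribution of outcomes" referred to above. When the assumptions hold, the simulation agrees with the formula; when they don't, as with the dependent traffic inside a chip, the simulation still works while the formula no longer applies.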


Commentary


Non-Intuitive Results: You can convince me that an L1 cache makes intuitive sense. It's not intuitive that a small L1 plus an L2 beats a single big L1, though, and I still have difficulty believing that 8 pipelines are faster than 1 for any reason beyond load balancing. Based on these examples, I'd say that simulation is necessary to overcome the limits of intuition when more elegant mathematical solutions also fail.
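One back-of-the-envelope way to see why the split cache can win (my own illustrative numbers, not Meyers'): average memory access time is the hit time plus the miss rate times the miss penalty, applied level by level, and a big L1 has to be slower on every access simply because it is big:

    # Average memory access time (AMAT), with assumed illustrative numbers.
    MEMORY_LATENCY = 100                    # cycles to main memory (assumed)

    # Option A: one large L1. Few misses, but the bigger array is slower to hit.
    big_l1_hit, big_l1_miss_rate = 4, 0.02
    amat_big = big_l1_hit + big_l1_miss_rate * MEMORY_LATENCY

    # Option B: a tiny, fast L1 backed by a larger, slower L2.
    l1_hit, l1_miss_rate = 1, 0.10          # 10% of accesses miss the tiny L1
    l2_hit, l2_miss_rate = 6, 0.20          # 20% of those also miss the L2
    amat_split = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * MEMORY_LATENCY)

    print(f"Single large L1: {amat_big:.1f} cycles per access")
    print(f"Small L1 + L2:   {amat_split:.1f} cycles per access")

With these made-up numbers both designs send the same 2 percent of accesses all the way to memory, yet the split hierarchy averages 3.6 cycles per access against 6.0 for the single large cache, because the overwhelmingly common case (an L1 hit) is four times faster. Whether real workloads behave like these numbers is exactly the kind of question a simulation answers.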

Competitive Advantage: Beyond looking at the benefits in hardware terms or product quality, I think we need to ask, "What would have happened if Intel or AMD had implemented a queueing improvement that the other missed?" Depending on the time required to retool a factory, such a miss could have killed the company that fell behind.



Other Reasons I Like Simulation:

  • Simulation is a form of experimentation.
  • Simulation is a form of testing.
  • Simulation is cheap.
  • Simulation speeds up product development.
  • Simulation improves learning.
  • Simulation is the easiest way to model and understand complex systems.
  • Simulation is the easiest way to model dependent systems.


QM-ROI: By my count, queueing-theory improvements were responsible for half of the performance improvement in chips over the past 40 years, and queueing theorists are cheap compared to a $5 billion chip plant. Needless to say, anytime you're investing a billion dollars in a physical system that involves scheduling, get expert advice. It could save you millions in factory productivity, hardware performance, logistics expense, and more.

Meyers, Mike. Managing and Troubleshooting PCs. Third Edition. McGraw-Hill, 2010.