The Great Debate

Published on 2023-09-23.

Will compilers ever produce code as good as an expert assembly language programmer?

by Randall Hyde

This is a reproduction of a series of essays originally published by Randall Hyde here around 1996. These essays seem to be mostly lost to time, as even the Web Archive does not have a copy of them. A mirror can still be found on an old Case Western Reserve University computer architecture course's resources page, but the latter has not seen any update for over twenty years and could well disappear tomorrow. I think these essays offer an interesting perspective on assembly language and programming in general, so they deserve to be preserved for the foreseeable future.

Table of contents

Do compilers produce code as good as humans? Will compilers ever produce code as good as humans? Is it worth it?
  1. Part I: Introduction, and answer to the basic question
    1. The "proof by example" myth
    2. The law of diminishing returns
    3. The architecture is king
    4. The "myopic" debater
    5. Is assembly language useful for writing portable code?
    6. Is assembly language easier to read than HLL code?
    7. Is assembly language harder to maintain than HLL code?
    8. Is assembly language easier to write than HLL code?
    9. Is assembly language practical on RISC machines?
    10. Is it easy to implement algorithm xyz in assembly language?
    11. How many people know assembly language?
    12. Will code optimized for one member of a CPU architecture be optimal for another?
    13. More to come...
  2. Part II: Economic concerns
    1. Economic concerns
    2. The "programmer's ego"-centric view of software economics
    3. The 90/10 (or 80/20) rule and other software engineering myths
      1. Myth #1a: You only need to rework 10% of your code
      2. Myth #1b: You only need to rework 10% of your code
      3. Myth #1c: You only need to rework 10% of your code
    4. The "rule of fifths" (the 20/20 rule)
    5. Assembly language isn't intrinsically hard to write
    6. Amortization
  3. Part III: Arithmetic
  4. Part IV: Fast enough isn't
  5. Part V: Levels of optimization

Part I: Introduction, and answer to the basic question

The Great Debate is a very emotional exchange that has been running continuously since the late 70s. On some newsgroup somewhere you will find a thread discussing this topic, although you will have the best luck looking in the comp.lang.asm.x86, comp.lang.c, alt.assembly, or comp.lang.c++ newsgroups. Of course, almost anything you read in these newsgroups is rubbish, no matter what side of the argument the author is championing. Because this debate has been raging for (what seems like) forever, it is clear there is no easy answer; especially not one that someone can make up off the top of their head (this describes about 99.9% of all postings to a Usenet newsgroup). This page contains a series of essays that discuss the advances in compilers and machine architectures in an attempt to answer the above question.

Although I intend to write a large number of these essays, I encourage others, even those with opposing viewpoints, to contribute to this exchange. If you would like to contribute a well thought out, non-emotional essay to this series, please send your contribution (HTML is best, ASCII text is second best) to debate@webster.ucr.edu.


To begin with, I would like to address the topic of the debate itself. This is very important, because so many discussions about this subject quickly get off-track and people begin arguing about other questions still assuming that they are arguing about whether compilers can produce better code than humans. Here are some of the questions whose answers people try to apply to the above question:

  1. Is it cost-effective to write code in assembly language?
  2. Is assembly language useful for writing portable code?
  3. Is assembly language harder
    1. to read
    2. to maintain
    3. to write
    than HLL code?
  4. Is assembly language practical on RISC machines?
  5. Is it easy to implement algorithm xyz in assembly language?
  6. How many people know assembly language?
  7. Will compilers ever produce code as good as a newbie assembly programmer?
  8. Will compilers ever produce code as good as an intermediate assembly programmer?
  9. Will code optimized for one member of a CPU architecture be optimal for another?
etc.

While these questions are interesting in their own right, and many essays in this series will address the questions they raise, keep in mind that these questions may have different answers than the original question "Will compilers ever produce code as good as an expert assembly language programmer?" Switching the question mid-way through an argument is not a good way to win that argument; indeed, it's a tacit admission of failure.

So let's put the "theoretical" answer to this question to rest right away: "Will compilers ever produce better code than humans?"

The answer is an obvious "NO!" The reason is quite simple; humans can look at the output of a compiler and improve upon it. Therefore, there is no theoretical reason for the statement "compilers will get to the point that they produce better code." People can always observe what the compiler is doing, learn from this, and adjust their code accordingly. Therefore, the absolute best that a compiler could do is produce code that is equivalent to the best code an expert assembly language programmer could produce.

"That's not fair..." you might claim. "Why should humans be allowed to look at the compiler's output?"

Well, life isn't fair. If we're talking about producing the absolute best code, it's perfectly reasonable for a compiler to use every piece of information at its disposal; likewise, it's perfectly fair for humans to take the same approach. And one piece of the information at a human's disposal is the output from a good compiler. The fact that compilers cannot respond in kind and rip off the good ideas from a human programmer is the primary reason why a compiler will not be able to better the output of a human being.

Note, by the way, I cannot absolutely defend the converse statement: "Will an expert assembly programmer always be able to produce better code than a compiler?" Conceivably, compilers could get to the point that this is not true. It's very unlikely that compilers could achieve this status, but it is imaginable.

Of course, if a compiler could always produce code that matches what the best humans could produce, one could argue that assembly language is obsolete purely on economic terms (unquestionably, it is generally more expensive to develop optimal assembly software than HLL code). The whole reason the argument continues to this day is that compilers aren't even close to this goal.

Some related questions:

7. "Will compilers ever produce code as good as a newbie assembly programmer?"

They already do. Indeed, in most cases the output of a C/C++ compiler (whose input is reasonable quality C/C++ code) is quite a bit better than the hand-written code produced by someone who has just learned assembly language. Indeed, this is the reason most people feel compilers do such a good job today—they learn just enough assembly to put a working assembly language program together and then are surprised to find that a C/C++ compiler beats the pants off their code, despite assembly language's legendary performance benefits. Such programmers then spend the rest of their lives praising the quality of compiler output, when, in fact, they do not have the experience to make such claims.

8. "Will compilers ever produce code as good as an intermediate assembly programmer?"

They already do. An average assembly language programmer can always beat a large number of compilers out there. However, there are many high-performance compilers that will beat the pants off an intermediate programmer.

"Will compilers ever produce code as good as an advanced assembly language programmer?" This is the question answered above (the answer is no). However, it is important to point out that an advanced assembly language programmer has to take considerable care. If such a programmer is careless or gets lazy, the compiler might produce better code. This is, perhaps, the best argument for using both HLL and assembly code in the same program. The advanced assembly language programmer can concentrate on the important sections of code and leave the lesser important pieces for the compiler to work on.

The "proof by example" myth

Often, someone will try to "prove" to me that compilers produce really good code by taking some HLL sequence of statements and showing me the outstanding assembly sequence the compiler produces. Folks, there is only one way to prove an "always" claim by examples; that's by enumerating all possibilities and showing the condition to be true for all such possibilities. Since there are (for all practical purposes) an infinite number of possible programs one can write, you will not be able to prove a compiler's worthiness by example.

Note that the general question is not "Can a compiler produce a code sequence that is as good as (or better than) a human would?" I consider myself to be an expert assembly language programmer. However, I am not ashamed to admit that I've learned some assembly language tricks by studying the output of various compilers. Just because I wrote some assembly sequence (without looking at some compiler's output) and you found a compiler that bests me doesn't make the compiler better than me. If you feed the same input to a compiler twice, you will always (assuming a deterministic program) get the same output. Give the same problem to an assembly language programmer twice and you're likely to get two different solutions. One will probably be better than the other. Have compilers ever beaten me? Yes. They do it all the time. But on different code sequences I beat the compiler every time. If I really apply myself, I can beat the compiler every time.

Another problem with the "Proof by example" myth is the fact that pro-compiler types will often use the output of several different compilers to boost their arguments. That is, given three or four different algorithms/code sequences, they may run the code through several different compilers and pick the best output. The problem with this approach is that no single compiler implements everything in the best possible fashion. Some compilers will excel in one area and totally suck in another. This tends to hide the fact that compilers often fail miserably at some things that a human would handle automatically.

The argument for this policy is simply "Well, if existing compilers can do all these good things separately, surely we can merge the best of these compilers into a single product and have something really great." This line of reasoning fails for three reasons:

  1. Some optimizations are mutually exclusive. That is, if you perform one type of optimization you cannot perform some other type of optimization on the code. If the "best" example from one compiler uses an optimization technique that is mutually exclusive with the "best" example from a different compiler on a different problem, it may not be possible to merge those two techniques into the same product.
  2. Don't forget that most compilers are commercial products. The quality of the optimizer is often a trade secret and other vendors may not be able to directly clone an optimization technique.
  3. Even if two optimizations are not mutually exclusive, putting the two of them into the same program could produce difficult to maintain code or severely impact the performance of the compiler.

Software engineers have been promising for 20 years now that compilers would merge all known techniques into a single product and we'd have really great compilers someday soon. Compilers have gotten better, but they're still a long way from perfect.

Note that it is possible to disprove a theory with a single example. Therefore, if you want to claim that compilers can always produce better code than humans, all I've got to provide is one example of the contrary. Proving that compilers, on the average, produce better code than an expert assembly language programmer is far more difficult.

The law of diminishing returns

Perhaps the best indication of how well compilers in the future will operate is the past. By looking at how much compilers have improved their code generation capabilities over the past several decades, we can anticipate how much better they will get over the next decade.

In the late 70s and early 80s there was a flurry of activity with respect to the production of optimizing compilers. The result was quite impressive. In ten years, the performance of the code produced by compilers for a given language doubled, tripled, or improved by an even greater factor. Coincidentally, it was during this time that "Software Engineering" came of age and people began to move away from assembly language because compilers promised high performance with less work.

Unfortunately, compilers in the later 80s and early 90s failed to produce the dramatic improvements seen in the late 70s and early 80s. Indeed, major performance improvements during this time period came from architectural improvements to the CPU rather than any great advance in compiler technology. Whereas performance gains in the 100-500% area were common with the first wave of microprocessor compilers, the improvements dropped well below 100% in the second wave of products (late 80s and early 90s). Today, compiler writers are struggling to achieve gains of 15–30%. Computer architects aren't doing much better. Compiler writing is a fairly mature science at this point. It is very unlikely (short of someone proving that P=NP) that we will ever again see impressive gains in compiler technology with respect to raw performance improvement.

Therefore, it is very dangerous to extrapolate from the past performance of compiler writers in order to predict how much faster the code produced by compilers ten years from now will run. Unless there is a radical shift in computer architectures that favors HLLs at the expense of assembly language, it is unlikely the performance gap between good HLL programs and good assembly language programs will become much narrower. Indeed, the only real thing left to do is to consolidate as many optimizations as possible into a single compiler (we are a long way from this today). This will probably improve performance by another 50% on the average.

The architecture is king

As I mentioned in the previous section, most of the big performance gains over the past 20 years have been due to architectural improvements, not to compiler improvements. The mere fact that we've gone from a 5MHz 8088 to a 200MHz Pentium Pro in a high-end PC in 15 years has a lot more to do with the speed of software today than with the quality of compilers. While certain technologies, such as RISC, have closed the gap between human-based machine code output and compiler-based machine code output, the performance boost by compilers pales in comparison to that provided by the newer hardware.

The "myopic" debator

Another problem with contributors to the Great Debate is the limited exposure many people have. If you get involved in a thread arguing the relative merits of assembly language vs. C, you will often find the pro-HLL types leading the charge are UNIX programmers. Now I don't want to pigeon-hole all UNIX programmers, but the types I've seen making the argument against assembly language have very little experience outside the UNIX (or mainframe) O/S arena. I think that one could make a very good case that assembly language is a bad thing to use under UNIX. Does that mean assembly language isn't useful elsewhere? Gee, some programmers wearing UNIX blinders sure seem to think so.

Before you start coming up with reasons why assembly language is not a practical tool, make sure you state the domain in which you operate. Claiming "Code doesn't really need to get any faster" or "We don't need to worry about saving memory" are fine arguments when you're working on a 500MHz DEC Alpha with 1 GByte main memory installed. Are the claims you're making for your environment going to apply to the engineer trying to convince a Barbie doll that it should talk using a $0.50 microcomputer system? Keep in mind, it's the C/C++ (and other HLL) programmers arguing that you should never have to use assembly. The assembly programmers never (okay, rarely) argue that you should always use assembly1. It is very difficult to defend a term like "never". It is very easy to defend a term like "sometimes" or "occasionally." Just because you've never been forced to use assembly language in order to achieve some goal doesn't mean it is always possible to avoid assembly. Be careful about those blinders you're wearing when arguing against assembly.

Is assembly language useful for writing portable code?

Okay, it seems like a stupid question. Obviously, any code written in assembly language is going to have a difficult time running on a different processor (it may not even run efficiently on a processor that is a member of the processor family for which the original code was written). Worse still, you will have to learn several different assembly languages in order to move your code amongst processors. While learning a second or third assembly language is much easier than learning your first, learning all the idiosyncrasies that you must know to write fast code still requires quite a bit of work. So it seems that porting code involving assembly language is not a brilliant idea.

On the other hand, Software Engineering Researchers typically point out that coding represents only about 30% of the software development effort. Even if your program were written 100% in pure assembly language, one would expect that it would require no more than 40% of the original effort to completely port the code to a new processor (the extra 10% provides some time to handle bugs introduced by typos, etc.).

Perhaps you're thinking 40% is pretty bad. Keep in mind, however, that porting C/C++ code doesn't take zero effort; particularly if you switch operating systems while porting your code. If you're the careful type, who constantly reviews their code to ensure it's portable, you're simply paying this price during the initial development rather than during the porting phase (and there is a cost to carefully writing portable code). I am not trying to say that it is as easy to port assembly code as it is to port C/C++ code, I'm only saying that the difference isn't as great as it seems. This is especially true when porting code between operating systems that have different APIs (e.g., porting between flavors of UNIX is easy; now try Unix → Windows → Macintosh → OS/400 → MVS → etc.).

Is assembly language easier to read than HLL code?

Is assembly language easier to read than HLL code? Being an expert assembly language programmer and a fairly accomplished C programmer, I find my own assembly language programs only slightly more difficult to read than my own C programs. On the other hand, I generally take great pains to structure my source code so that it is fairly easy to read. I will say this—I've seen some assembly code out there that is absolutely unreadable. Of course, I've also seen my share of C/C++ code that looks like an explosion in an alphabet soup factory.

Of course, only the person doing the reading can really make this judgment call. Obviously, if you know assembly but don't know C/C++, you'll find assembly is easier to read. The reverse is also true. I happen to know both really well and I find a well-written C/C++ program a little easier to read than an assembly language program. Poorly written examples in both languages are so bad they are incomparable. Once a program is unreadable, it is difficult to determine how unreadable it is.

Quick quiz: What does the following C statement do and how long did it take you to figure this out?

*(++s) && *(++s) && *(++s) && *(++s);

Most people (who know 80x86 assembly) would find the corresponding 80x86 code much more precise and readable:

        mov bx, s
        mov al, 0
        inc bx
        cmp al, [bx]
        jz Done

        inc bx
        cmp al, [bx]
        jz Done

        inc bx
        cmp al, [bx]
        jz Done

        inc bx
Done:
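
For the record, the C statement advances s by up to four characters, stopping early if it hits a zero byte (the && operators short-circuit). A plain-C sketch of the same logic, using a hypothetical helper name and assuming s points somewhere before the terminating NUL of a zero-terminated buffer, looks something like this:

/* A sketch of what the quiz statement does: advance s by up to four
   characters, stopping early at the first NUL byte.  Assumes s points
   somewhere before the terminating NUL of the buffer. */
char *advance_up_to_four(char *s)
{
    int n;

    for (n = 0; n < 4; n++) {
        ++s;
        if (*s == '\0')
            break;
    }
    return s;
}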

Is assembly language harder to maintain than HLL code?

This notion exists because people tend to save assembly language programming for the very time critical (and often complex) components of their program. Obviously, if you've spent a lot of time and effort arranging the instructions in a certain sequence to ensure the pipeline never stalls, and then you discover that you need to modify the computation that is going on, the new changes will require a lot of work, since you will have to reschedule each of the instructions.

Of course, it never occurs to people that similar low-level optimizations that occur in HLL programs are very difficult to maintain as well. Consider the well-written (from a performance point of view) Berkeley string routines. These routines need to be completely redone if you move from a 32-bit processor to a 16-bit processor or a 64-bit processor.
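
To make the point concrete, here is a sketch (not the actual Berkeley source) of the kind of word-size-dependent trick such string routines are built around: testing a whole 32-bit word for a zero byte at once instead of examining the bytes one at a time. The magic constants assume a word of exactly four 8-bit bytes; a 16-bit or 64-bit port has to rework both the constants and the code that steps through memory a word at a time.

#include <stdio.h>

/* Nonzero if any of the four bytes in the 32-bit word v is zero.
   The constants are tied to a four-byte word; other word sizes need
   different constants (and different callers). */
static unsigned long has_zero_byte(unsigned long v)
{
    return (v - 0x01010101UL) & ~v & 0x80808080UL;
}

int main(void)
{
    printf("%d\n", has_zero_byte(0x41424344UL) != 0);  /* "ABCD"     -> prints 0 */
    printf("%d\n", has_zero_byte(0x41004344UL) != 0);  /* NUL inside -> prints 1 */
    return 0;
}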

As a general rule, any code that is optimized is difficult to maintain. This has led to the proverb "Early optimization is the root of all evil." People perceive that it is difficult to maintain assembly code mainly because the assembly code they've had to deal with is generally optimized code.

What if we don't go in and pull every unnecessary cycle out of a section of assembly code? Will the code be easier to maintain? Sure. For the same reason non-optimal C code is easy (?) to maintain.

Of course, one of the primary reasons for using assembly language is to reduce the use of system resources (i.e., to optimize one's program). Therefore, when using assembly language in place of a HLL, you're typically going to be dealing with hard to maintain code. Don't forget one thing, however: had you chosen to continue using a HLL rather than dropping down into assembly language, the optimization that would have been necessary in the HLL would have produced hard to maintain HLL code. Keep in mind, optimization is the root of the problem, not simply the choice of assembly language.

Is assembly language easier to write than HLL code?

Is assembly language easier to write than HLL code? There are certain algorithms that, believe it or not, are easier to understand and implement at a very low level. Bit manipulation is one area where this is true. Also see the section on floating point arithmetic later in this document for more details.

Is assembly language practical on RISC machines?

I personally don't know, not having really learned assembly language on a RISC chip. I have certainly heard of individuals who have written some butt-kicking code in assembly on a RISC, but this is generally third-hand knowledge. I do know this, though. One of the design principles behind the original RISC design was to study the instructions a typical compiler would use and throw out all the other instructions in a typical CISC instruction set. This suggests that an assembly language programmer has less to work with on RISC chips than on CISC machines. Nevertheless, I will not comment on this subject since I don't have any first-hand experience. I invite those who have mastered RISC assembly to write a guest essay for this series.

Is it easy to implement algorithm xyz in assembly language?

That depends entirely on the algorithm. Generally, algorithms will fall into one of four categories:

  1. Horrible solution in assembly, horrible solution in some HLL.
  2. Horrible solution in assembly, elegant solution in some HLL.
  3. Elegant solution in assembly, horrible solution in some HLL.
  4. Elegant solution in assembly, elegant solution in some HLL.

Show me your algorithm and I'll tell you which category I think it belongs in.

How many people know assembly language?

Although the number of people who know assembly language increases daily (faster than programmers are dying off or forgetting assembly language), the number of people who know a given HLL is generally increasing much faster. While this says something bad about assembly language, what it has to do with the question "Will compilers ever produce better code than a human?" is an interesting question in its own right.

Will code optimized for one member of a CPU architecture be optimal for another?

Probably not. This is one big advantage compilers have. If you get a new compiler for a later chip in a CPU family, all you've got to do is recompile your code to take advantage of the new architecture. On the other hand, your hand-written assembly code will need some manual changes to take advantage of architectural changes in the CPU. This fact alone has driven many to condemn writing code in assembly. After all, today's super-fast program may run like a dog on tomorrow's architecture. This argument, however, depends upon two fallacies:

  1. Tomorrow's compilers will also take advantage of these architectural features.
  2. The assembly language program used architectural features on today's chips that cause performance losses on tomorrow's chip.

Historically, compilers for the x86 architecture have lagged architectural advances by one or two generations. For example, about the time the Pentium Pro arrived, we were starting to see true 80486 optimizations in compilers. True, many compilers claim to support "Pentium" optimizations. However, such compilers do very little for real programs. Given past support from compiler vendors, coupled with the fact that the trend is to handle really tedious (e.g., instruction scheduling) optimizations directly in the hardware, I personally feel that worrying about a specific member of a CPU family will become a moot point.

Those claiming that hand-written assembly language is inferior because the next member of a CPU family will render the code obsolete are missing the whole point of assembly optimization. Except in extreme cases, assembly language programmers rarely optimize at the level of counting cycles or scheduling instructions (as the pro-compiler crowd point out, this is really too tedious a task for human beings). Assembly language programmers typically achieve their performance gains by using "medium-level" optimizations that are CPU-family dependent, but usually independent of the specific CPU. This is such an important concept that I will devote a completely separate essay in this series to this subject.

More to come...

Further essays in this series will address the question "Is there a true need to use assembly language?" The claim that "compilers can generate code that is just as good as humans" is one (albeit incorrect) negative answer to this question. In the following essays I will attempt to answer this question in the positive sense.

Part II: Economic concerns

This particular essay regurgitates some material from my first essay, except it applies an "economic spin" to many of those principles.

Economic concerns

Okay, it's time to back off from what is possible and start talking about things that are realistic. In particular, if one is willing to accept the fact that compilers will never produce code that is better than (or even equal to) what humans produce, can they produce code that is good enough? In other words, is it cost effective to use assembly language these days?

Quite frankly, there are only a few reasons for using assembly language in a modern program:

  1. System resources (e.g., memory and CPU cycles) are precious.
  2. You have been ordered to use assembly language (e.g., you are taking a course on assembly language).
  3. Assembly language provides the cleanest solution for a given problem.
  4. You want to use assembly language because it's fun.

Some might argue that we've long since passed the point where case 1. above applies. After all, memory is cheap and CPU cycles are cheap. Why worry about them? Well, this is an example of having those "UNIX blinders" firmly in place. There are lots of computer systems where memory (both RAM and ROM) is very tight. Look at any microcontroller chip, for example. Likewise, these microcontrollers run at 1MHz, 2MHz and similar speeds (some use clock dividers, so there may be 12MHz or better going into the chip, but the fastest instruction may require six clock cycles). A typical microcontroller has 128 bytes of RAM (that's bytes, not KBytes or MBytes) and executes instructions slower than 1 MIPS. Yes, you can buy better parts; however, if you're planning on building a million versions of some Barbie doll, the difference between a $0.50 microcontroller and a $5.00 microcontroller can make the difference between the success and failure of your product (since the price of the components affects the retail price of the doll).

Well, nothing can be done one way or another about point 2. above.

Some people may find point 3. hard to believe. They firmly believe that assembly language is always harder to read and harder to use than a HLL. However, there is a (small) class of problems for which assembly language is much better suited than a HLL. For example, try rotating bits in a character sometime. This is much easier to accomplish in assembly language (assuming the presence of a common rotate instruction) than in a language like C.
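
For instance, rotating an 8-bit value left in portable C takes a pair of shifts, a mask, and an OR (the sketch below assumes 8-bit characters), while on the 80x86 the whole operation is a single rol instruction:

/* Rotate an 8-bit value left by n positions using only shifts and OR.
   On the 80x86 this entire function collapses to one "rol" instruction. */
unsigned char rotate_left(unsigned char value, unsigned n)
{
    n &= 7;                     /* keep the rotate count in the range 0..7 */
    if (n == 0)
        return value;
    return (unsigned char)((value << n) | (value >> (8 - n)));
}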

Perhaps point 4. is a sign of mental illness. However, I have found certain projects in assembly language to be much more enjoyable than the same project in a HLL. This is certainly not to imply that assembly language is always more fun to use. Quite frankly, I find languages like Snobol4 and Delphi a lot more fun than assembly on many projects; however, assembly language is the more interesting language to use on several projects (ask me about my implementations of TIC-TAC-TOE sometime).

The important point to note here is that I am not claiming that assembly language is always the appropriate language to use. Such an argument is obviously flawed for a large variety of reasons. I simply want to point out that there are some very real reasons for deciding to use assembly language within some projects.

Now to the question at hand: "does it make economic sense to use assembly language in a project?" Mostly, the answer is no. Unless your project falls into one of the first three categories above, assembly language is probably the wrong choice. Let's discuss some of the reasons:

Assembly language is not portable. Absolutely true as I've stated earlier. You are not going to (easily) move assembly code from one CPU to another. Indeed, it's difficult moving assembly code from one O/S to another even when using the same CPU. If your code needs to be portable across operating systems or CPUs, assembly language should be avoided for all but case 1. above. For example, I recently needed to write a function that rotated bits in a character variable. This was part of an encryption algorithm I used to transmit data between a PC client and a database server. On the PC side (where most of the code is written in Delphi) I chose to write the code in assembly because it was easier to use. On the server side, I chose to write the code in C because it needed to move between various CPUs and operating systems. Sure, the C code was harder to write and harder to understand (assuming you know some 80x86 assembly language), but portability was more important. On the client side, however, I used assembly because it was easier to express the algorithm in assembly language. Portability was not a concern because the program was written using Delphi (available only for Windows) and the program adhered to the Windows user interface (also limiting execution to a Windows machine).

Counterpoint: Portability is not easily achieved in high level languages either. While compilers have made it easy to switch CPUs, try changing operating systems sometime. This will break all but the "dumbest" of programs. Software engineering studies indicate that only about 30% of a programmer's time is spent coding. If assembly language represents 10% of your code, you will need about 3% of the total project time to port the assembly code to a different processor by completely rewriting it. While a truly portable program is an awe-inspiring achievement, so is a program that runs twice (or better) as fast. Portability is almost always achieved at the expense of performance. Many programmers feel that their code runs fast enough and they don't need to sacrifice portability. However, programmers often use state-of-the-art machines whereas their users typically deploy the software on low-end machines. Fast enough on the programmer's machine is usually quite slow on the end user's machine. Is portability worth this? A hard question to answer. On the one hand (giving up portability for performance via assembly) you will need to expend effort for each new machine that comes along and you will need to maintain, separately, that portion of the code that is different on each machine. On the other hand, your product may wind up running twice as fast (by only rewriting 10% of the code).

Assembly language code is hard (expensive) to maintain. Why limit this to assembly language? Code is expensive and hard to maintain. Is assembly language considerably more expensive? Not in my experience. True, I've seen a lot of unreadable assembly language code in my time (since I teach assembly language and machine organization at UCR), but I've seen an equal amount of poorly written C/C++ code that was just as hard to read. True, you could claim that my assembly language code will be a disaster when someone who doesn't know assembly tries to maintain my code. My only response is "Have you seen what happens when you give a COBOL programmer some C/C++ code to maintain?" I'm not going to argue that assembly language code isn't more expensive to maintain than C/C++ code; experience bears out the fact that it is. On the other hand, the situation is not as bad as some people make it out to be. The biggest problem I've seen is that programmers are less disciplined with their assembly source code than they are with their HLL source code, when, in fact, they need to be more careful.

It takes a lot of effort to write code in assembly language. I have explored this statement over the years to try to find out why people think assembly language is difficult to use. I believe the people making it fall into two camps: those who don't know assembly language very well and those who've only looked at highly optimized assembly language. The first group is easy to dismiss. It takes a lot of effort to write Pascal code if all you know is BASIC. It takes a lot of effort to write C++ code if all you know is Pascal. It takes a lot of effort to write Ada code if all you know is, well, pick just about any language :-). No matter what language you're using, you're going to find it takes a lot of effort to use it if you don't know it that well. Think back to how long it took you to really learn C++ (or some other HLL). Was that effortless? It probably took you a year or so to get really good at it. What makes you think you can get really good at assembly in less time? Most people's experience with assembly language is limited to a 10 or 15 week course they took in college. If your only exposure to C/C++ was a ten week course and you practiced it about as much as the average person practices assembly language, you'd claim it takes a lot of effort to write C++ as well (by the way, this argument applies to any language; I'm not picking specifically on C++). Programmers who have worked with assembly language on a regular basis for at least a year will tell you that assembly language is not that difficult to use.

It takes a lot of effort to write code in assembly language, Part II. In the second camp are those individuals who have nearly fainted when looking at a sequence of optimized assembly language code. Hand-written, highly optimized assembly language can be very scary indeed. However, the difficulty with this code is the fact that it's optimized, not that it's written in assembly language. Writing optimal code in any language is a difficult process. Take a look at the Berkeley strings package for C/C++ sometime. This is a highly optimized package written specifically for 32-bit CPUs. Hand-written assembly code won't do much better than this stuff (I know, I've tried and my average improvement was about 75%, nothing like the 200–500% I'd normally have gotten). Conversely, if you don't need to write the fastest possible code in the world, writing in assembly language is not that difficult. A big complaint, "assembly language programmers are always re-inventing the wheel," has been silenced since the release of the "UCR Standard Library for 80x86 Assembly Language Programmers" several years ago. With the UCR Standard Library, assembly isn't a whole lot more difficult than working with C/C++ (at least, for someone who knows basic assembly language).

Corollary to the above: It takes a lot of effort to learn assembly language. Absolutely true. It takes a lot of effort to learn any new language. Note that I am not talking about learning a language that is quite similar to one you already know (e.g., learning C after learning Pascal or learning Modula-2 after learning Ada). By that argument, assembly is easy to learn. Once you know 80x86 assembly, picking up 68000 or PowerPC assembly isn't all that difficult. If you only know a few imperative languages, try learning (really learning) Prolog sometime. I personally found Prolog much more difficult to learn than assembly language.

People are more productive in HLLs than they are in assembly. This is generally true. Therefore, unless there are other problems (e.g., system resources) dictating the use of assembly language, it is unwise to start doing everything in assembly from the start. Even if you know that the final product will be written completely in assembly language (e.g., an embedded system), it often makes sense to first create a prototype of the system in a HLL.

Counterpoint: Often, a programmer who doesn't know assembly language will struggle with a function (or other program unit) that is simply too slow and wind up taking more time to optimize that module in the HLL than it would have taken to write the code optimally in assembly. The end result is that, had the programmer used assembly upon discovering the performance problem, they would have been more productive. Of course, it's very difficult to quantify how often this situation occurs.

All of these issues lead to one inescapable conclusion: it costs more to develop code in assembly language than it does in a language like C++. Likewise, it costs more to maintain an assembly language program (probably in the same proportion to the development costs). Another conclusion one will probably reach is that it will take you longer to develop the code using assembly language (hence the greater cost). On the other hand, assuming you're using competent programmers, you will probably get a better product. It will be faster, use fewer machine resources, and will probably contain less "gold-plating" that often results in additional support problems.

A common argument against the above statement is "your users will have to deal with more bugs because the code was written in assembly language." However, the extra testing and debugging necessary has already been factored into the statement "it takes longer and costs more to use assembly language." Hence the quality argument is a moot point.

Does this mean that assembly is suitable for any project where the code might use too many system resources? Of course not. Once again, if portability is a primary concern, your development costs will increase by about 40% of the original cost for each platform to which you port your 100% assembly application. However, if you, like 80% of the world's software developers, write your code specifically for an Intel machine running Windows, the judicious use of assembly language at certain points is easily justified.

The "programmer's ego"-centric view of software economics

In the 50s and 60s computer resources were very expensive. Computers literally cost hundreds of dollars per hour to operate. At the time, a programmer typically made around $17,000/year (by the way, if that seems really low, keep in mind that it's probably equivalent to about a $35,000 annual salary in today's dollars). If a programmer got a program working in a month and then spent a second month working on it to double the speed, clearly such effort paid for itself in a small period of time. I was an undergraduate in the middle 70s, just at the end of this phase. I certainly remember the importance instructors and graders placed on writing optimal code.

In the middle to late 70s, however, all this began to change. The advent of the microcomputer all but totally eliminated this concept of charging for CPU time by the hour. For the cost of one hour's CPU time in 1976, I can now purchase a CPU that is about 5–10 times faster than that old IBM 360/50 that was charging me $150/CPU hour to use. All of a sudden, managers discovered that the programmer's time was far more valuable than the computer's time. Today, if a programmer spends an extra month doubling the speed of his/her code, the bean counters in the front office get very upset because the code cost twice as much to write. This led to a complete rethinking of software economics, a school of thought that persists today.

Of course, programmers have done nothing to promote this view; NOT! Software engineers have been "programmed" with the concept that they are invaluable resources in the organization. Their time is valuable. They need support to ensure they are as productive as possible. And so on... So the programmer who would have written a fairly good program and then spent twice as long making it run twice as fast gets criticized for the effort. Soon, programmers who write good, solid code, using decent algorithms, discover that their peers who write sloppy "quick and dirty" code but get it done in half the time, are getting all the accolades and pats on the back. Before long, everybody is in a mode where they are seeking the fastest solution, which is generally the first one that comes to mind. The first solution that comes to mind is generally sub-optimal and rather low quality. In particular, quick and dirty solutions typically require an excess of machine resources. Look no further than today's bloated applications to see the end result of this.

The salvation of the quick and dirty school of programming has been the computer architects. By continuously providing us with chips that run twice as fast every 18 months (Moore's law) and halving the cost of memory and CPUs in about that same time frame, users haven't really noticed how bad software design has gotten. If CPUs double in speed every 18 months and it takes about two years to complete a major application (written so quickly it runs at half the speed of a preceding application), the software engineers are still ahead because CPUs are running better than twice as fast. The users don't really care because they are getting slightly better performance than the previous generation software and it didn't cost them any more. Therefore, there is very little motivation for software engineers to change their design practices. Any thought of optimization (in any language) is a distant memory.

Economically, the cost of the machine is insignificant compared to the software engineer's time. By a similar token, it is possible to show that, in many cases, the user's time is more valuable than the software engineer's. "How is this?" you might ask, "software engineers make $50.00/hour while lowly users make $5.00/hour." For a commercial software product, however, there are probably 10,000 potential users for each software engineer. If a software engineer spends an extra three months ($25,000) optimizing code that winds up saving an average user only one minute per day, those extra three months of development will pay for themselves (world-wide, economically) in only one month. Every month after that, your set of users will save an additional $25,000.
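
Working through the figures just quoted (10,000 users, $5.00 per hour of user time, one minute saved per user per day, and roughly $25,000 for the three extra engineer-months), a quick back-of-the-envelope calculation confirms the one-month payback:

#include <stdio.h>

int main(void)
{
    double users         = 10000.0;   /* potential users of the product    */
    double user_rate     = 5.0;       /* dollars per hour of user time     */
    double minutes_saved = 1.0;       /* minutes saved per user, per day   */
    double extra_cost    = 25000.0;   /* three extra engineer-months       */

    double saved_per_day = users * minutes_saved * (user_rate / 60.0);
    printf("value of user time saved per day: $%.2f\n", saved_per_day);
    printf("days until the optimization pays for itself: %.0f\n",
           extra_cost / saved_per_day);
    return 0;
}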

Of course, this claim assumes that they use that extra minute per day to be especially productive (rather than visiting the water cooler with the minute they saved), but overall, any big gain you make in the performance of a program translates into really big gains world-wide if you have enough users. Therefore, software engineers need to consider the user's time as the most valuable resource associated with a project. Programmer time should be considered secondary to this. Machine time is still irrelevant and will continue to be irrelevant.

What does this have to do with assembly language? Well, if you've optimized your program as best as you can in a HLL, assembly is one option you may employ to squeeze even more performance out of your program. If you know assembly language well enough (and it really does take an expert to consistently beat a compiler) the extra time you spend coding your application (or part of it) in assembly pays off big time when you consider your user's time as well.

Of course, if you don't anticipate a large number of users (take note, UNIX users :-)) the engineer's time remains the most valuable commodity.

The 90/10 (or 80/20) rule and other software engineering myths

A common argument for the worthlessness of assembly language is "software spends 90% of its time in 10% of the code. Therefore it doesn't make sense to write an application in assembly language since 90% of your effort would be wasted."

Okay, what about that other 10%? The problem, you see, is that software engineers often use this excuse as a way of avoiding optimization altogether. Yet this rule definitely claims that at least 10% of your code is in dire need of optimization.

Hypothesis: 90% of the execution time occurs in 10% of your code.

Myth #1a: You only need to rework 10% of your code

The 90/10 rule, especially the way software engineers throw it around in a cavalier manner, suggests that you can easily locate this 10% of your program as though it were a tumor and surgically remove it, thereby speeding up the rest of your program. Gee, now that didn't take too much effort, right?

The problem with this view is that the 10% of your code that takes most of the execution time is not generally found all in one spot. You'll find 2% here, 3% there, 1% over in the corner, two-thirds of a percent somewhere else, maybe another 1% in the database routines. If you think you can dramatically speed up your code by surgically replacing 10% of your code with some assembly language, boy do you have another think coming. Unfortunately, some 1% segment that is slow is often directly connected to another 1–2% that isn't slow. You'll wind up converting that connecting code as well. So to replace that 1%, you wind up replacing 3% of your code. If you're very careful, you'll probably find you wind up replacing about 25% of your program just to get at that 10% that was really slow.

Myth #1b: You only need to rework 10% of your code

Okay, let's assume you manage to locate that 10% of your code and you optimize it so that it is five times faster than before (a good assembly language programmer can often achieve this). Since that 10% of your code used to take 90% of the execution time and you've sped it up by a factor of five, it now consumes only 18% of the original execution time. The whole program therefore runs in about 28% of its original time, and the other 90% of the code, which nobody touched, now accounts for more than a third of that remaining time. This suggests that you could still squeeze a good deal more speed out of your program (after the first optimization pass) by attacking the remainder of the program. Funny thing about mathematics—numerically, the faster you make your program run, the easier it is to double or triple its speed. For example, if I was able to speed up my program by a factor of nearly four by attacking that 10% of the code responsible for 90% of the time, and the new execution profile again concentrates, say, 80% of the (now much shorter) running time in some small portion of the code, I can easily double the speed of my program again by attacking that portion.
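
As a quick check of that arithmetic (using the 90/10 split and the five-fold speedup assumed above):

#include <stdio.h>

int main(void)
{
    double hot_share = 0.90;   /* fraction of run time spent in the hot 10% of the code */
    double speedup   = 5.0;    /* how much faster that hot code becomes                 */

    double new_time = (1.0 - hot_share) + hot_share / speedup;
    printf("new run time: %.2f of the original\n", new_time);            /* 0.28  */
    printf("overall speedup: %.2fx\n", 1.0 / new_time);                  /* ~3.6x */
    printf("share of the new run time spent outside the optimized code: %.0f%%\n",
           100.0 * (1.0 - hot_share) / new_time);                        /* ~36%  */
    return 0;
}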

Of course, the astute reader will point out that the 90/10 rule probably doesn't apply repeatedly and that as you optimize your code it gets more difficult to keep optimizing it, but the point I'm trying to make here is that the 90/10 (or 80/20) rule suggests more than it really delivers. In particular, it is a poor defense against using assembly language (indeed, on the surface it suggests that you should write about 10–20% of your code in assembly since that's the portion of your program that will be time critical).

Is the 90/10 rule a good argument against writing entire programs in assembly language? That is a point that could be debated forever. However, I've mentioned that you'll probably wind up converting 20–25% of your code in order to optimize that 10%. The amount of effort you put into finding that 10%, plus the effort of writing the entire program in a HLL to begin with, could come fairly close to "paying" the cost of writing the code in assembly in the first place.

Myth #1c: You only need to rework 10% of your code

The 90/10 rule generally applies to a single execution of a program by a single user. Put two separate users to work on the same program, especially a complex program like Microsoft Excel (for example) and you can watch the 90/10 rule fall completely apart. Those two users could wind up using completely different feature sets in MS-Excel resulting in the execution of totally different sections of code in the program. While, for either user, the 90/10 rule may apply, it could also be the case that these two users spend most of their time executing different portions of the program. Hence, were you to locate the 10% for one user and optimize it, that optimization might not do much for the second.

The "rule of fifths" (the 20/20 rule)

A big problem with the 90/10 rule is that it is too general. So general, in fact, that many common programs don't live up to this rule. Often you will see people refer to the "80/20 rule" in an attempt to generalize the argument even more. Since it is difficult to apply the 90/10 (or 80/20) rule directly to an arbitrary program, I have invented what I call the "rule of fifths," which is a more general statement concerning where a program spends its time.

The "rule of fifths" (I refer to it as the 20/20 rule) says that programs can be divided into five (not necessarily equal) pieces:

  1. Code that executes frequently,
  2. Code that seldom executes,
  3. Busy-wait loops that do not contribute to a computational solution (e.g., waiting for a keypress),
  4. Code that executes once (initialization code), and
  5. Code that never executes (e.g., error recovery, sections of code inserted for defensive programming purposes, and dead code).

Obviously, when attempting to optimize a program, you want to concentrate on those sections of code that fall into category 1. above. The problem, as with the 90/10 rule, is to determine which code belongs in category 1.

One big difference between the "rule of fifths" and the 90/10 rule is that the "rule of fifths" explicitly acknowledges that the division of the code is a dynamic entity. That is, the code that executes frequently may change from one execution to another. This is especially apparent when two different users run a program. For example, user "A" might use a certain set of features that user "B" never uses and vice-versa. For user "A" of the program, the 10% of the code that requires 90% of the execution time may be different than for user "B". The "rule of fifths" acknowledges this possibility by noting that some code swaps places in categories 1. and 2. above depending upon the environment and the user.

The "rule of fifths" isn't particularly good about telling you which statements require optimization. However, it does point out that three components of a typical program (cases 3.–5 above) should never require optimization. Code falling into category 2. above shouldn't require optimization, but because code moves between categories 1. and 2. one can never be sure what is truly in category 1. vs. category 2. This is another reason why you will wind up having to optimize more than 10% of your code, despite what the 90/10 rule has to say.

Assembly language isn't intrinsically hard to write

As I've mentioned earlier, assembly language isn't particularly hard to write—optimized assembly language is hard to write (indeed, optimized code in any language is hard to write). I was once involved in a discussion about this topic on the Internet and a hard-core C programmer used the following statement to claim people shouldn't try to write optimized code in assembly language:

You'll spend a week writing your optimized function in assembly language whereas it will take me a day to do the same thing in C. Then I can spend the rest of the week figuring out a better way to do it in C so that my new algorithm will run faster than your assembly implementation.

Hey folks, guess what? This programmer discovered why it takes a week to write an assembly language function; he just didn't realize it. You see, it would take me about a day to implement the same function in assembly as it took him to write that code in C. The reason it takes a week to get the optimized assembly version is because I'd probably rewrite the code about 10 times over that week, constantly trying to find a better way of doing the task. Because assembly language is more expressive than C, I stand a much better chance of finding a faster solution than does a C programmer.

What I'm trying to point out here is that people perceive assembly language as being a hard language to develop in because they've never really considered writing anything but optimized code in assembly. If optimized code is not what you're after, assembly language is fairly easy to use. Consider the following MASM 6.1 compatible code sequence:

var
        integer i, k[20], *m
        float   f
        boolean again
endvar
DoInput:
        try
        mov     again, false
        geti    i
        except  $Conversion
        printf  "Conversion error, please re-enter\n"
        mov     Again, true
        endtry

        cmp     Again, true
        je      DoInput
        printf  "You entered the value %d\n", i
         .
         .
         .
        cout    "Please enter an integer and a real value:"
        cin     i, f
        cout    "You entered the integer value ", i, \
                " and the float value ", f
         .
         .
         .

A C++ programmer, even one who isn't familiar with 80x86 assembly language, should find this code sequence somewhat readable. Perhaps you're thinking "this is unlike any assembly language I've ever looked at; this isn't real assembly language." You'd be wrong. As I said, this code will assemble with the Microsoft Macro Assembler (v6.1 and later). You can run the resulting program under DOS or in a console window under Windows 3.1, 95 or NT. The secret, of course, is to make sure you're using version two (or later) of the "UCR Standard Library for 80x86 Assembly Language Programmers."

"Well that's not fair" you'd probably say. This is the UCR Standard Library, not true assembly language. My only reply is "Okay, then, write a comparable C/C++ program without using any library routines that you didn't write yourself. Then come back and tell me how easy it is to program in C or C++."

The point here is that assembly language can be very easy to write if:

  1. you have a library of routines and macros (such as the UCR Standard Library) that raises the apparent level of the language, and
  2. you do not insist on writing highly optimized code.

I do want to make an important point here—you cannot write "a C program that uses MOV instructions" and expect it to be fast. In other words, if you use the C programming paradigm to write your assembly language code, you're going to find that any good C compiler will probably produce better code than you've written yourself. I never said writing good assembly code was easy; I simply said that it is possible to make assembly code easy to write.

Does this mean that you need to write "optimized assembly or assembly not at all?" Heavens no! The 90/10 rule applies to assembly language programs just as well as it does to HLL programs. Some distinct portion of your code does not have to be all that fast (90% according to the 90/10 rule). So you can write a large part of your application in the "easy to write but slow form" and leave the optimization for the hard sections of the program. Why not just write those rarely used portions in a HLL and forget about using assembly? Sometimes that's a good solution. Other times, the interface between the assembly and HLL gets to be complex enough that it introduces inefficiencies and bugs so it's more trouble than it's worth.

Amortization

For those not familiar with the term, amortization means dividing the cost of something across some number of units (time, objects produced, etc.). With respect to software development, it is important to amortize the cost of development against the number of units shipped.

For example, if you ship only one copy of a program (e.g., a custom application), you must amortize the cost of development across that single shipment (that is, you had better be collecting more money for the application than it cost you to develop it). On the other hand, if you intend to ship tens of thousands of units, the cost of each package need only cover a small part of the total development cost. This is a well-known form of amortization. If it were the only form, it would suggest that you should keep your development costs as low as possible to maximize your profits. As you can probably imagine, those denying the usefulness of assembly language often use this form of amortization to bolster their argument.

There are other effects amortization has on the profits one might receive on a software project. For example, suppose you produce a better product (faster, smaller, more features, whatever). Presumably, more people will buy it because it is better. You must amortize those extra sales across the extra expense of developing the software to determine if the effort is worth it. Writing code in assembly language can produce a faster product and/or a smaller product. In some cases you can easily justify the use of assembly language because the profit of the extra sales covers the extra cost of development and maintenance. Unfortunately for those who would like to use assembly language for everything, it is very difficult to predict if you will, indeed, have better sales because you used assembly language to develop the program. Many people claim this to be the case, I remain skeptical.

One area where the use of assembly language is justifiable is if you can trade off development costs (basically a fixed, one-time cost known as an NRE, or Non-Recurring Engineering, fee) for recurring costs. For example, if you develop a program for an embedded system and discover you can fit the code into 4K of ROM rather than 8K of ROM, the microcontroller you must purchase to implement the product will cost less. Likewise, if you write a program that is twice as fast as it otherwise would be, you can use a processor that runs at one-half the clock frequency; this also results in a lower-cost product. Note that these savings are multiplied by the number of units you ship. For high-volume applications, the savings can be very significant and can easily pay for the extra effort required to completely write a program in assembly language.
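
To make the trade-off concrete, here is a minimal back-of-the-envelope sketch in C. The dollar figures and volume are invented purely for illustration, not taken from any real product:

#include <stdio.h>

/* Hypothetical numbers only: suppose the assembly rewrite costs an extra
   $20,000 of engineering time (the NRE), but lets you buy a part that is
   $0.75 cheaper per unit. */
int main(void)
{
    double extra_nre       = 20000.0;   /* assumed one-time development cost */
    double saving_per_unit = 0.75;      /* assumed per-unit hardware saving  */
    double units           = 100000.0;  /* assumed production volume         */

    printf("Break-even volume: %.0f units\n", extra_nre / saving_per_unit);
    printf("Net saving at %.0f units: $%.2f\n",
           units, units * saving_per_unit - extra_nre);
    return 0;
}

At the assumed volume the recurring saving dwarfs the one-time cost; at a few thousand units it would not, which is exactly the amortization argument at work.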

Of course, if you do not intend to ship a large quantity of product, the development cost will be the primary expense. On the other hand, if you intend to ship a large quantity of product, the per-unit cost will quickly dwarf the development costs (i.e., amortize them away).

Part III: Arithmetic

I was an undergraduate at the tail end of the "efficiency is everything" phase of software engineering. Despite the premium placed on efficiency, the common thought was "numerically intensive (floating point) applications are not worth writing in assembly language because the floating point arithmetic is so slow." Sort of the precursor of the 90/10 rule, I guess.

Since those days floating point hardware has become ubiquitous and on many machines (e.g., Pentium and later for the x86 architecture) the performance of floating point arithmetic rivals that of integer arithmetic. Hence, avoiding numeric computation in assembly language simply because it is so slow to begin with is no longer a valid excuse.

A friend of mine, Anthony Tribelli, has done quite a bit of work with three-D (i.e., floating point intensive) calculations. He recently switched from a software based fixed point scheme to use the floating point hardware built into the Pentium processor and achieved much better results. Tony's applications need to be fast and, although the 3-D matrix transformation (i.e., matrix multiplication) code was not the biggest bottleneck in his code, he has spent considerable time speeding up this portion of his code as an academic exercise. In particular, he has tried just about every commercially available x86 C compiler out there in order to determine which would produce the fastest code for his application (MSVC++ wins, by the way; much to the disappointment of many GNU fans). Although Tony's assembly language skills were somewhat rusty, he grabbed the Pentium programmer's manual and searched the Internet for Pentium optimization tricks. He hand coded his program to within two cycles of the theoretical maximum speed (i.e., Intel's published cycle counts). The resulting code was significantly faster than that produced by any of the C compilers. This example disproves the adage that you shouldn't bother rewriting numeric intensive applications in assembly language because you won't gain anything.

Despite the anecdote above, this is not an essay about how to speed up your numeric intensive applications by using assembly language. Instead, I would like to concentrate on the fact that assembly language gives you complete control over the code the CPU executes; and sometimes this can make a big difference in the accuracy of your computations.

Consider the following two arithmetic expressions:

x/z + y/z, (x + y)/z

In the mathematical world of real arithmetic, these two expressions always produce the same results for any values of x, y, and z. Simple seventh-grade algebra can easily prove this. Unfortunately, the rules for (infinite precision) real arithmetic do not always apply to (finite precision) floating point arithmetic or to integer arithmetic. Given appropriate values for x, y, and z, the two expressions above can definitely produce different answers. The best way to demonstrate this is to use the integer values x = 2, y = 1, and z = 3. The first expression above produces a zero result given these values, the second expression above produces the value one.

Perhaps you're thinking "gee, that's not fair, everyone knows that integer arithmetic truncates all intermediate values." You're right (about the integer arithmetic, anyway). But keep this fact in mind: all floating point operations suffer from truncation error as well. Therefore, the order of expression evaluation can have a big impact on the accuracy of your computations.
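
Here is a minimal C illustration of both effects. The integer values are the ones from the text; the double values are chosen only to force an overflow, so plenty of other inputs would agree in both forms:

#include <stdio.h>

int main(void)
{
    /* Integer case: 2/3 + 1/3 truncates to 0, but (2 + 1)/3 is 1. */
    int x = 2, y = 1, z = 3;
    printf("%d vs %d\n", x / z + y / z, (x + y) / z);

    /* Floating point case: x + y overflows to infinity before the divide,
       while dividing each term first keeps everything in range. */
    double fx = 1e308, fy = 1e308, fz = 2.0;
    printf("%g vs %g\n", fx / fz + fy / fz, (fx + fy) / fz);
    return 0;
}

The first line prints "0 vs 1"; the second prints "1e+308 vs inf".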

Consider the C programming language. In order to produce high-quality output code, the C language definition allows a compiler writer to take certain liberties with the way it computes the result of an expression. In particular, a compiler can rearrange the order of evaluation in an expression. Consider the following "C" statement:

a = (b + c) / (d + e);

The C language specification doesn't determine whether it first computes (b + c) or (d + e). Indeed, the compiler can rearrange this expression as

a = b / (d + e) + c / (d + e);

if it so chooses. For example, if the program has already computed c/(d + e) and b/(d + e) elsewhere, the compiler might reuse those values and reduce this operation to a single addition. You cannot force a certain order of evaluation using operator precedence. Operator precedence simply doesn't apply in the expression above. Perhaps you're a little more knowledgeable than the average C programmer and you're aware of these things known as "sequence points" in an expression. Sequence points only guarantee that any side effects an expression produces will be completed before crossing a sequence point; they do not guarantee order of evaluation. For example, a ";" (at the end of a statement) is a sequence point. Consider, however, the following C statements:

a = (d == c);
b = (x < y) && (d == c);

Although the "(d == c)" subexpression appears to the right of the "&&" operator (that defines a sequence point), most good C compilers will evaluate the expression "(d == c)" prior to evaluating "(x < y)" because they will use the value already computed in the previous statement. Therefore, you cannot force the compiler to compute "(x < y)" before "(d == c)" by simply breaking the second statement above into the sequence[2]:

temp1 = (x < y);
temp2 = (d == c);
b = temp1 && temp2;

The bottom line is this: you cannot (portably) control the order of evaluation of subexpressions in a C program.

It gets even worse! The values of many expressions in C/C++ are undefined! Consider the following (very famous) example:

i = 4;
A[4] = 2;
i = i + A[i++];
printf("%d\n", i);

A neophyte C programmer might be tempted to claim that this program would produce the output "6". A more seasoned C programmer might claim the answer could be "6" or "7" (one of these answers is probably what you would get). However, the ANSI-C definition claims the result is undefined. That is, it could be anything. Even "126456" is a reasonable result according to the ANSI-C standard.

Using assembly language eliminates all these problems. Since the CPU executes the instructions you supply in the order you supply them, you have precise control over how you compute values[3]. By carefully studying your computations and the values you expect to supply to those computations, you can choose an instruction sequence that will maximize the accuracy of your system. You can also specify, in a non-ambiguous way, computations whose side effects would produce undefined results in C/C++. For example, if you wanted the previous expression to produce six, you'd use code like:

; Assume 32-bit integers.
;
;       i = 4;

        mov     i, 4

;       A[4] = 2;

        mov     a[4 * 4], 2

;       i = i + A[i++];

        mov eax, i
        add eax, a[eax * 4]
        inc i       ; Of course this could go away...
        mov i, eax

;       printf("%d\n", i);

        printf  "%d\n", i

There is another issue you must consider. If you are working on a CPU with a floating point unit (e.g., the 486 or Pentium), most internal (to the FPU) computations use a full 80 bits. Once data leaves the chip it is generally truncated to 64 bits. Therefore, you will get more accurate results if you leave temporary calculations in one of the FPU's eight registers. While good x86 compilers generally do all their computations within a single expression on the FPU, I haven't noticed any that attempt to keep long term variables in the FPU registers. I certainly don't know of any compiler that would do this (across statements) as a means of maintaining the precision of an important variable; that simply requires too much intelligence.
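
The effect is easy to see even from C, provided your compiler maps long double onto the FPU's 80-bit format (many x86 compilers do; where long double is just double, the two sums below will simply match). This is only an analogy for what hand-written assembly can do by keeping a value on the FPU stack across statements:

#include <stdio.h>

int main(void)
{
    double      d_sum = 0.0;    /* rounded to 64 bits after every addition   */
    long double e_sum = 0.0L;   /* kept in extended precision, if available  */
    int i;

    for (i = 0; i < 10000000; ++i) {
        d_sum += 0.1;
        e_sum += 0.1;
    }
    /* The extended-precision total ends up much closer to 1,000,000. */
    printf("double sum:      %.15f\n", d_sum);
    printf("long double sum: %.15Lf\n", e_sum);
    return 0;
}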

While this essay will not attempt to explain how to maximize the accuracy of your computations (that is well beyond the scope of this essay), hopefully you can see that assembly language's absolute control over the execution sequence provides some important benefits in those rare cases where "order of evaluation" can affect the outcome of your computations.

Part IV: Fast enough isn't

In this essay, I would like to spend some time talking about the speed of a program. This essay is a plea for higher performance programs, not necessarily a plea that programmers write their programs in assembly language. It is possible to write slow programs in assembly language, and it is usually possible to write faster programs in a HLL. Of course, people generally associate the use of assembly language with high performance software, hence the inclusion of this essay in this series. I will discuss some of the reasons (excuses) programmers give for not writing fast programs and then I will discuss why performance is still an issue today and will continue to be an issue in the future even with machines 1,000 times faster than those we have today.

I was an undergraduate at UC Riverside in the middle 70s, just at the end of the "efficiency is everything" period of software engineering. In the late 60s and early 70s, it was still common to find large application programs written in assembly language because the cost of running software far exceeded the cost of writing the software. Since then, Moore's law has been in full swing. Machines have doubled in speed every three or so years and the prices have dropped dramatically. Since the 70s, of course, the cost of developing software has overtaken and far exceeded the cost of running the software on a given machine[4].

This deemphasis on efficiency has produced an obvious side effect—since schools no longer teach their students to write efficient code, the students never get any optimization experience. Since they never get this experience, they are completely unable to optimize code when the need arises. Human nature is to ignore what you do not understand. Hence most programmers make excuses for why a program cannot be or should not be optimized: future machines and compilers will be fast enough, optimization costs too much and delays the product, the program already feels fast enough, and so on.

I'm not going to bother addressing all the excuses above on a point by point basis. Most excuses are exactly that—an excuse trying to cover up the programmer's own inadequacies. Some of them, however, are worth a few comments.

Future technology. Someday computers will be fast enough (and compilers will be producing fast enough code) that today's dog software will run at a respectable rate. For example, today's computers are typically 1,000 times faster than computers that were available 20 years ago. Programs that were too slow to run on those machines run just fine today (e.g., 3-D graphics and multimedia applications). If your program runs at about half the speed it should, just wait three years and computers will be fast enough (and compilers will be generating faster code) so your application will perform in a satisfactory manner.

To understand what's wrong with this picture, just take a look at your own personal machine. If it's a relatively state of the art machine, figure out how much three and four year old software you have running on it. Probably very little. You're probably running the latest version (or nearly the latest version) of every program you commonly use. Programmers who feel that all they have to do is wait a few years for hardware technology to catch up with their software forget that three years down the road they will be writing software that requires the machines to be faster still. That software will probably require two or three processors to run reasonably well. The end result is that the end users inevitably wind up running the latest version of the software quite a bit slower than it really should be running. Since most software is purchased by new machine owners, those buying the software rarely have the opportunity to "downgrade" their programs to an earlier version (since they don't own the earlier version).

Optimization is too expensive. You will often hear programmers using phrases like "market window" and "time to market" as reasons for avoiding an optimization phase in their software. While these are all valid concerns, these same programmers think nothing of spending additional time to add new features to a product even though these new features increase the development cost, lose market opportunity, and increase the time to market. A programmer who eschews performance, something every user can appreciate, for an obscure feature that almost no one will ever use (but looks good on a product comparison matrix) is really fooling themselves.

Perceived vs. actual speed. From operating systems theory we learn that there are several different ways to measure the performance of a software system. Throughput is, essentially, a measurement of the amount of calculation a software system achieves within a given time interval[5]. Response time is a measure of how long a program takes to produce a result once the user supplies the necessary input(s) to a computation. Overhead is the amount of time the system requires to support a computation that is not directly related to producing the result of a computation.

Overall throughput is an important measure. It describes the amount of time a user will take while running a program to produce some desired result[6]. If you increase throughput, you will increase users' productivity since they can finish using the program sooner and begin working on other tasks.

Response time and throughput, interestingly enough, are often at odds with one another. Programming techniques that improve response time often reduce throughput and vice versa. However, poor response time gives the perception that a program is running slow, regardless of the actual throughput. In most cases, response time is actually more important than actual throughput. The actual speed of an interactive program is less important than the user's perception of its speed. A lot of research into human response time indicates that users perceive quantum differences in performance rather than incremental improvements. Generally, users perceive response times as falling into a handful of categories: instantaneous, fast, delayed or sluggish (a few seconds), longer than their attention span (around ten seconds), long enough to switch to another task (ten seconds to a minute), and long enough that they forget about the computation entirely.

Instantaneous response time is what every application should shoot for. As soon as the user hits the enter key or otherwise indicates that a computation may now take place, the program should be back with the result. Even fast response time, although noticeable, goes largely ignored by a typical user. However, once the response time of a program heads into the delayed or sluggish area, users tend to get annoyed with the software. This creates a distraction that affects them psychologically and results in a drop in productivity greater than the reduced throughput alone would suggest.

Once a program's response time exceeds a few seconds and approaches 10 seconds or so, a very bad thing happens: the response time exceeds the user's attention span and the user loses his or her train of thought. Once the answer finally does appear, the user has to remember what they were doing resulting in even less productivity.

Somewhere between 10 seconds and a minute, the user starts looking for a completely different task to work on. Once the user is involved with another task, the information provided by the current computation may go unused for some time period while the user wraps up the other task.

The last phase associated with response time is loss of memory—users simply forget that they were working on a given problem and, being involved in something else, may never think to look back to respond to the information provided. Let me give a real good example of this problem. I started a backup on my Win 95 machine. The backup takes about one to two hours. So I started working on this essay in the meantime. As I type this sentence, the backup has long since completed, but I'd forgotten about that backup (and the fact that I really should be working on a different problem than this essay on my Win 95 machine) since I'd become involved with this essay.

There are a few important things to note about these response time categories. First, it generally doesn't help/hurt the perceived performance of a program if you change its response time and the new response time still falls within the same category. For example, if your program's response time improves from four down to two seconds, most users won't really notice a big improvement in the system performance. Users do notice a difference when you switch from one category to the next. The second thing to note is that "near future technological advancements" generally do not speed up your software to the point it will switch from one category to the next. That typically requires an order of magnitude improvement; the type of improvement that is possible only with a major algorithmic change or by using assembly language.

If you cannot improve the response time of a program to the point where you switch from one category to the next, you may as well concentrate on improving the throughput of your system. Just make sure that improving the throughput doesn't impact response time to the point it falls into a slower category.

Fast enough isn't. Now for the real point of this essay. A large number of programmers simply feel that their programs are fast enough. Perhaps they've met the minimal performance requirements specified for that program. Perhaps it runs great on their high-end development platforms (which are often two to four times faster than a typical user's machine). Whatever the case, the developer is happy; should s/he waste any time making that program run faster? Well, this essay wouldn't exist if the answer were no. After all, fast enough, isn't.

Consider a typical application. If a software developer has written his or her software so that it runs just fast enough on a given platform, you'd better believe that software was tested on a machine with no other software running at the same time. Now imagine the poor end user running this software on a Macintosh, a Win 95, a Win NT, or a UNIX machine, along with several other programs. Now that program that was fast enough is running dog slow. Look folks, a simple fact of life is that you can no longer assume your software has the machine all to itself. Those days died with MS-DOS.

On the other hand, if you make your software run twice as fast as it really needs to, then two such programs can run concurrently on a machine and still run fast enough. Likewise, if your program runs four times faster than it really needs to, four (or more) such programs could run concurrently.

Of course, a typical developer might claim that multiprocessor systems will solve this problem. Want to run more programs? No sweat, just add more processors. There are two problems with this theory. First, you have the future technology problem mentioned above. As users purchase machines that have multiple processors, they will also be purchasing software that winds up using all the power of those multiple processors. Second, there is a limit to the number of processors you can add to a typical system and expect performance to improve.

Of course, one cannot generalize this argument to every piece of software in existence. Some programs do have all the resources of the underlying system to themselves (an embedded system, for example). Nevertheless, for commercial applications one would expect to buy for a personal computer system, it shouldn't be the software developer's job to decide how to waste CPU cycles; that decision belongs to the user.

Part V: Levels of optimization

Over the years, I've seen the quality of compiler output improve steadily. Although I firmly believe that an experienced assembly language programmer who puts his or her mind to it will always be able to beat a compiler, I must still admit that some code sequences produced by certain compilers have amazed me. Indeed, I've actually learned several nifty tricks by studying the assembly language that several compilers produce. So why am I convinced that compilers will never be able to beat a good assembly language programmer? It's simple really, compilers do not optimize the same way humans optimize. Needless to say, human optimization is generally superior.

Despite the progress made in AI over the years, compilers are not intelligent. Indeed, they are maddeningly deterministic. Feed the same inputs to a compiler over and over again and you're going to get exactly the same output. Human beings, on the other hand, are somewhat non-deterministic. Feed a human being the same problem three times and you're likely to get three different solutions. The greatest advantage to using a compiler is that such systems are preprogrammed to handle massive amounts of details without losing track of what is going on. Human beings are incredibly poor at this task and often seek shortcuts. These shortcuts are often divergences that result in a new and unique solution that is better than that produced by a deterministic system[7]. A good programmer, seeking the most efficient solution to a problem, will often solve the problem several ways and then pick the best solution. Although compilers do a little of this, it's at a much lower level where the payoff isn't so great.

It is common to hear the anti-assembly language crowd claim that assembly language is a dead issue because humans cannot possibly schedule instructions, keep track of lots of registers, and handle all the cycle counting chores as well as the compiler; particularly across a large software system. You know what? They're right. And if this was the way assembly language programmers applied optimizations to any but the most frequently executed statements in a small program, compilers would be winning hands down. Humans simply cannot keep track of lots of little details as well as a compiler can. Oh, for short sequences, humans can do much better than a compiler, but across a large program? No way. Fortunately (or unfortunately, depending on how you look at it), this isn't the way assembly language programmers manage to write fast code.

Consider for a moment the lowly compiler. The job of a compiler is to faithfully convert your HLL code into a form the machine can execute. As a general rule, the only optimizations that people find acceptable are those that involve a better translation from the HLL to machine code. Any other changes or transformations to the code are out of the question. For example, suppose you coded an insertion sort in C. Would it be acceptable for your C compiler to recognize the insertion sort and swap a merge sort in its place? Of course not. As a general rule, most people would not allow their compilers to make algorithmic changes to their code. Returning to the example above, perhaps you chose the insertion sort because the data you intend to sort is mostly sorted to begin with. Or, perhaps, you're sorting a small number of items. In either case, the insertion sort may be able to outperform the merge sort implementation even though the reverse is more often true.

Of course, the first argument the uninitiated make when claiming you shouldn't use assembly language to optimize your code is that you should simply choose a better algorithm. There are two problems with this argument: (1) finding a better algorithm isn't always possible or practical, and (2) why can't the assembly language programmer use that better algorithm as well? Indeed, this second point leads into the main point of this essay—while any algorithm you can implement in a HLL like C can also be implemented in assembly language, the reverse is generally not true[8]. There are some algorithms that lend themselves to easy implementation in assembly language that are impossible, or very difficult, to implement in a HLL like C. Often, those algorithms that are easy to express in assembly language are the ones that produce better performance.

It is common knowledge that there are high-level optimizations (e.g., choosing a better algorithm) and low level optimizations (e.g., scheduling instructions). My experience indicates that there are far more than two different levels of optimizations. Indeed, there is probably a continuous spectrum of optimization types rather than a finite (and small) set of optimizations possible. However, for this essay I will assume we can break the world of optimizations into three discrete pieces: high-level optimizations, intermediate levels of optimization, and low level optimizations. One important fact about these levels of optimization is that the higher the level, the greater the promise of improved performance.

A classic example of multiple optimization levels occurs in a program that sorts data. At the highest level of optimization, you must choose between various sorting algorithms. Depending on the typical data you must sort, you might choose from among the bubble, insertion, merge, or quicksort algorithms (hint to those who weren't paying attention in their data structures class, a bubble sort isn't always the slowest algorithm). Your choice of the sorting algorithm will have a very big impact on the overall system performance. For example, if your data set contains just the right values, a 1,000 record database could require 100 times as long to sort if you choose an insertion sort over a merge sort. Let's assume you have the wisdom to choose the best algorithm for your particular application.

As you're probably aware, sorting algorithms that involve comparing data elements require between O(n*log(n)) and O(n²) operations to complete. Asymptotic (e.g., Big-Oh) notation is generally one of those concepts that students tend to study weakly in their college courses; they get out into the real world and forget all the details associated with Big-Oh approximation to running time. In particular, most students forget about the effects of the constant associated with the asymptotic approximation. While polynomial or even exponential performance improvements are possible with better algorithms, most applications in the real world are O(F(n)) where F(n) <= n (sorting being an obvious exception). Few real-world applications use algorithms that are O(n*log(n)), much less O(n²) or O(2ⁿ). Therefore, most real-world speed-ups involve attacking the constant associated with the Big-Oh notation rather than attacking the polynomial function. Remember two algorithms can be O(n) even if one application is always 100 times faster than the other. I don't know of too many programmers who wouldn't rather have their programs run 100 times faster if the cost of implementation wasn't excessive. Attacking the constant falls into the intermediate and low-level optimization classes.

Let's return to the sorting example. A high-level optimization would consist of choosing one algorithm with better performance characteristics than another algorithm under consideration. Very low level optimization consists of choosing an appropriate sequence of machine instructions to implement the sorting algorithm. At this very low level, you must consider the effect of instruction scheduling on the pipeline, the effect of data access on the cache, the number of cycles each instruction requires, which registers you must use to hold live data, etc. As the pro-compiler group properly points out, keeping track of this much global state information for all but the shortest assembly sequences is beyond most humans' capabilities.

In-between those two extremes are certain optimizations (medium level optimizations) that would probably never be done by a compiler but are fairly easy to implement. For example, consider the sorting algorithm discussed earlier. Suppose you are sorting an array of records and the records are rather large. If each record consumes a large number of bytes, it might take a significant amount of time to swap array elements that are out of place. However, if you create an array of pointers (or indexes) into the array of records to sort and use indirection to access the data, you can speed up your sort quite a bit. Such algorithms only produce constant (multiplicative) improvements. But who wouldn't like to see their program run five or ten times faster? Note that these types of optimizations could never be done by a compiler (maybe you really do require the sort to physically swap the array elements), any more than one would expect a compiler to choose a better algorithm.
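
Here is a sketch of that medium-level optimization in C. The record layout is made up, and qsort over an array of pointers stands in for whichever sorting algorithm you actually chose:

#include <stdio.h>
#include <stdlib.h>

/* A deliberately bulky record: swapping two of these directly moves
   400 bytes; swapping two pointers moves only 4 or 8. */
typedef struct {
    int  key;
    char payload[396];
} Record;

static int cmp_by_key(const void *a, const void *b)
{
    const Record *ra = *(const Record * const *)a;
    const Record *rb = *(const Record * const *)b;
    return (ra->key > rb->key) - (ra->key < rb->key);
}

int main(void)
{
    enum { N = 5 };
    static Record recs[N];
    Record *index[N];           /* sort these instead of the records */
    int i;

    for (i = 0; i < N; ++i) {
        recs[i].key = N - i;
        index[i] = &recs[i];
    }

    qsort(index, N, sizeof index[0], cmp_by_key);   /* the records never move */

    for (i = 0; i < N; ++i)
        printf("%d ", index[i]->key);
    printf("\n");
    return 0;
}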

To summarize, the choice of one high-level algorithm over another may produce dramatic differences in execution time. Choosing a different algorithm may change the basic function that describes the execution time of the program (e.g., from exponential down to polynomial, from polynomial down to linear, from linear down to logarithmic, or from logarithmic down to constant). Most other optimizations only affect the constant in the Big-Oh equation. So they do not have the ability to produce as large an improvement.

Medium level and low-level optimizations attack the constant in the Big-Oh equation. Low-level optimizations typically cut this constant in half, maybe even divide it by four. For example, if you perfectly schedule all your instructions in a pipeline or on a superscalar machine, you would eliminate stalls, thereby speeding up your program by the number of stalls present before the optimization (this is generally 1/2 of the time for architectures circa 1996). It would be very rare to find such low level optimizations improving the performance of a program by an order of magnitude.

Medium-level optimizations, on the other hand, might have a larger impact on the performance. Consider, again, the sorting example. If every comparison the sorting algorithm does requires that you swap data, and your records consume 400 bytes each, it will take about 100 times longer to sort the data by actually swapping the records than it would to swap pointers to the data. This is a dramatic difference[9].

Okay, what's this got to do with assembly vs. HLL programming? Although assembly language programmers can use arrays of pointers, so can HLL programmers. What's the big deal? The big deal is simply this: there are many intermediate level optimizations that are possible only in assembly language. Furthermore, many intermediate optimizations, while possible in a HLL, may only be obvious in assembly language.

Let me begin with an optimization that is not really possible in a HLL. A while back, I gave my assembly language students the task of writing a Tic-Tac-Toe program in assembly language. This turned out to be a major programming assignment for them. Many of the students got stuck, so I advised them to first write the program in C++ (to be sure they could solve the problem) and then convert the C++ program, manually, into assembly language. A typical solution using this approach was between 800 and 1,000 lines of 80x86 code. In order to demonstrate that an experienced programmer wouldn't need to write such a large program, I began writing a series of solutions to this problem in assembly language. Now understand one thing, my goal was not to write the fastest or shortest implementation of this program (a lookup table version would probably claim this prize), I was mainly interested in demonstrating that there are many different ways to implement a solution in a given programming language, with some solutions being better than others. One novel solution I created used boolean logic equations to define the moves. Another solution parsed a series of regular expressions that defined the moves. Yet another solution involves the use of a context free grammar. Of course, there was the ubiquitous lookup table version.

During the implementation of the Tic-Tac-Toe solution, I took advantage of the fact that one really only needs to concern oneself with placing an "O" in one of the following squares:

(0, 0), (0, 1) and (1, 1)

To handle any other move, one need only rotate the board 90 degrees and check these three squares on the resulting game board. By repeating this process four times, you can easily check all squares on the board. In a HLL like C++ or Pascal, students typically create an array to hold Xs and Os that appear on the board and then physically rotate the board by moving characters around in the array.

While implementing the boolean equation version of the Tic-Tac-Toe solution, I discovered that I could use two nine-bit values to represent the Tic-Tac-Toe game board. One of the nine-bit arrays would hold the current X moves, one nine-bit array would hold the O moves. By logically ORing these two nine-bit arrays (that I actually stored into a pair of 16-bit words) I was able to determine which squares were occupied and which were empty. By using bits zero through seven in the word to represent the outside edges of the game board and using bit eight to represent the center square, I was able to do the board rotate operation with a single 80x86 machine instruction: ROL. If the AX register contained the TTT array of bits, then the "ROL AL, 2" instruction did the board rotation that I needed. Contrast this with the number of machine instructions a compiler would generate to rotate the board as an array of bytes. Even if you use the bitwise operators in C, it is unlikely a good compiler would recognize this sequence and generate a single instruction from it (I am not willing to say impossible, just unlikely).
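
For readers who want to see the trick outside of MASM, here is a rough C equivalent. It assumes, as described above, that the eight edge squares occupy bits 0 through 7 in rotational order and that the center square is bit 8; the starting position is hypothetical:

#include <stdio.h>

/* Rotate the 9-bit board 90 degrees: the outer ring of eight squares
   advances two positions (mirroring ROL AL, 2) and the center bit stays put. */
static unsigned rotate_board(unsigned board)
{
    unsigned ring = board & 0xFFu;
    ring = ((ring << 2) | (ring >> 6)) & 0xFFu;
    return (board & 0x100u) | ring;
}

int main(void)
{
    unsigned x_moves = 0x107u;  /* hypothetical: center plus three edge squares */
    int i;

    for (i = 0; i < 4; ++i) {
        printf("rotation %d: %03X\n", i, x_moves);
        x_moves = rotate_board(x_moves);
    }
    return 0;
}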

By now, you're probably saying "who cares? Do we really need an efficient version of the TTT game and if we did, why not use a lookup table (that probably would be faster)?" Fair enough. But you see, the point of this discussion is not that I can develop a slick TTT game. The point I'm trying to make is that assembly language programmers have certain options available to them that simply are not available to HLL programmers. Those options represent the main reason it is generally possible to write faster and/or shorter code in assembly language.

Of course, not all intermediate level optimizations are possible only in assembly language. Some of them are possible in certain HLLs as well. The classic example I can think of is the implementation of the Berkeley C Standard Library string package for 32-bit processors. Consider the standard implementation of strcpy:

char *strcpy(char *dest, char *src)
{
    char *Result = dest;

    _while( *src != '\0' ) /* uses the "randy.h" macro pkg */

        *dest = *src;
        ++dest;
        ++src;

    _endwhile
    return Result;
}

(Note that a typical definition merges nearly the entire function into the while loop, e.g., "while(*dest++ = *src++);" It sacrifices readability for no good reason. Almost any modern compiler will optimize the (more readable) version I've given to emit the same code as the typical implementation. If your compiler cannot do this, you have absolutely no business trying to claim compilers emit code as good as a human).

As any advanced assembly language programmer will tell you, the strcpy routine above could actually run four times faster on a 32-bit processor if you were to move 32-bit quantities on each loop iteration rather than one-byte quantities. The resulting code is difficult to write (at least, correctly) and difficult to debug if there are errors in it, but it does run much faster. The Berkeley strings package does exactly this. The resulting code is many times faster than conventional code that only moves a byte at a time. Note that the Berkeley code is even faster than an assembly version that moves a byte at a time. Of course, 80x86 assembly programmers can write faster code (I've improved the Berkeley package on an NS32532 processor, I haven't done this on the 80x86 yet), so this is no proof that C generates code as good as an assembly programmer, but it does demonstrate that some intermediate level optimizations are possible in HLLs, as well.
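
The Berkeley source itself is more involved, but here is a hedged sketch of the underlying idea in C: copy a 32-bit word per iteration until a word contains the terminating zero byte, then finish byte by byte. It assumes both pointers are 4-byte aligned and it glosses over the strict-aliasing and alignment issues a production version must handle:

#include <stdint.h>

/* Not the Berkeley code, just the idea behind it: the expression
   (w - 0x01010101) & ~w & 0x80808080 is nonzero exactly when some byte
   of w is zero, so we can test four characters per iteration. */
char *strcpy32(char *dest, const char *src)
{
    char *result = dest;
    uint32_t *d = (uint32_t *)dest;
    const uint32_t *s = (const uint32_t *)src;

    for (;;) {
        uint32_t w = *s;
        if ((w - 0x01010101u) & ~w & 0x80808080u)
            break;                      /* the NUL is somewhere in this word */
        *d++ = w;
        ++s;
    }
    dest = (char *)d;                   /* copy the tail one byte at a time */
    src  = (const char *)s;
    while ((*dest++ = *src++) != '\0')
        ;
    return result;
}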

Note, however, that the optimizations found in the Berkeley string package are not obvious to a C programmer, but they are quite obvious to an assembly language programmer. Indeed, I suspect the original designer of the Berkeley C strings package was very familiar with assembly language or, at least, computer architecture. Therefore, although some intermediate level optimizations are possible, it is unlikely that most HLL programmers would recognize the opportunity to use them, unless, of course, that HLL programmer was also an accomplished assembly language programmer. On the other hand, many of these intermediate level optimizations are quite obvious to an assembly language programmer, so you will often find them in a well-written assembly language program.

Often, I hear the argument "optimization in assembly language is futile. Today, members of an architectural family are rapidly changing. And the rules for optimization change with each member of the family. An assembly language programmer would have to rewrite his/her code when a new CPU comes along; a HLL programmer, on the other hand, need only recompile his/her code." This statement generally applies to low level optimizations only. True, CPU family members differ widely with respect to pipelining, superscalar operation, the number of cycles for a given instruction, etc. Any assembly language programmer who attempts to optimize his/her program by counting cycles and rearranging instructions is probably going to be disappointed when the next CPU arrives. However, my experience is that very few assembly language programmers write code in this fashion for any program except the most speed sensitive applications. As the HLL people are fond of pointing out, it's just too much work. I will gladly concede that compilers, overall, will pick more optimal instruction sequences than I am willing to generate for a typical assembly language program I write. Yet programs I write for the 486, that outperform compiled programs on the 486, still outperform their counterparts on the Pentium (my assembly code goes unchanged, the compiled code is recompiled for a Pentium)[10]. Why is this? Shouldn't the compiled Pentium code run faster because of the superscalar nature of the code? Yes, it does. Typically between 25% and 50% faster. However, my assembly code was five times faster than the original 486 code, so doubling the speed of the compiled code still produces a program that is much slower than my unchanged assembly code.

Of course, I can see some readers out there doing some arithmetic and claiming "Hey, sooner or later the compiled code will beat you." However, such readers are not in tune with recent architectural advances. The current trend in superscalar/pipelined architectures is "out of order execution". This mechanism will reduce the impact of instruction scheduling on a program. That is, the CPU will automatically handle much of the instruction scheduling concern. Therefore, future compilers that spend considerable effort rearranging instructions will not have a tremendous advantage over code that has not been carefully scheduled (e.g., hand written assembly language). In this case, architectural advancements will work against the compilers.

I would point out that much of the preceding discussion is moot. You see, the arguments the pro-HLL types are advancing assume that these wonderful compilers exist. Outside of laboratories and supercomputer systems, I have yet to see these wonderful compilers. What I have found is that some compilers perform certain optimizations quite well, different compilers perform other optimizations quite well, but no compiler I've personally used combines all these techniques into a single system (of course, certain optimizations are mutually exclusive with respect to other optimizations, so this goal isn't completely achievable). Furthermore, I have yet to see a fantastic compiler that generates good Pentium-optimized code. The current (1996) crop of compilers loses steam beyond the set of possible 486 optimizations. I personally haven't looked at the Pentium Pro and how optimization on that chip differs from the Pentium, but if there are any differences at all, I could easily believe that a Pentium Pro specific compiler will never appear. You see, compilers are examples of programs that are CPU architecture sensitive. While the pro-HLL types are quick to point out that assembly programmers rarely upgrade their software to a new architecture, they seem to forget that compilers suffer from the same problems (even though compilers are not typically written in assembly language). You can talk about how all you've got to do is recompile your code to get "Pentium-optimized" programs until you're blue in the face. But if you don't have a Pentium optimized compiler, you're still generating 486 (or earlier!) code and it isn't going to compete well with assembly code from the same era.

Perhaps you're going to claim "hey, the compiler on my XXXX RISC workstation is really hot. Your assembly code on the x86 is going to run slower than my HLL code." That's probably true, but you're comparing apples to oranges here. After all, one would expect code on a 500MHz Alpha to outperform my assembly code on a 66MHz 486. The real issue here is the market. If my code performs well on a 66MHz 486, I will sell a lot more copies than your program that runs on a 500MHz Alpha. The bottom line is this: if you can afford to limit your market to UNIX-based RISC workstations (note that I am specifically exempting the Macintosh here), go for it. You probably won't need assembly language and whether or not assembly is faster is a moot point. "The rest of us," however, don't have the luxury of developing software on these blazing fast machines. The compilers available to us lowly x86 and PowerPC programmers don't come close to matching what assembly programmers can do (I will leave it up to others to debate whether those RISC workstation compilers are truly better than PC compilers).

To summarize, assembly language programmers do not write faster programs than a HLL compiler produces because they are better at counting cycles and rearranging instructions. Although it is possible for a human to produce a better sequence than a compiler, we're talking about a factor of two or three here. It is hardly worth the effort to write code in this fashion (unless you're really up against the wall with respect to performance). Furthermore, code written in this fashion has all the drawbacks your college professor warned you about: it's hard to read, hard to understand, hard to maintain, and grows obsolete with the next release of the CPU.

Real assembly optimization occurs at the intermediate level. Intermediate level optimizations are often machine independent, are almost always CPU family member independent, and generally produce much better results (in terms of performance gains) than low level optimizations. Furthermore, you rarely have to sacrifice readability, maintainability, or portability (within the CPU family) to achieve good results.

Often, you will hear someone refer to the "assembly language paradigm" or "thinking in assembly vs. thinking in C." These phrases are referring to the types of intermediate level optimizations that assembly language programmers often employ. Because most HLL programmers (especially those who argue against the use of assembly language) do not have sufficient experience with these intermediate optimizations to fully appreciate them, it is very difficult for a diehard assembly language programmer to convince a diehard HLL programmer that there really is something special about assembly language programming. Hopefully, this essay has helped articulate some of the advantages of assembly language programming.


[1] I have seen some posts on the net by a certain individual who spends his days writing games for cartridge based game machines who swears that using anything but assembly language is crazy. This is probably true in his domain. It certainly does not apply to software written for UNIX workstations.

[2] Actually, this might not happen. I'm not 100% sure whether C is required to complete all computations that produce side effects before a sequence point, or only those that would affect the outcome. Clearly, storing a result into a variable is a side-effect of an expression. However, given the fact that most good optimizing C compilers support optimizations like "reaching definitions" and "live/dead variable analysis" I suspect my assertion is correct.

[3] Strictly speaking, this is not true. Newer processors are beginning to support "out of order" execution of instructions. However, one would hope that Intel and other chip makers would ensure that any out of sequence executions would not alter the results one would normally expect in a serially executing program.

[4] It has not necessarily exceeded the cost of using the software. See the essay on economic concerns for more details.

[5] Throughput is actually the inverse of this—the number of tasks completed in a given time interval. This essay will ignore this difference since both views describe the same thing.

[6] I will ignore the amount of time the program spends waiting for user input in this discussion. If the user gets up and takes a coffee break in the middle of using a program, that shouldn't logically affect the throughput at all. Throughput describes what the program is capable of, not what actually happens.

[7] It is worth pointing out that this divergence produces less efficient results as often, or more often, than it produces efficient results.

[8] In a sense it is true since you could write an 80x86 simulator in a HLL like C. However, the obvious performance drawbacks to such an approach limit its feasibility.

[9] By the way, the requirement that we swap all the records in the array is present only so the overhead normally associated with the sorting operation (the non-swapping overhead) doesn't adversely affect the numbers I'm presenting. There is still a significant difference even if you only swap a certain number of elements in your array.

[10] Actually, I have yet to verify this since there are no compilers available to me that have true "Pentium" optimizations. It was certainly the case between the 386 and the 486, however.