Roy K. McDonald 26 July 1993

Software Acceleration

Presented at the Sumeria Technologies & Issues Conference

Hardware gets faster every year. We’ve all come to expect it. And, a huge amount of work is going on right now to ensure that next year the same thing will happen.

Software gets more features. And unfortunately, all too often, the presumption that fast hardware will take up the slack has meant that inelegant software design needlessly eats up performance advances. The irony is that software improvements are often far more dramatic in their impact than hardware improvements. Hardware is the tortoise, advancing relentlessly in tens of percents per year; software is the hare – on occasion it leaps orders of magnitude.

This article reviews what has been done in software acceleration on the Mac, highlighting how much more could be done right now. I aim to persuade you to think about Mac performance as a hybrid of hardware and software acceleration and perhaps shift your priorities a little in favor of pushing the envelope on code rather than silicon.

Decade of Macintosh Hardware Advances — Let’s start by seeing what can be done with hardware. How has Macintosh hardware improved in performance over the past 10 years?

The original 128K Mac had an effective speed of roughly 1/2 MIP. Today’s Quadra 950 provides about 8 MIPs. Of course, the Quadra 950 is relatively expensive, so on a real $/MIP basis, the growth is only eight-fold, equivalent to a yearly average improvement of 26 percent.

SCSI, NuBus, and AppleTalk speeds have changed less. SCSI may be about twice as fast as it originally was. The new Cyclone NuBus standard will give a four times performance boost. AppleTalk is basically unchanged. And, although EtherTalk has led to a high-speed network standard bandwidth that is roughly twenty times better than what we had in 1984, actual throughput is roughly only a factor of five better.

Typical RAM installation has grown from 128K to the current average of 6 MB, a 50 times growth, or about 50 percent per year. Access speeds of main storage have only improved about a factor of two (although caching has mitigated this otherwise fatal limitation).

Common hard drives seek about five times faster on average and have ten times the capacity of the first drives that shipped for the Mac Plus. The average transfer rate hasn’t improved by much more than a factor of two.

Overall, we might imagine a "Speedometer" increase of as much as a factor of 20 over the past decade (with perhaps much more than that for floating-point operations).

That’s not to say that hardware can’t make occasional big leaps, too. RISC processors will provide a roughly three times performance jump on one-third the die size, for an overall price-performance step of ten times in what will probably be a two to three year transition period. DSP can also accelerate certain processes by an order of magnitude.

But, taken all together, typical jobs on a constant-priced Mac have been able to be performed roughly 25 percent faster every year, solely because of technical advances in hardware and increased performance for the price. This means hardware performance doubles roughly every three years, a rate likely to continue for the foreseeable future.
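The yearly rates quoted above follow from ordinary compound-growth arithmetic. A minimal sketch, assuming a nine-year span (1984 to 1993) and using hypothetical helper names:

```python
import math

def annual_rate(total_factor, years):
    """Average yearly growth rate that compounds to total_factor over years."""
    return total_factor ** (1.0 / years) - 1.0

def doubling_time(rate):
    """Years needed to double at a constant yearly growth rate."""
    return math.log(2.0) / math.log(1.0 + rate)

# Figures quoted in the article; the 9-year span is an assumption.
YEARS = 9
price_performance = annual_rate(8, YEARS)        # eight-fold $/MIP improvement
ram_growth = annual_rate(6 * 1024 / 128, YEARS)  # 128K grown to 6 MB
years_to_double = doubling_time(0.25)            # 25 percent faster per year

print(f"$/MIP improvement: {price_performance:.0%} per year")
print(f"Typical RAM:       {ram_growth:.0%} per year")
print(f"Doubling time at 25% per year: {years_to_double:.1f} years")
```

Changing YEARS to ten shifts the rates down by a few points, so treat the results as rough averages rather than measurements.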

Software Advances — While hardware advances are relentless and pervasive, software improvements are often more specific in their impact. The performance results, however, can be dramatic.

For a familiar example, consider the case of ‘Find File’ running under System 6 versus System 7. For fun, we recently took a Mac Plus running System 7 and raced it against a Mac IIci using System 6. The System 7 software was running on hardware five years older than the System 6 version. Still, Find File went slightly faster on the Plus, because Find File is roughly ten times faster in its current form.

Unfortunately, it often takes a long time for well-known software techniques to enter the commercial sector. For instance, it was many years after the introduction of the first spreadsheet (VisiCalc) before sparse and virtual array techniques were used. If you wanted a 50 by 1,000 cell spreadsheet, you had to have 50,000 cells worth of RAM (say, 800K), even if most cells were empty.

Sparse techniques would have allowed you to use only the amount of memory taken by full cells, and virtual techniques to use disk space as well, at the cost of slower calculation. But the marketing war focussed on porting to new platforms and adding new features, not on saving RAM. A few engineer-years could have saved users tens of millions of dollars worth of RAM.
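A sparse array can be as simple as a table keyed by cell position. The sketch below is illustrative only (the class and method names are hypothetical, and it does not reproduce how any particular spreadsheet stores its data); it keeps an entry only for cells that actually hold a value:

```python
class SparseSheet:
    """Spreadsheet grid that stores only non-empty cells (hypothetical sketch)."""

    def __init__(self, rows, cols):
        self.rows = rows
        self.cols = cols
        self._cells = {}  # (row, col) -> value, only for cells that hold data

    def set(self, row, col, value):
        self._check(row, col)
        if value is None:
            self._cells.pop((row, col), None)  # clearing a cell frees its entry
        else:
            self._cells[(row, col)] = value

    def get(self, row, col):
        self._check(row, col)
        return self._cells.get((row, col))  # empty cells read back as None

    def stored_cells(self):
        return len(self._cells)

    def _check(self, row, col):
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            raise IndexError("cell outside sheet")

# A 1,000 by 50 sheet with only two cells filled in.
sheet = SparseSheet(1000, 50)
sheet.set(0, 0, "Total")
sheet.set(999, 49, 42.0)
print(sheet.stored_cells(), "of", sheet.rows * sheet.cols, "cells stored")
```

Memory use now tracks the number of filled cells instead of the sheet's dimensions. The price is a table lookup on every access, which is the slower-calculation trade-off mentioned above.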

Many new technologies which seem to arrive because of hardware advances are in fact largely enabled by software breakthroughs. We did a rough analysis of the increased performance in a variety of frontier technologies over the past five years and tried to assess what fraction of speed improvements came from software as opposed to hardware. We concluded that the software components for the various technologies were:

  Voice recognition         80%
  Handwriting recognition   80%
  Dynamic 3D graphics       60%
  Compression               50%

In all cases, some hardware improvement (e.g. DSP) was necessary to make the technologies practical, but better software, particularly better algorithms, was the most important enabling technology.

Components of Speed — Where does the speed come from? You can break the software design process into three components: algorithms, implementation, and compilation.

The largest range of performance difference comes from algorithm selection. This may also be the area of poorest performance in the industry today. Factors of 10 and 100 losses in performance are common. Why is this?

Consider the basic Order theory of algorithms. Every computer algorithm can be classed by Order. For example, an Order N algorithm takes twice as long when you run it on twice as much data. An Order N-squared algorithm takes four times as long. Lots of computational problems are easy to code as N-squared algorithms, but can be rewritten with difficulty to scale as NlogN.

A famous example was the introduction of the Fast Fourier Transform in the mid-60’s, an NlogN algorithm that replaced the previous N-squared algorithm.

A 1,024 point transform could thus be performed 100 times faster by this new software method. So this advance was comparable in speed to over 20 years of general-purpose hardware speed improvement. And, it was accomplished through a software change which, once developed, had no marginal cost over the prior solution.
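The arithmetic behind these figures can be checked in a few lines. This is a rough sketch that counts N-squared against NlogN (base 2) operations and ignores constant factors, which is an assumption; the 25 percent yearly hardware gain is the figure used earlier in this article:

```python
import math

def direct_transform_ops(n):
    """Operation count for the direct transform, taken as n squared."""
    return n * n

def fft_ops(n):
    """Operation count for the FFT, taken as n * log2(n)."""
    return n * math.log2(n)

n = 1024
speedup = direct_transform_ops(n) / fft_ops(n)

# Years of 25-percent-per-year hardware gains needed to match that speedup.
equivalent_years = math.log(speedup) / math.log(1.25)

print(f"Speedup for a {n}-point transform: about {speedup:.0f}x")
print(f"Equivalent hardware progress: about {equivalent_years:.0f} years")
```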

Unfortunately, plenty of commercial software ships every day containing inefficient algorithms. Sorting records in a database is a familiar example where NlogN algorithms can be used but aren’t always. When you scale your data from 10 to 100 records, pixels, or whatever, it means the algorithm may take 100 times longer to run, when it only needs to take twenty times longer.
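To see the scaling difference in practice, one can count comparisons rather than time the code. The sketch below uses selection sort as a stand-in for an N-squared method and merge sort as a stand-in for an NlogN method; both choices, and the random test data, are illustrative assumptions:

```python
import random

def selection_sort_comparisons(items):
    """Sort a copy with selection sort (Order N-squared); return comparison count."""
    data = list(items)
    count = 0
    for i in range(len(data)):
        smallest = i
        for j in range(i + 1, len(data)):
            count += 1
            if data[j] < data[smallest]:
                smallest = j
        data[i], data[smallest] = data[smallest], data[i]
    return count

def merge_sort_comparisons(items):
    """Sort with merge sort (Order NlogN); return (sorted list, comparison count)."""
    if len(items) <= 1:
        return list(items), 0
    mid = len(items) // 2
    left, left_count = merge_sort_comparisons(items[:mid])
    right, right_count = merge_sort_comparisons(items[mid:])
    merged = []
    count = left_count + right_count
    i = j = 0
    while i < len(left) and j < len(right):
        count += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two still has items
    merged.extend(right[j:])
    return merged, count

random.seed(1993)
small = [random.random() for _ in range(10)]
large = [random.random() for _ in range(100)]

quadratic_growth = selection_sort_comparisons(large) / selection_sort_comparisons(small)
nlogn_growth = merge_sort_comparisons(large)[1] / merge_sort_comparisons(small)[1]
print(f"selection sort: {quadratic_growth:.0f}x more comparisons")
print(f"merge sort:     {nlogn_growth:.0f}x more comparisons")
```

Selection sort always makes N(N-1)/2 comparisons, so its count grows 110-fold here, close to the hundred-fold figure. The merge sort count depends on the data but should land in the neighborhood of twenty-fold.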

It’s easy to see why it happens. From the technical perspective, debugging and benchmarking is often done on limited data sets that don’t reveal how badly the code will bog down in real world applications. And the real world constantly increases data set size, often at an exponential rate. Screen diagonal and pixel resolution are two common parameters which quadruple data set size when the parameters double.

Over in marketing, they know that software is not as rigorously benchmarked for speed as hardware, because comparisons are often more difficult to apply. So feature lists and time-to-market become disproportionately important factors.

Good algorithms are not enough. Implementation counts as well. For example, suppose you need code for looking up records in a database. An efficient algorithm for this is Order logN – doubling the number of records adds just one more step to the search.

The usual way to accomplish this is to index the records in a binary tree. Then you need to do log(2) N index lookups to get the location. To find a single record in a 1,000 record data base requires 10 lookups.

But, if each of these lookups involves a separate hard drive access, the implementation is poor, even though the algorithm is optimal. A better (and more typical) implementation would bring some or all of the directory information into RAM at the time of the first disk hit and cache it there for the next nine lookups. Whether or not you use an optimized algorithm, if the implementation is three times slower than necessary, the overall performance suffers by the same ratio.
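The difference between the two implementations can be illustrated by counting simulated disk accesses. This sketch substitutes a binary search over a sorted list for the binary tree described above, and it assumes the whole index can be fetched in a single disk hit; DiskIndex and its methods are hypothetical names:

```python
class DiskIndex:
    """Sorted key index on a pretend disk that counts every access."""

    def __init__(self, keys):
        self._keys = sorted(keys)
        self.disk_reads = 0

    def read_entry(self, position):
        self.disk_reads += 1      # one simulated disk access per index entry
        return self._keys[position]

    def read_all(self):
        self.disk_reads += 1      # one simulated bulk access for the whole index
        return list(self._keys)

    def __len__(self):
        return len(self._keys)

def binary_search(get, size, target):
    """Return the position of target, or -1, probing entries through get()."""
    low, high = 0, size - 1
    while low <= high:
        mid = (low + high) // 2
        value = get(mid)
        if value == target:
            return mid
        if value < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1

# Poor implementation: every probe goes back to the disk.
uncached = DiskIndex(range(1000))
binary_search(uncached.read_entry, len(uncached), 999)

# Better implementation: fetch the index once, then search it in RAM.
cached = DiskIndex(range(1000))
in_ram = cached.read_all()
binary_search(lambda i: in_ram[i], len(in_ram), 999)

print(uncached.disk_reads, "disk reads uncached;", cached.disk_reads, "cached")
```

Both versions run the same algorithm and make the same number of probes. Only the place where the probes land changes.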

Good implementation is often a matter of deep familiarity with the target hardware platform, a familiarity which is increasingly difficult to achieve as technology life cycles shrink ever shorter.

Also, the code we write is not the code the system runs. Between the two stands a compiler.

Within the Mac world one can find a range of commercial C compilers that vary by 30 percent or more in ultimate compiled code performance. To do better than that, one must write in assembler, and here the variations are even greater. To put it bluntly, it’s not hard to do a lot better than MPW.

Looking beyond the Mac, we must face the fact that much more effort has gone into optimizing 80x86 compilers than 680x0 products. As Windows has gained market share, more and more cross-platform benchmarks are being published of essentially identical source code compiled for Windows versus Mac and run on similarly powered CPUs. The Windows products tend to run faster because the compilers are, by and large, a little bit better. The most striking example I’ve seen was a recent PC Magazine benchmark of WordPerfect where the Windows advantage was substantial. This is not because of a superior operating system, but because of the availability of a better optimized compiler.

With the move from CISC to RISC architecture, and especially with the move to superscalar pipelines, ever more burden is placed upon the compiler. If sloppy compilers can be written for CISC machines, time-to-market pressures could produce sloppy RISC compilers too, and there the performance penalty would be even greater.

The trend in the software industry today is in the opposite direction of this theme. We are all sacrificing performance in favor of time-to-market. Object Oriented Programming is the epitome of this trade-off. Now, there’s nothing wrong with OOP, and it’s great that we’ll all soon be writing Newton applications by dragging and dropping resources from the object pool.

But OOP is an obvious formula for inefficient code. Witness the feel of the Finder in System 6 vs. System 7. In many applications I’ll guess that early products will be sketched in OOP and later, more mature products or versions will be coded at lower levels.

Lately we’ve been thinking about starting a development house that specializes in knocking off popular OOP-based products with C or assembler-based me-too versions. We’d be second to market but we’d win the benchmark wars every time.

System Software — System software is particularly important because of its pervasive impact on performance. Well-written, native-mode system calls are critical to good performance for a wide range of software products, and can to some extent overcome limitations imposed by inefficient compilers. If most of the computer’s time is spent in highly-optimized system calls, the inefficiencies of the calling program can easily be overlooked.

On the downside, many advances in system software have undermined performance. Windowing systems and multitasking both advance overall productivity, but add overhead which slows routine operation. The user gets new functionality, but it doesn’t come for free, and it affects all applications.

Moreover, advances often improve performance in ways that are difficult to define quantitatively. Both virtual memory and RAM disk technology can significantly enhance Mac productivity, but it’s hard to benchmark their contributions. For example, Connectix end-user studies of Virtual and MAXIMA customers indicate that either product can increase total work output per session by 5-20 percent, but results vary widely according to the type of work performed and the system configuration.

An area of particular interest to Connectix is the use of advanced, dynamic disk caching techniques, utilizing all of the often "wasted" RAM on computers to avoid unnecessary disk access. The benefits of this are two-fold:

First, disk accesses are usually a hundred to a thousand times slower than RAM accesses, so tremendous speed improvements can be achieved. Preliminary benchmarks on our Velocity caching product show an overall work throughput increase of about 25 percent. That’s not bad for a low-cost software extension considering what it costs to accomplish the same boost in hardware.

Second, caching has become increasingly important because of portable computing. PowerBook users will enjoy considerable battery life extension through the elimination of unneeded disk spin-ups, which typically account for 10 percent of power use in a battery-powered PowerBook session. Many PowerBook users also complain that their PowerBooks seem sluggish compared to comparable desktop systems – mainly, it appears, because of the random annoying delays of drive spin up.

The key to a successful caching strategy involves maximizing the available cache size and filling it with the data most likely to be called for next by the CPU. Velocity incorporates unique advances in both of these areas, which I look forward to discussing in the future.
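As a generic illustration of the second point, here is a least-recently-used (LRU) block cache, a common textbook policy for guessing which data will be wanted next. It is a sketch only and is not meant to describe how Velocity works; the class name and the fake disk function are hypothetical:

```python
from collections import OrderedDict

class LRUBlockCache:
    """Fixed-size cache that evicts the least recently used disk block."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self._read_block = read_block   # function that fetches a block from disk
        self._blocks = OrderedDict()    # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def get(self, block_number):
        if block_number in self._blocks:
            self.hits += 1
            self._blocks.move_to_end(block_number)   # mark as most recently used
            return self._blocks[block_number]
        self.misses += 1
        data = self._read_block(block_number)        # the slow path: go to disk
        self._blocks[block_number] = data
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)         # drop the least recently used
        return data

def fake_disk_read(block_number):
    """Stand-in for a real disk access."""
    return f"block-{block_number}"

cache = LRUBlockCache(capacity=3, read_block=fake_disk_read)
for block in [1, 2, 3, 1, 2, 4, 1, 3]:
    cache.get(block)
print(cache.hits, "hits,", cache.misses, "misses")
```

Every hit is a disk access avoided, which on a PowerBook can also mean a spin-up avoided. A larger capacity raises the hit rate, which is why cache size matters as much as the replacement policy.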

Input/Output — One of the most productive areas for software acceleration is in the I/O domain, both internal to the system, and over a network. After all, processing has three major steps – you get the information, then you process it, then you spit out the results. Two thirds I/O, one third processing.

Consider the following thought experiment: Watch a typical user for an hour. She opens files, launches applications, enters alphanumeric data, spell checks, calculates, sends email, closes windows. Now, double the processor speed. Maybe she’ll save 5 minutes out of the hour. Instead, suppose you double the I/O speeds – SCSI, ADB, AppleTalk, and NuBus. How much does she save then? Our testing indicates it’s also about five minutes, and it’s certainly within a factor of two of that either way for most sessions.

Moreover, a lot of the time saved will occur during periods when the user would be especially annoyed at delays. Most people are prepared to watch their clock spin a few seconds when calculating, but have less patience when saving or opening a document. The system just doesn’t seem to be working as hard then.

Hardware I/O speeds are generally not improving quite as fast as raw computation speeds. But a lot can be done in software here. Many I/O bottlenecks give 10 to 1 or even 100 to 1 speed delays. Even though they are only relevant to system operation a small fraction, say 10 percent of the time, addressing these bottlenecks can have a big impact. If you want a graphic example of this, compare benchmark data of third-party 25 versus 33 MHz accelerator boards. With a 33 percent higher clock speed, you often see benchmarks only 10 or 20 percent better, because I/O is setting the pace.
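The accelerator-board numbers are what one would expect if only part of each benchmark is limited by the CPU. A minimal sketch of that reasoning (this is the relationship usually called Amdahl's law; the CPU-bound fractions below are illustrative assumptions, not measurements):

```python
def overall_speedup(cpu_fraction, cpu_speedup):
    """Whole-job speedup when only the CPU-bound share of run time gets faster."""
    return 1.0 / ((1.0 - cpu_fraction) + cpu_fraction / cpu_speedup)

clock_ratio = 33 / 25   # 33 MHz board versus 25 MHz board

for cpu_fraction in (1.0, 0.7, 0.4):
    gain = overall_speedup(cpu_fraction, clock_ratio) - 1.0
    print(f"{cpu_fraction:.0%} CPU-bound: {gain:.0%} faster overall")
```

On these assumptions, a benchmark that spends 40 to 70 percent of its time waiting on the processor gains only about 10 to 20 percent from the faster clock, because the I/O share of the run is untouched.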

Networks — Enormous increases in network bandwidth are becoming available because of the introduction of new technologies, particularly optical transmission. The underlying structure of network data transmission on the Mac is starting to be strained by these capabilities.

I recently spoke with a vendor who successfully developed an attractive low-cost, high-performance FDDI card with about ten times the effective speed of today’s Ethernet systems. It failed as a product, however, because the throughput of the network was bottlenecked at both ends of the link by packet creation and decoding time. This seems like an area ripe for new software paradigms.

Video — There has been little improvement in the software that drives Mac video over the years. This reflects the fact that the Mac started with an excellent foundation, the original version of QuickDraw. Subsequent versions have improved screen draw times by about a factor of two, and big improvements in the future seem unlikely.

User/System — Finally, there is one bandwidth limitation which dominates all others in importance, one link in the I/O chain responsible for 99 percent of the wasted clock cycles in every Macintosh. This, of course, is the interface between the user and the system. Far outweighing compiler, implementation, and even swamping the effect of new algorithms is how efficiently a user can communicate her wishes to the machine, and how in turn the machine can let the user understand or appreciate the results and implications of those actions. The ultimate bandwidth limitation, and the single most important way to improve the total performance of the user-system combination is the user interface metaphor.

The Mac established its special position in the industry by virtue of its unique ability to address this one issue. Essentially, the key technology that enabled it to do so was software. But more remains to be done, and the pace of improvement in the last five years has not been particularly impressive. For all the two thousand engineer years that went into its development, is the Mac a lot easier to use under System 7 than it was before? I don’t believe so, and I hope we’re in for some paradigm shifting breakthroughs here. Personal computing could use such a shot in the arm today.

Conclusion — Time-to-market and feature list forces are driving software developers to work in ever higher-level programming languages and to pay less and less attention to the efficiency of the underlying code. Because hardware speed has increased over the years, they have been able to get away with this for some time.

But considering how much effort goes into pushing the speed envelope of the hardware, it seems like users would be well served if more emphasis were placed on software acceleration. In everything from mainstream applications to system software, users do care about speed and software will often be the best price-performance technology to provide it.
