Gideon Greenspan 25 March 2002

Bioinformatics and the Mac

Those of us with even a passing interest in science are used to the idea that computers play a central role in understanding physics and chemistry, especially the high-powered computation used in areas such as weather prediction and molecular visualization. However, over the past few years, a new target for that computation has emerged and begun to attract media attention. It’s called computational biology (or, more catchily, bioinformatics) and it refers to the digital storage, categorization, and analysis of biological data.

If your most recent encounter with biology took place in high school, you may be surprised by any such crossover with computing. Although I always found it fascinating, I remember biology never quite having the rigor of its counterparts in the science curriculum. Some cells did this, other cells had that, and different organisms did all sorts of strange things, especially when dissected by over-enthusiastic schoolchildren. But there seemed to be few universal principles equivalent in scope to Newton’s equations or the periodic table of elements.

Digitizing Life — Thanks to the wonders of molecular biology, many such fundamentals are now known to exist. An overview of some of the basics should give an impression of what is involved – bear in mind that we’re dealing with the natural world in all its complexity, so everything that follows has been vastly simplified.

Life as we know it is encoded in a set of long molecules called DNA, identical copies of which are found in every cell in a living organism such as a human being. Everything that happens within an organism can be traced back to its DNA – just like the hard disk in a computer. In humans, each cell contains 46 separate DNA molecules called chromosomes, analogous perhaps to hard disk partitions. Your chromosomes contain a mixture of information duplicated from those of your parents, which is one reason why you inherited so many of their characteristics.

Any one DNA molecule consists of a series of connected nucleotides forming a chain that can run to lengths of many millions. There are only four possible nucleotides, so any DNA molecule can be represented as a sequence using only four letters. This is where the digitization begins – the entire set of chromosomes for a human being can be stored in a few gigabytes of space (even less after compression) and you can even download a recent draft to your own computer.

<ftp://ncbi.nlm.nih.gov/genomes/H_sapiens/>
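
To make the storage estimate concrete, here is a minimal Python sketch of one way a four-letter sequence could be packed at two bits per nucleotide. The function names and the bit layout are assumptions chosen for illustration; they are not the format of the downloadable draft. At one byte per letter, three billion nucleotides take about 3 GB, and at two bits per letter they take roughly 750 MB.

```python
# Illustrative 2-bit encoding; the names and layout here are hypothetical,
# not the file format used by any genome database.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTERS = "ACGT"


def pack(seq):
    """Pack a DNA string into bytes, four nucleotides per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` nucleotides from packed bytes."""
    return "".join(
        LETTERS[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


def packed_size_bytes(nucleotides):
    """Number of bytes needed to store this many nucleotides at 2 bits each."""
    return (nucleotides + 3) // 4
```

The original length has to be stored alongside the packed bytes, because the last byte may be only partly used.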

According to present-day understanding, only a fraction of your DNA has a purpose – the other 98 percent or so is affectionately named "junk." The meaningful bits, known as genes, are short stretches scattered unevenly throughout the chromosomes (think of them as fragmented program files, if you like). They can be pretty hard to find: we have so far confirmed the existence of about 15,000 human genes, but scientists are still bickering over the total number, with most estimates lying around 30,000. There’s even a sweepstakes where you can add your own guess.

<http://www.ensembl.org/Genesweep/>

Genes are interesting because machinery in the cell translates them into another type of molecule called proteins. These proteins perform the organism’s real metabolic work and can be thought of as currently running programs. A protein molecule contains a series of connected amino acids forming a chain, similar to how nucleotides make up DNA. However, in contrast to DNA, proteins are made from 20 different amino acids and are rarely more than a few thousand such elements in length. Sequences of proteins are another type of digital data that bioinformatics regularly deals with.
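
The translation step can be sketched in a few lines of Python. This goes slightly beyond the simplified description above: the cell reads a gene three nucleotides at a time, and each triplet (a codon) selects one amino acid or signals a stop. The table below is the standard genetic code written with one-letter amino acid abbreviations. The function name is hypothetical, and the sketch skips the intermediate RNA copy and everything else the real machinery does.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Translate a DNA string into a protein string, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino = CODON_TABLE[dna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)
```

For example, `translate("ATGGCCTAA")` reads the codons ATG, GCC, and TAA, and returns `"MA"` because TAA is a stop codon.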

How are these proteins able to do all the work set out for them: building cells, transporting materials, sending signals and carefully managing each cell’s energy factory? When released into a cell’s watery innards, proteins fold up upon themselves, forming a huge variety of shapes that make them connect to other proteins and molecules in specific ways, catalyzing any number of chemical reactions. Trying to work out which shape a particular protein sequence will fold into is an extremely difficult problem. A biennial contest called CASP (a shortened acronym for Critical Assessment of Techniques for Protein Structure Prediction) is held between different research groups around the world, and IBM is building its fastest ever supercomputer to work on it, at a hoped-for rate of no more than one protein per year.

<http://predictioncenter.llnl.gov/>

<http://www.research.ibm.com/bluegene/>

Again, this is just a basic overview. If you’re thirsting for more information on molecular genetics, the U.S. Human Genome Project has published a good online primer.

<http://www.ornl.gov/hgmis/publicat/primer/toc.html>

Open-Sourcing the Human Being — With the basic biology lesson out of the way, let’s talk about how bioinformatics applies to the real world. One bioinformatics application you’ve probably heard of is the Human Genome Project. Its seemingly simple goal is to read the roughly three billion nucleotides that make up the human set of chromosomes. This is made possible by the fact that, even though there are millions of points at which healthy human DNA sequences can differ from one another, every one of us is identical in the other 99.9 percent of points. If you find that scary (or perhaps inspiring), remember that your DNA is also about 99 percent identical to the chimp at your local zoo.
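
As a rough check on those figures, a 0.1 percent difference across three billion nucleotides comes to about three million differing points. The Python sketch below shows that arithmetic, together with a hypothetical helper that measures percent identity between two sequences of equal length. Real comparisons must first align the sequences, which this sketch does not attempt.

```python
GENOME_LENGTH = 3_000_000_000  # roughly three billion nucleotides


def differing_points(length, per_thousand=1):
    """Expected number of differing positions at a given rate per thousand."""
    return length * per_thousand // 1000


def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)
```

Here `differing_points(GENOME_LENGTH)` gives 3,000,000, which matches the "millions of points" mentioned above.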

Discussions on the genome project began in 1984, but it was not until 1990 that the work began in earnest via an international collaboration of publicly funded laboratories in the United States, United Kingdom, France, Japan, Germany, and China. The public project moved along slowly until 1999, when Celera Genomics, a private venture, joined the fray. Armed with an improved experimental method and gobs of computing power, Celera promised to complete a first draft of the genome within a year. After much politicized mud slinging, a deal was made and the two groups’ results were published simultaneously in February 2001.

<http://www.ornl.gov/hgmis/>

<http://www.celera.com/>

What does all this have to do with bioinformatics? For a start, computers were required to store and index the resulting sequences and make them available to researchers around the world over the Internet. But the real algorithmic problem stemmed from the way in which DNA molecules have to be read. In the biological world, there is no such thing as a debugger which lets you freeze a cell and poke around inside, observing and manipulating at will. Instead, a series of steps must be cleverly combined for a scientist to gain access to a desired item of information.

For any DNA molecule, only about the first 1,000 nucleotides can be ascertained using available laboratory techniques. Longer sequences are scanned by making several copies of the molecule and breaking these up randomly into short fragments, each of which is read separately. The original order of these fragments is lost, so, after reading them, there remains the task of reconstructing the original sequence. It’s not unlike trying to rebuild an encyclopedia using a few photocopies which have been run through an office shredder – the number of possibilities to be tried is vast. Forget about trying to do it by hand – Celera’s draft build required about a week of running time on a 56-processor array with over 100 GB of memory.
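
The reconstruction task can be illustrated with a toy greedy assembler in Python: repeatedly find the two fragments with the longest suffix-to-prefix overlap and merge them. This is a sketch of the general idea only, not the method Celera or the public project used. The function names and the `min_len` threshold are assumptions, and the sketch ignores read errors, repeated regions, and the fact that fragments can come from either strand.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Merge fragments by their largest overlaps until none remain."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    # Drop fragments that are wholly contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while True:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return frags  # no overlaps left; may be more than one piece
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
```

Given the fragments `"GCTTACCGT"`, `"GATTACAGG"`, and `"ACAGGCTTA"` in any order, the sketch returns the single sequence `"GATTACAGGCTTACCGT"`. Comparing every pair of fragments on every pass is far too slow for millions of reads, which hints at why the real job needed so much hardware.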

The Human Genome Project is a classic example of a bioinformatics problem, and scientists are hopeful that the results will have many practical effects. An immediate consequence is increased speed in the development of new medicines by enabling scientists to home in quickly on potential drug target genes. It can also be expected to lead the way to personalized health care, as relationships are discovered between the genetic variations that exist between human beings and our susceptibility to certain diseases or treatment responses.

In the distant future, it opens up the possibility of curing disease and even tweaking ourselves through direct manipulation of our DNA. Naturally, the ethical issues raised are daunting and could wreak havoc with our basic notion of what it is to be a human being. However, this is also an area where the field of bioinformatics will shine: the storage, categorization, and analysis of the data promise to better inform the people who will be dealing with these ethical issues.

Apples are Growing — As interesting as all the above may be, you may be wondering what bioinformatics has to do with the Macintosh. Macs are already playing a large role in the bioinformatics domain and will probably continue to do so. Firstly, as with any other sector filled with independently thinking individuals, the scientific community has a high proportion of Mac users. This has been particularly true in biology, where until recently versatile graphics capabilities have been more important than raw computing power.

Nonetheless, until recently the Macintosh had one critical limitation regarding its long-term suitability in the field: the natural preference of bioinformaticians for Unix-based platforms. This is firstly a result of the availability of free, reliable Unix tools such as perl and grep, which make it highly suitable for processing large quantities of text-oriented data. Furthermore, since the explosion of activity in computational biology began around 1995, exactly when the Internet was establishing itself as a mainstream platform for scientific collaboration, the vast majority of bioinformatics applications run over the Internet. Unix’s stable and efficient implementation of TCP/IP, in conjunction with the free Apache Web server, makes it ideal for providing these Web-based services. For some idea of what’s available, take a look at the site of the American National Center for Biotechnology Information.

<http://www.bioperl.org/>

<http://www.apache.org/>

<http://www.ncbi.nlm.nih.gov/>

It should be fairly obvious where this takes us: Mac OS X, soon to be the mainstream Macintosh operating system, is not only based on Unix but provides full support for all of its tools – perl, grep, and Apache included. On its own, this does not necessarily place it ahead of other Unix platforms. But if we add the fact that it contains a modern user interface and runs desktop applications such as Microsoft Office and modern Web browsers, it’s not hard to see why Mac OS X is a natural choice for bioinformatics servers and desktops. This has been noted in several places, including an O’Reilly Network article and an Apple viewpoint article. It’s also proven to be more than wishful thinking: Genentech, the company that ordered 1,000 new iMacs (and whose Chairman and CEO is one of Apple’s board members), is one of the founders of the biotechnology industry.

<http://www.oreillynet.com/pub/a/mac/2001/12/14/macbio.html>

<http://www.apple.com/scitech/stories/osxporting/>

<http://www.genentech.com/>

A further bonus for Macs is that the PowerPC G4 processor, with its Velocity Engine processing unit, is ideal for many types of biological computations. BLAST (short for Basic Local Alignment Search Tool) is probably the most popular bioinformatics tool available today. It takes the sequence of a DNA or protein molecule as input and searches for other known molecules which are likely to be connected in evolutionary origin or biological function. Apple’s Advanced Computation Group, in collaboration with others, developed a high-throughput version of BLAST, which they claim makes a dual 1 GHz Power Mac G4 up to five times faster than a PC with a 2 GHz Pentium 4 processor. Fast BLAST searches are crucial to today’s biologists.

<http://www.apple.com/pr/library/2002/feb/07blast.html>
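
To give a feel for what a similarity search involves, here is a small Python sketch that scores the best local alignment between a query and each sequence in a database, then ranks the results. It uses the exact Smith-Waterman dynamic programming method with simple linear gap penalties. BLAST itself relies on faster heuristics rather than this exhaustive calculation, so treat this as the problem BLAST approximates, not as BLAST. The function names and the scoring values (match, mismatch, gap) are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def search(query, database):
    """Rank named sequences by local alignment score, best first."""
    scored = [
        (local_alignment_score(query, seq), name)
        for name, seq in database.items()
    ]
    return sorted(scored, reverse=True)
```

The running time grows with the product of the two sequence lengths for every database entry, which is why heuristic tools and vector hardware matter so much once the database holds millions of sequences.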

Try This at Home — There is at least one way in which all Mac users can get involved in computational biology. A project named Folding@Home, developed in the same style as U.C. Berkeley’s alien-searching SETI@home, lets you contribute to a distributed effort to calculate the physical structure of protein sequences. Folding@Home’s Mac OS X client, a screensaver and application, is now available and provides a real-time graphical view of the structures being tested.

<http://folding.stanford.edu/>

<https://tidbits.com/getbits.acgi?tbart=05401>

That aside, unless you happen to be involved in the academic or commercial computational biology world, the bioinformatics revolution will remain, for now, a distant blip on your daily horizon. But don’t expect it to stay there forever – if the promise of the field is even partially fulfilled, you will start seeing its effects seeping into your daily life.

<http://www.ornl.gov/hgmis/education/education.html>

<http://www.ncbi.nlm.nih.gov/Education/>

<http://dmoz.org/Science/Biology/Bioinformatics/Education/>

[In one life, Gideon Greenspan is the persona behind Sig Software, a Macintosh shareware company which develops products such as Drop Drawers, Classic Menu, Email Effects, and NameCleaner. In the other, he is a Ph.D. student of bioinformatics in the Computer Science department of Israel’s Technion. He hopes one day to overcome this dichotomy!]

<http://www.sigsoftware.com/>
