Chemists would be much more productive

Author: Coral Higgins
by William J. Cromie

The application of artificial intelligence to problems of chemical analysis and synthesis is having dramatic impact on many operations that were once tedious and prohibitively time consuming.

Chemists would be much more productive if they could see what they do. Virtually all of their work involves the interaction of molecules whose structure determines their activity, but these structures often are unknown and always are invisible. For example, the ability of a drug to react with a key enzyme and destroy a disease-causing bacterium depends on the interlocking of the two molecules. If chemists knew the details of the molecules' three-dimensional shapes down to the last atom, they would be on their way to the design of more effective drugs. The same holds true for making better pesticides, herbicides, synthetic fibers, plastics, and other products.

The ability to discern structure and watch molecules interact is not enough. Chemists also face the task of planning the best way to synthesize a compound. Thousands of starting materials can be combined in millions of ways. Chemistry would be more of a science and less of an art if chemists could call up all the possible routes to a target molecule, whether or not these routes had ever been used before.

For the past dozen years, researchers have been developing computer systems designed to open the eyes of academic and industrial chemists and to help them achieve these goals. Some of the systems speed tedious analyses to determine the structures of molecules. Others depict invisible molecular shapes as stereo color images that can be rotated, translated, and tilted much as spaceships are maneuvered in a video game. Other programs are being tested and used for the design of new compounds, the syntheses of complex molecules, and the prediction of reaction products in a chemical factory or in a human body.

As chemists struggle with the problems of adapting such systems for routine use, computer scientists work on easier-to-use (they call them friendlier) and more intelligent systems. The goal of the computer-chemistry field is to invent a data-in, answer-out black box that would, for example, allow chemists to feed in a molecular formula and without further effort obtain its structure or a plan by which it could be synthesized. No such black box exists, and it may not exist for a long time, but various versions of what will eventually become part of the inner workings of such a system already are easing some of chemistry's trial-and-error burden.

Applications. The T-shaped thyroid (left) fits spaces in its carrier protein, albumin, as a pimiento fits an olive. A view down the axis (bottom left) of the double-helical DNA molecule and another view of the molecule rotated, displaying its major and minor grooves.

Machine intelligence

The computer began its role in chemistry as an automatic librarian. At first it performed searches of scientific literature and lists of compounds. Today, data banks hold information on molecular structure, test results, and chemical reactions. The American Chemical Society's Chemical Abstracts Service, for example, offers access to a data base containing information on the structure of more than five million compounds. Using a nationwide system known as Telnet, researchers can telephone the Chemical Information System, developed by the National Institutes of Health and the Environmental Protection Agency, and obtain all the information available on 150,000 frequently used compounds regulated by the federal government. With the newer Reaction Access System, a chemist can get data about reactions in which a compound has taken part.

Library systems simply store information and release it on request. Computers possess far greater capabilities, and as early as the 1960s computer scientists began applying the techniques of artificial intelligence to problems in chemistry. The data base for such systems consists of a large body of knowledge in a specific area obtained from the literature and from questioning experts. Such knowledge-based artificial intelligence systems contain both facts and heuristic knowledge: empirical rules for problem-solving gained from practical experience. (See "Programmed to Think," Mosaic, Volume 11, Number 5.)

The first effort in this area originated with the work of Edward A. Feigenbaum, chairman of the computer science department at Stanford University, and Joshua Lederberg, then chairman of Stanford's genetics department. Chemistry attracted Feigenbaum because much factual knowledge already existed in numerical form, easily readable by computers. Lederberg, a chemist, carried a wealth of heuristic rules in his head. The two later recruited Stanford organic chemist Carl Djerassi to help get the knowledge base into the computer. Their first project, dubbed Dendral, for Dendritic Algorithm, constituted the first fully automatic approach to structural analysis. The program uses data containing structural clues from which it generates all relevant possible structures.
MOSAIC January/February 1983

Originally, Dendral used mass spectrometry data. A spectrometer separates a molecule into ionized fragments and produces a spectrum in which the mass of each ion is plotted against its relative abundance. The chemical composition of a molecule can be ascertained directly from the spectrum, but the structure is not evident. In other words, a chemist can determine which atoms and atomic groups make up the molecule but cannot determine their spatial arrangement. Structural features determine the pattern of fragmentation, however, and spectra do contain structural information. One way to obtain this information is to reassemble the fragments, or substructures, in all possible ways and then determine which of the assemblies produces the initial spectra. This would be a herculean task for a human but not for a large computer.

Dendral takes a molecule's atomic groups, whose structures are known, and reassembles them in every possible way. It uses its heuristic rules to predict a spectrum for each assembly, and it compares predicted and real spectra. Ideally, only one predicted spectrum will match the real one. But many predicted structures are so similar that this technique may generate hundreds, even thousands, of plausible candidates.
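
The generate-and-test loop described here can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the group masses, the linear-chain model, and the one-line fragmentation rule in `predict_spectrum` are all hypothetical stand-ins, not Dendral's actual heuristics or data.

```python
from itertools import permutations

# Hypothetical integer masses for a few atomic groups (illustrative only).
GROUP_MASS = {"CH3": 15, "CH2": 14, "CO": 28, "OH": 17}

def predict_spectrum(chain):
    """Toy fragmentation rule (an assumption, not Dendral's): cutting any
    single bond of a linear chain yields the left-hand piece as an ion.
    Returns the sorted list of distinct ion masses."""
    masses = set()
    running = 0
    for group in chain[:-1]:
        running += GROUP_MASS[group]
        masses.add(running)
    return sorted(masses)

def matching_assemblies(groups, observed):
    """Reassemble the groups in every order and keep the assemblies whose
    predicted spectrum equals the observed one."""
    target = sorted(observed)
    seen = set()
    matches = []
    for chain in permutations(groups):
        if chain in seen:  # repeated groups produce duplicate orderings
            continue
        seen.add(chain)
        if predict_spectrum(chain) == target:
            matches.append(chain)
    return matches

if __name__ == "__main__":
    groups = ["CH3", "CH2", "CO", "OH"]
    print(matching_assemblies(groups, [15, 29, 57]))
```

With a rule this crude the observed peaks happen to single out one ordering; with realistic molecules and realistic rules, as the text notes, many assemblies predict nearly the same spectrum and the list of survivors stays long.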

Cromie is executive director of the Council for the Advancement of Science Writing.

Making it manageable

To pare down the list, the Dendral team is adding to the program the nuclear magnetic resonance spectrum of compounds in question. Spectra are predicted for each candidate structure and matched against the actual ones. This constrains the list of structural possibilities to those that produce both a known mass spectrum and a nuclear magnetic resonance spectrum. In one experimental analysis, the computer generated a list of one-and-a-quarter million candidate structures for a compound based on knowledge of organic chemistry principles alone. When the mass spectrometry and nuclear magnetic resonance spectra were added, the list of plausible structures was reduced to one.

At the outset, Dendral's capability was limited to a restricted number of molecular families. It could not handle cyclic molecules, those containing rings or closed chains of atoms as their principal structural backbone. This restriction was a significant gap; most important biological compounds are cyclic.

"Without computer assistance, a chemist would sit down with pencil and paper and try to determine all of the possible structures a cyclic molecule could have," explains Robert S. Engelmore, a computer scientist with Teknowledge, Inc., of Palo Alto, California. "The chemist would then publish a paper stating that the unknown structure must be one of the five or ten that he had laboriously worked out. But no one could be sure he had thought of all the possibilities."

In 1975, Stanford chemist Dennis H. Smith and his colleagues created a program that generated an exhaustive list of all possible structures for a cyclic compound. He used it in a review of papers in the Journal of the American Chemical Society that described manually determined structures. The computer program did a better job of discovering all the possible structures than any human could.

To use this program, known as Congen, for Constrained Generation, a chemist feeds the system all the structural information that can be established unambiguously from available data. The computer searches this list for known substructural features and puts all that it finds on what is termed a "goodlist." It places known substructural features that are missing on a "badlist." The goodlist might indicate that the unknown molecule is a ketone, but the badlist might specify that it is not a methyl ketone. The Congen program takes the goodlist of constraints and generates all molecular structures possessing those features. The program then predicts a mass spectrum for each candidate and compares each one to the spectrum of the compound being analyzed. The result is a short list of structures ranked in order of plausibility. An experienced chemist can usually handle this list manually to find the most likely structure.

More than 20 academic and industrial research laboratories in the United States, Europe, and Australia are using Congen in such areas as analysis of marine products, antibiotics and other drug derivatives, narcotic analogues, and brain endorphins. In the United States, the program can be purchased commercially or entered through terminal and telephone connections to the Stanford University Experimental Computer for Artificial Intelligence in Medicine.
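
The goodlist and badlist idea can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the candidate table and feature labels are hypothetical, and the sketch filters a ready-made list, whereas Congen as described generates structures from the constraints in the first place.

```python
# Hypothetical candidates, each tagged with the substructural features it contains.
CANDIDATES = {
    "2-butanone": {"ketone", "methyl ketone"},
    "3-pentanone": {"ketone"},
    "butanal": {"aldehyde"},
}

def satisfies(features, goodlist, badlist):
    """A structure passes if it has every goodlist feature and no badlist feature."""
    return goodlist <= features and not (badlist & features)

def constrained(candidates, goodlist, badlist):
    """Return the names of the candidates that survive both lists."""
    return [
        name
        for name, features in candidates.items()
        if satisfies(features, goodlist, badlist)
    ]

if __name__ == "__main__":
    # "It is a ketone, but not a methyl ketone."
    print(constrained(CANDIDATES, {"ketone"}, {"methyl ketone"}))
```

Expressing both lists as sets keeps the test to two set operations, which is why adding one more unambiguous constraint is cheap for the user and can sharply shorten the list of survivors.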

Paring further

"Dendral and Congen show that computer programs equal the performance of experts in certain problems of molecular-structure elucidation," says Feigenbaum. "They can solve highly complex problems, such as analysis of estrogenic steroids and other biologically active compounds. The programs may not know as much as an expert does, but they succeed because of their thorough application of the rules they do know."

Congen has a major drawback: It cannot handle structural features that share bonds or atoms, such as adjacent rings having a common side. When a chemist using Congen pinpoints two substructures in a molecule being analyzed, he must be sure they are separate and not overlapping. "This asks a lot of him," notes Stanford chemist James G. Nourse.

To eliminate this disadvantage, Nourse helped to develop a program called Genoa, for Generation of Overlapping Atoms. Available commercially and through Stanford's experimental medical system, this program generates all possible structures whether or not the constraints involve information on overlapping substructures in the molecule. "Genoa has not yet replaced Congen in academic and industrial laboratories," Nourse says, "but it is more efficient, easier to use, and capable of solving problems that Congen cannot do easily, if at all."

Genoa generates not one nor a million candidate structures, but hundreds. If the program ever is to be used in a comprehensive black-box system, it must be combined with a unit that can further pare the list. Nourse's group is developing a program that uses mass spectrometry data and carbon-13 nuclear magnetic resonance spectra to do the paring. Genoa, combined with the new program, then, will result in a kind of primitive black box; the user feeds it available structural data, and it delivers a list of candidate structures short enough to be handled easily by a competent chemist. As other types of constraint programs become available, such as those that handle proton nuclear magnetic resonance spectra, they can be incorporated.

Learning from nature

As Congen and then Genoa made it easier to generate plausible structures, the speed of programs used to predict the spectra and to test the predictions they produced proved inadequate. Computer scientists had developed specific rules for predicting the ways structures fragment, drawing on conversations with specialists or from examination of the scientific literature. "To continue doing this on a one-by-one basis for each class of molecule would take until the twenty-first century," Engelmore says.

So Feigenbaum and Stanford computer scientist Bruce G. Buchanan tried to design a program that would examine many examples of spectra and come up with general rules for molecular fragmentation. They wanted to determine if a machine could derive its information directly from nature instead of from experts.

The first part of the program Feigenbaum and Buchanan developed, named Metadendral, collects and summarizes data on the fragmentation of known structures. The second part generates rules to explain these data. The third part tests and modifies the rules, disregarding those that overlap and those that produce negative energy balances. The resulting rules describe the bonds that break and the substructures that form in mass spectrometry.

According to Buchanan, the machine-generated rules "are as good as those generated by human experts." He and his co-workers proved this by using Metadendral to recreate fragmentation rules formulated in the traditional way for two classes of compounds: aliphatic amines and estrogenic steroids. In more stringent tests, the program examined spectra from three classes of steroids for which no theory of fragmentation existed. The new rules derived by Metadendral effectively described the behavior of these compounds in a mass spectrometer.

"This is the first case in chemistry of a theory successfully generated by machine," Engelmore notes. The machine cannot generalize across significant structural differences, however; a separate theory must be generated for each class of organic compounds. To meet this challenge, research on Metadendral has been replaced by work on a project so new that it does not yet possess its own acronym: an effort to develop rules for the interpretation and prediction of nuclear magnetic resonance spectra. Initially geared only to carbon-13 spectra, the system has had more success with prediction than with interpretation.
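
The three-part structure of Metadendral (summarize, propose, test) can be mimicked with a toy Python sketch. Everything specific here is an assumption made for illustration: the bond-type labels, the observations, and especially the support threshold in `refine`, which merely stands in for the overlap and energy-balance tests the article mentions.

```python
from collections import Counter

def summarize(observations):
    """Part 1: count how often each bond type was seen and how often it broke."""
    seen, broke = Counter(), Counter()
    for bond_type, did_break in observations:
        seen[bond_type] += 1
        if did_break:
            broke[bond_type] += 1
    return seen, broke

def propose_rules(seen, broke):
    """Part 2: propose one candidate rule per bond type that ever broke,
    scored by the fraction of observations in which it broke."""
    return {bond: broke[bond] / seen[bond] for bond in broke}

def refine(rules, min_support=0.75):
    """Part 3 (stand-in): keep only rules backed by most observations."""
    return sorted(bond for bond, support in rules.items() if support >= min_support)

if __name__ == "__main__":
    # Hypothetical observations: (bond type, did it break in the spectrometer?)
    data = [
        ("C-C next to N", True), ("C-C next to N", True),
        ("C-C next to N", True), ("C-C next to N", False),
        ("plain C-C", False), ("plain C-C", True), ("plain C-C", False),
        ("C-H", False),
    ]
    seen, broke = summarize(data)
    print(refine(propose_rules(seen, broke)))
```

The point of the sketch is the data flow, in which rules come out of examples rather than out of interviews with experts; the scoring itself is deliberately simplistic.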

Floppy molecules

Stanford researchers are also wrestling with the problem of floppy molecules. These molecules are the many biologically important, flexible structures that can change shape as conditions change. Proteins, for instance, contain chains of atoms that flutter or rotate about their bonds. The entire molecule also vibrates (breathing, as chemists call it), and it can change its shape by continuous alteration of such properties as atomic distances and bond angles. Any black box that really was one would have to deal with such changes.

Nourse handles the problem by representing floppy molecules as a matrix of distances between atoms. The distances must be consistent with what is known about a molecule, and changes in the distances are limited by energy balances and other constraints in the program. On this basis, Nourse expects to generate lists of conformational isomers, all the shapes a floppy structure can assume.

The problem is focal to some of chemistry's most important processes. "The binding of important molecules, such as catalysts, is exclusively sensitive to shape," points out Richard J. Feldmann, a computer specialist at the National Institutes of Health. "The effect of a molecule in one conformation may be nil, while in another it may catalyze or inhibit a vital reaction. Shapes required for specific biological activity have been selected over millions of years."

Feldmann makes movies depicting the motions of molecules. "Fluttering or rotation occurs in picoseconds," he says. "Breathing occupies milliseconds, and folding takes tens of seconds. It requires 40 hours on a large computer to make a movie showing 30 picoseconds in the life of a very small protein. Dynamic modeling of all the energetic states of a molecule is necessary to completely understand its functioning, but such modeling requires a supercomputer."
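
The distance-matrix representation of a floppy molecule can be sketched as follows. This is a minimal illustration, assuming hypothetical coordinates and hypothetical lower and upper distance bounds; it is not Nourse's program, and it replaces the energy constraints with simple per-pair limits.

```python
from math import dist

def distance_matrix(coords):
    """Pairwise interatomic distances for one conformation."""
    n = len(coords)
    return [[dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

def within_bounds(matrix, lower, upper):
    """True if every atom pair lies between its lower and upper distance limit."""
    n = len(matrix)
    return all(
        lower[i][j] <= matrix[i][j] <= upper[i][j]
        for i in range(n)
        for j in range(i + 1, n)
    )

if __name__ == "__main__":
    # Hypothetical limits for a three-atom chain: bonded pairs near 1.5 units,
    # the end-to-end pair free to range between 2.0 and 3.0.
    lower = [[0, 1.4, 2.0], [1.4, 0, 1.4], [2.0, 1.4, 0]]
    upper = [[0, 1.6, 3.0], [1.6, 0, 1.6], [3.0, 1.6, 0]]
    bent = [(0, 0, 0), (1.5, 0, 0), (1.5, 1.5, 0)]
    collapsed = [(0, 0, 0), (1.5, 0, 0), (0.5, 0, 0)]
    print(within_bounds(distance_matrix(bent), lower, upper))
    print(within_bounds(distance_matrix(collapsed), lower, upper))
```

Because distances do not change when the whole molecule is rotated or translated, a conformation is described by its shape alone, which is what makes the matrix a convenient object for enumerating the shapes a flexible structure can assume.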

Protein crystallography

Until massive computing power becomes widely available, chemists must work with snapshots of molecules, structural images of a shape responsible for a specific activity. Such a snapshot can be produced by crystallizing a protein, bombarding it with X rays, and determining its three-dimensional structure from the diffraction pattern. The diffraction data are translated into a series of two-dimensional electron density maps representing closely spaced contours of the molecule. The final step involves construction of a wire or ball-and-stick model of the protein, a difficult and frustrating chore requiring a month or more.

In 1976, Feigenbaum and Engelmore set out to develop a system for determining protein structure from electron density maps using artificial intelligence techniques. Beginning with the knowledge-base route they had used for Dendral and other programs, Feigenbaum interviewed people who had built models from maps. "But we quickly found that we could not capture their expertise in a computer program," Engelmore recalls. "No one used any step-by-step procedure that could be translated into rules. Each person developed his own method, which involved a great deal of staring at the maps and a visual kind of reasoning."

Engelmore and his colleagues eventually created a program called Crysalis, based on what he calls an "opportunistic, or jigsaw puzzle, approach. You begin with the amino acid sequence of the protein, which can be determined by chemical analysis," he explains. "This is a global constraint, like the picture on the puzzle box. Then you look for parts that have a high probability of being correct, such as edges or corner pieces in the jigsaw puzzle. These would be places in the data where you find some meaning." A heme group, for example, appears on the electron maps as a flat, lacelike structure around an extremely dense spot, the iron atom at its center.

The program builds up a structural hypothesis piece by piece as more and more information is gleaned from the maps, the amino acid sequence, and the emerging substructures. The program also uses information about the structure of related proteins. "You keep adding bits of structure and connecting them," Engelmore remarks. "As in the case of the puzzle, it becomes easier as you get more pieces into place."

In the first test of its intelligence, Crysalis found almost all the structural elements of a small protein of known structure. Later, Allan Terry of the University of California at Irvine used it to determine the structure of cytochrome c2, a 112-amino-acid protein. Engelmore describes the program as "a demonstration prototype." But, he says, "it is easy to imagine an X-ray diffractometer hooked to a system that makes electron density maps and feeds them to a program like Crysalis, which works out the structure and displays it as a stereo diagram on a computer screen. All the elements of such a black box exist at present; it's only a matter of time, though probably a long time, before they are assembled into a single system."

Knowledge and graphics

Meanwhile, computers are becoming friendlier to chemists: They will accept statements and instructions in basic English rather than high-level programming languages, and chemists can communicate with them using their favorite mode of notation, a diagram of points and lines representing atoms and the bonds between them. In a system developed at the University of California at San Francisco, a chemist draws a diagram of a structure with a light pen on an electrostatic table, and a stereo color model of the molecule appears on a screen. Two such models can be displayed, rotated, and moved around each other to show how the molecules interact. Sitting at a console equipped with two joysticks, a chemist can manipulate the molecules like a player in a neighborhood electronic game arcade.

Robert Langridge, director of the computer graphics laboratory at the University of California at San Francisco's school of pharmacy, employed this system to model the docking and binding of thyroxine to the protein, prealbumin, that transports the thyroid hormone to its target organs. Prealbumin, shown in one color, contains pockets for holding the iodine atoms in thyroxine, depicted in another color. As the late Eugene C. Jorgensen described it, "the binding site of the transport protein is shaped like an olive with the core plugged out, and the hormone fits into that hole like a pimiento." This was the first detailed molecular model for the interaction of a hormone with a biologically relevant protein.

Jorgensen searched this model for regions that could be modified to produce an analogue of thyroid hormone that could be employed to speed development in premature infants. Such infants, whose lungs may not have developed fully before birth, experience respiratory distress syndrome, a principal cause of their illness and death. Thyroxine bound to prealbumin is too large a package to cross the placenta, so Jorgensen wanted to make an analogue that could be administered prenatally. With the help of a computer program, he designed a growth-hormone carrier package that crosses the placentas of pregnant rabbits and speeds lung maturation in their fetuses. David Ballard, a pediatrician at the University of California at San Francisco, hopes to use such thyroxine analogues to treat infants who do not respond to steroid hormones, a standard treatment for the problem.

Langridge is using the graphics system to study the binding of proteins and DNA. "The prealbumin-thyroxine interaction is a good model for proteins in the nucleus that bind to DNA," he explains. "It has enabled us to determine the structure of a protein that binds to a known sequence of DNA in a bacterial virus and switches that particular gene on and off."

Herbert W. Boyer, a pioneer in recombinant DNA technology, uses computer graphics to investigate the interaction between DNA and restriction enzymes used in gene splicing experiments. At Stanford, researchers combine graphics and artificial intelligence in a program called Molgen, for Molecular Genetics, designed to analyze DNA structures and to plan cloning and other experiments.

Designing drugs

A well-worn metaphor compares a drug binding a receptor to a key fitting a lock. A researcher can practice biological locksmithing by manipulating atomic groups on a graphic display of a molecule until the molecule slides smoothly into a substrate. In the living world, the keys might not reach the lock (no transport mechanism), be inserted in the wrong lock (nonspecific binding), be bent or altered (metabolized), or be lost (excreted).

"Nevertheless, the lock-and-key analogy is a useful starting point," comments Peter Gund, senior research fellow at the Merck Sharp & Dohme Research Laboratories in Rahway, New Jersey. "Computer-assisted modeling helps scientists understand the complex relationships between drugs and receptors. For example, when the structure of a receptor is unknown, it may be inferred from the shape of the key-drugs that fit it."

Computers also help drug designers deal with floppy keys and locks. "Dynamic reshaping of both may be required for effecting biological activity," Gund notes, "and such reshaping can be more easily handled with computer graphics. In this sense, a better analogy would be a combination lock."

Most drugs possess electrostatic charges that match complementary charges on the receptor. Gund compares this to a key with magnets that engages a magnetic tumbler in a lock. As a drug approaches its receptor, the electrostatic fields become perturbed, which may change the reactivity of the drug.

Computer systems that have begun to deal effectively with these problems exist at a number of university and commercial laboratories. Merck's Rahway laboratory has developed a system that Feldmann refers to as "a shining example of what should happen in rational drug design." Chemists regard Merck's Molecular Modeling System as friendly because its use requires a minimal knowledge of computers.

When a chemist draws a rough diagram of a molecule with a light pen, the Merck system automatically calculates the correct atomic distances, bond angles, and energy functions to produce a stereochemically correct model on the display screen. These can be compared to related crystal structures compiled from X-ray data, either from the Cambridge Crystallographic Data File, available through the Chemical Information System, or from Merck crystallographers. Merck also maintains two data bases of its own. One holds 150,000 structures produced by the company; the other contains test results and biological-activity data on these structures. A program known as Compare superimposes modeled structures to find three-dimensional atomic patterns associated with the drugs' biological activity.

"It's always easier to do computer experiments than to do laboratory experiments," Gund observes. In precomputer days, drug development involved empirical testing, or screening, of thousands of substances to detect one showing a new type of desirable biological activity. This process was followed by systematic chemical modification of the best compound found in order to optimize its properties. "With computer systems," says Gund, "we can more readily identify structures with the desired activity, then combine and modify them in ways that lead to the discovery of new compounds or enhancement of the properties of known ones."

The company recently employed its system to create a potential diabetes medication. The compound is an analogue of somatostatin, a hormone that inhibits the release of glucagon. Glucagon raises blood-sugar levels by increasing production of glucose. An analogue is needed because somatostatin degrades too quickly in the body to be effective. A team at Merck's West Point, Pennsylvania, laboratory, led by Daniel F. Veber, approached this problem by first isolating the active portion of the somatostatin molecule. Veber's group then employed Compare to model possible structures that could hold the active portion together and prevent its metabolic degradation.

"There are so many possible structures that I'm not sure that we would have found the correct one without the help of the computer," Veber observes. Months of computer analysis, laboratory synthesis, testing in rats, refining of the structure, and retesting finally yielded an analogue with greater potency than somatostatin. It also showed increased duration of action and decreased metabolic degradation. The compound is now being tested as a treatment for juvenile-onset diabetes.

Tribble

Commercial molecular modeling systems are not limited to design of new drugs. Chemists at E.I. du Pont de Nemours, Incorporated, use a group of 300 programs for work on pesticides, herbicides, plastics, films, and fibers, as well as on drugs. The programs are brought together under an executive system dubbed Tribble, for the friendly furry creatures in the Star Trek television series. David A. Pensak, chief architect of the system, gave it the name because "it is friendly to chemists. Tribble has more than 100 happy in-house users and [has] thousands of compounds running through it at one time."

The system converts a light-pen drawing of a real or hypothetical molecule into a rough three-dimensional representation of the structure. It also determines the molecular shape having the least strain energy, the conformation most likely to be realizable. House molecules are filed along with structures from outside data banks. (These include the Cambridge Crystal File of more than 28,000 X-ray crystal structures, the Protein Data Bank library of enzyme and other protein structures compiled by Brookhaven National Laboratory, and the Protein Sequence File, which contains all published protein and nucleic-acid sequences.)

These structures can be fed to a variety of programs that compute molecular properties such as area, volume, and charge distribution. Other programs perform quantum mechanics calculations, pattern recognition, molecular similarity searches, and superimposition of structures. Output programs display and manipulate molecular structures that can be viewed from various angles, rotated, or translated. There is also a component that permits researchers to create analogues by modifying segments of molecular structure on a screen without the need to redraw the entire molecule.

Du Pont chemists use Tribble, which displays molecules comprising as many as five thousand atoms, to determine precisely why herbicides they designed in the traditional way are active. A polymer-products group employs the system to design a molecular film to separate water from ethanol and make alcohol fuels more economical. Another team works on nonaddicting analogues of morphine.

"Du Pont has made a large commitment to the design and manufacture of biochemicals," Pensak notes, "and computers will play a prominent role in this effort. We view the machines as internal consultants that make chemistry easier and more effective. But we always remind ourselves that computing is not a substitute for thinking, and computers are no substitute for an experienced, intuitive chemist. The best system produces only a model of reality; that model has to be synthesized and tested to have any utility."

Making it with computers

Once chemists roll up their sleeves to turn a computer model into a new compound, how do they begin? Starting materials for synthesizing even relatively simple organic compounds may number in the thousands. Once a start is made, there can be thousands of second steps. Chemists with experience in making related compounds prefer certain starters and reactions, but these may not be the most efficient ways to synthesize a new molecule.

Virus model. A computer representation of an adenovirus.

W. Todd Wipke, a chemist at the University of California at Santa Cruz, says synthesis design exceeds chess in complexity: "There are more functional groups than chess pieces, more kinds of reactions than chess moves, and it is harder to recognize a good synthesis than to recognize a checkmate."

Computers determine all possible combinations in structural analyses, so it is natural for chemists to seek their help for synthesis. Researchers first tried machines for this purpose in the 1960s. Seeking to systematize procedures for planning organic syntheses, Wipke and Harvard University chemist E. J. Corey realized the potential of applying computers to the task. (See "Choosing Chemical Routes," Mosaic, Volume 5, Number 4.) By 1969, they had developed a program that later became known as Lhasa, for Logic and Heuristics Applied to Synthetic Analysis. The program considers all routes to a target molecule, rejects unworkable ones, and presents the most promising paths to chemists for their evaluation.

In the Lhasa system, a user's light-pen diagram of the target appears on one screen; a second screen displays a list of strategies for taking the molecule apart chemically. The strategies use transforms, the reverse of chemical reactions that lead to synthesis. The program works backward to starting reactions with easily available chemicals. The chemist chooses a strategy that he believes will lead to the greatest simplification of the target. Lhasa then implements the strategy using transforms stored in its library. Often, more than one transform will simplify a molecule, so Lhasa presents a collection of reactions that lead to substructures one chemical step away from the desired compound. The user rejects or accepts these offspring and then repeats the sequence of steps, treating each substructure as a new target. The result is a synthesis tree showing various routes from a molecule that may never before have been made to familiar reactions using off-the-shelf chemicals.

"An analysis of reaction paths to synthesize a molecule can be done in 20 minutes with a computer," Corey says. "Without a computer it could take days or weeks. Some chemists might not be able to do it at all." Corey's program also counters the so-called eureka syndrome, the temptation of chemists working with pencil and paper to select the first workable route that is found instead of searching for other, possibly more efficient, routes.

Although Lhasa has been under development since 1969, chemists do not yet employ it routinely for synthesis planning. Several universities use a teaching version, however, to demonstrate the principles of synthetic organic chemistry. "The program is available in 12 or 14 places, including drug and research-oriented chemical companies," Corey says. Before it becomes useful on an everyday basis, he points out, these companies must

MOSAIC January/February 1983


build or have access to libraries of transforms that will do the kinds of syntheses they wish to do.
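
The backward search described above, in which each precursor becomes a new target until only off-the-shelf chemicals remain, can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not Lhasa's actual method: molecules are plain strings, and the transform library `TRANSFORMS`, the stock list `AVAILABLE`, and the function `routes` are all hypothetical names invented here. Where Lhasa lets the chemist accept or reject offspring at each step, this sketch simply keeps every route that ends in available starters.

```python
# Toy retrosynthetic search: work backward from a target to available starters.
# All names and data are hypothetical and exist only for illustration.

# Each entry lists the transforms that apply to a molecule, together with
# the precursors that lie one chemical step away from it.
TRANSFORMS = {
    "target": [
        ("transform_A", ["intermediate_1", "starter_x"]),
        ("transform_B", ["intermediate_2"]),
    ],
    "intermediate_1": [("transform_C", ["starter_y", "starter_z"])],
    "intermediate_2": [],  # no known transform: a dead end
}

AVAILABLE = {"starter_x", "starter_y", "starter_z"}


def routes(molecule, depth=5):
    """Return every route, as a list of (transform, product) steps,
    that builds `molecule` from off-the-shelf starters."""
    if molecule in AVAILABLE:
        return [[]]  # nothing to make: one empty route
    if depth == 0:
        return []  # abandon overly long routes
    found = []
    for name, precursors in TRANSFORMS.get(molecule, []):
        # Every precursor must itself be reachable; combine their sub-routes.
        partial = [[]]
        for precursor in precursors:
            sub_routes = routes(precursor, depth - 1)
            partial = [a + b for a in partial for b in sub_routes]
        found.extend(r + [(name, molecule)] for r in partial)
    return found


all_routes = routes("target")
for route in all_routes:
    print(" -> ".join(f"{step}({product})" for step, product in route))
```

In this sketch the branch through `transform_B` is dropped automatically because its precursor cannot be traced back to available chemicals, a small-scale version of rejecting unworkable routes.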

Working forward

Another synthesis program entering the market works forward from known starters rather than backward from the goal product. Chemists Timothy D. Salatin and William L. Jorgensen of Purdue University created Cameo, the Computer-Assisted Mechanistic Evaluation of Organic Synthesis, which predicts the results of chemical reactions given starting materials and certain other conditions.

Cameo takes the starting materials listed on a display screen and applies to them a menu of reagents and mechanisms common to a particular class of reactions. The system assigns a number to each product resulting from a reaction, and it displays a tree showing each one as a branch growing from a trunk, the starting materials. By calling up each number, a chemist can display the structure of the product and submit it to further reactions. Products of these reactions also are numbered and placed on the tree, so the user can see a sequence from beginning to end.

Cameo employs a separate module for each of the many molecular classes that Jorgensen referred to as "the guts of synthetic organic chemistry." In evaluating proposed reactions for feasibility, "a chemist who might have wasted a day in the lab on an ill-conceived reaction now can find potential problems in five minutes," he says. "Cameo allows its users to predict what will be in the pot at the end of a synthesis: whether or not the reactions produce low yields or undesirable by-products."

The program also produces original reactions and products. Jorgensen and his co-workers tested it by asking it to predict the results of reactions that yield well-known products. The system not only did this, but it also predicted unexpected products whose existence was later confirmed. These results are encouraging, but whether Cameo is up to the task of routine synthesis planning or will become a vital cog in a black box system remains to be demonstrated.

A third synthesis program, developed by Wipke and called Secs, for Simulation and Evaluation of Chemical Synthesis, has been used by industry since 1973. It helps chemists design and select syntheses of biologically important molecules. Like Lhasa, it works backward from the target compound. "Secs works in two worlds,"



Wipke explains, "a world of reactions and a world of strategy concerned with the architectural principles of building molecules." The chemist works in the strategy world to create a plan that contains structural changes required to break up a target molecule into chemically reproducible units. This plan, for example, might call for breaking certain bonds or leaving specific substructures intact. The program then searches the reaction world and selects only those reactions that satisfy the plan. "Strategies can be developed without [the chemist] knowing the content of the transform library, and new transforms can be added without altering strategies," Wipke points out. "In Lhasa, no such separation exists; new reactions require a reordering of steps in a strategy sequence, which is difficult to do."

Wipke's system is used by a consortium of seven of the largest pharmaceutical companies in West Germany and Switzerland. They share a large library of reactions that is closed to other organizations. Drug and chemical corporations in Japan and Sweden use Secs, as does a government laboratory in Australia. In the United States, Secs is available through Stanford, the University of Pennsylvania medical school, ADP Network Computers, Merck and Company, and several other organizations.

To illustrate Secs in use, Wipke tells the story of a Swedish company that manually planned the synthesis of "a relatively simple molecule," but could not get the process to work properly. They gave the problem to Secs, which generated what he calls "an obvious solution but one that had not been considered. The Swedish chemists laughed at first, then after some consultation decided that it would work." The company now produces kilogram quantities of the chemical by the Secs-generated route.

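The separation Wipke describes, in which a plan is written without reference to the transform library and the library can grow without touching the plan, can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the `Transform` and `Plan` classes, the bond and substructure labels, and the rule that a transform "satisfies the plan" when it breaks a requested bond and disturbs no protected substructure. It does not reproduce how the actual program represents strategies or reactions.

```python
# Toy sketch of keeping a strategy plan separate from a transform library.
# Classes, bond labels, and substructure labels are all hypothetical.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transform:
    name: str
    breaks: frozenset    # bonds this transform would break
    disturbs: frozenset  # substructures this transform would alter


@dataclass
class Plan:
    must_break: set = field(default_factory=set)
    keep_intact: set = field(default_factory=set)

    def allows(self, transform):
        """A transform satisfies the plan if it breaks at least one requested
        bond and leaves every protected substructure alone."""
        breaks_wanted = bool(transform.breaks & self.must_break)
        harms_protected = bool(transform.disturbs & self.keep_intact)
        return breaks_wanted and not harms_protected


LIBRARY = [
    Transform("t1", frozenset({"C1-C2"}), frozenset()),
    Transform("t2", frozenset({"C1-C2"}), frozenset({"ring_A"})),
    Transform("t3", frozenset({"C5-O"}), frozenset()),
]

# The plan is stated without looking at the library.
plan = Plan(must_break={"C1-C2"}, keep_intact={"ring_A"})
selected = [t.name for t in LIBRARY if plan.allows(t)]

# A new transform is added to the library; the plan is left unchanged.
LIBRARY.append(Transform("t4", frozenset({"C1-C2", "C3-N"}), frozenset()))
selected_after = [t.name for t in LIBRARY if plan.allows(t)]

print(selected, selected_after)
```

The plan object never names a transform, so extending the library changes what is selected without any edit to the strategy, which is the property the quotation contrasts with a fixed strategy sequence.
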
Computerized metabolites

Wipke used parts of Secs to construct a program called Xeno for predicting the biological activity of foreign compounds, or xenobiotics, such as pesticides, drugs, and other chemicals not normally found in the body. "In many cases," he says, "toxicity, carcinogenicity, or another activity is due not directly to a foreign substance but to one or more of its metabolites. This program identifies known and unknown metabolites. It determines their mechanism of formation, activity, and ultimate fate, tests models of metabolism, and aids researchers in planning experiments."

The program exposes a foreign compound to every biotransformation known to exist in the species under study. It generates separate lists of metabolites for mice, rats, and humans. Because Xeno does not take into account factors such as excretion and membranes that block transport, it generates more metabolites than an organism would. "However, overprediction is less of a problem than missing metabolites," Wipke says. The program does not predict quantities, but it does provide information on the importance of each biotransform and on the likelihood of its occurrence.

Predictions by Xeno have been tested successfully against metabolites known from the literature. "It also has found compounds unknown to researchers," Wipke relates. In 1978, he ran a demonstration on the TCDD molecule, then considered the toxic compound in Agent Orange. "Everyone scoffed at the list of metabolites generated by the program because, they said, 'TCDD is not metabolized,'" Wipke recalls. But, he reports, in 1981 chemists found that TCDD is indeed metabolized, and some of the metabolites predicted by Xeno have been identified experimentally. Wipke solicits other problems for and evaluations of Xeno as part of his effort to coax the program from what he calls a "basic research stage" into the world of everyday chemistry.

While designing and perfecting the inner cogs and gears of future black boxes, researchers also want to organize all the knowledge of organic chemistry into a structured form that fits neatly into these boxes. Jorgensen's students have begun organizing information from the literature into algorithms for computer use. "Our eventual goal," he states, "is to systematize chemists' understanding into rules for teaching and doing chemistry."

Wipke's effort involves writing programs "to look at molecules the way a human does." The first step, as he sees it, is to formalize the principles of chemistry so that they can be tested, somewhat like proving theorems in geometry. Included would be both the principles explicitly stated in textbooks and those that exist implicitly in the heads of chemists. The next step is to develop a system for machine representation of these principles. The final step is to build into a computer system the capacity to use the principles to reason and solve problems: to do what a skilled chemist does intuitively.