Simplified molecular input line entry specification

The simplified molecular input line entry specification, or SMILES, is a grammar and vocabulary for specifying the structure of a chemical substance using only ordinary, teletype-supported characters, such as those of the ASCII code. David Weininger developed SMILES while at the Environmental Research Laboratory of the Environmental Protection Agency in Duluth, Minnesota, United States.

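SMILES is, in essence, a language for specifying a chemical compound. A given compound can usually be written as several equivalent SMILES strings, so a SMILES serves as a unique identifier only once it has been put into a canonical form.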

Advantages
SMILES requires no special constructs, nor any special data type. Every character in it is part of the original specification of the American Standard Code for Information Interchange. A SMILES structure can reside in a database as a "varying character" or "string" data type. Moreover, any plain-text editor can produce a SMILES structure.

SMILES Grammar
As a language, SMILES has a vocabulary and a grammar, with definite rules.

Atoms
Atoms are the nouns of the SMILES grammar. One represents each atom by its chemical symbol. Usually one encloses the symbol in square brackets, like this: [Cl]. However, the following organic subset symbols may appear without the brackets: B, C, N, O, P, S, F, Cl, Br, and I. These include the halogens, which would normally bond to only one other atom in any case, and other atoms that are assumed to be bound to hydrogen if they are not explicitly bound to something else. An atom participating in an aromatic ring structure is listed in lowercase.

The use of the brackets is significant. For example, [S] refers to a bare atom of sulfur, while the symbol S represents hydrogen sulfide, which has two atoms of hydrogen bound to the one of sulfur. (Likewise, Cl-Cl, or simply ClCl, refers to the diatomic molecule of chlorine, while Cl alone refers to hydrogen chloride.)
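The bracket rule can be sketched in a few lines of Python. This is a minimal illustration rather than part of any SMILES toolkit: the table LOWEST_VALENCE and the function implicit_hydrogens are names made up for this example, and the sketch handles only a single isolated atom.

```python
# Lowest normal valence of each organic-subset atom. A lone, unbracketed
# atom is assumed to be filled up to this valence with hydrogen.
LOWEST_VALENCE = {
    "B": 3, "C": 4, "N": 3, "O": 2, "P": 3, "S": 2,
    "F": 1, "Cl": 1, "Br": 1, "I": 1,
}


def implicit_hydrogens(atom):
    """Return the number of implied hydrogens on a one-atom SMILES string."""
    if atom.startswith("[") and atom.endswith("]"):
        # Inside brackets nothing is implied: any hydrogens must be written out.
        return 0
    return LOWEST_VALENCE[atom]


print(implicit_hydrogens("S"))    # unbracketed sulfur
print(implicit_hydrogens("[S]"))  # bracketed sulfur
```

Under these assumptions the unbracketed S carries two implied hydrogens (hydrogen sulfide), while [S] carries none.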

Charges and positions of atoms
Charge signs (+ and -) and digits giving the multiple of a charge or the position of an atom are the adjectives (and sometimes the adverbs) of SMILES grammar. An ionic charge is a classic application. For example, [Fe+2] is the ferrous or iron(II) ion. Note that SMILES neither requires nor uses superscripts or subscripts.

One does not multiply atoms themselves (except for atoms of hydrogen) by using numbers. Instead, one repeats the atomic symbol as many times as the atom appears.
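As a rough illustration of how a charge is read from a bracket atom, the following Python sketch extracts the sign and multiple with a regular expression. The function name formal_charge is hypothetical, and the handling of a repeated sign (as in [Fe++]) assumes the older shorthand that some SMILES dialects accept as equivalent to [Fe+2].

```python
import re


def formal_charge(bracket_atom):
    """Read the formal charge from a bracket atom such as "[Fe+2]" or "[Cl-]"."""
    # The charge, if present, is a run of signs and optional digits just before "]".
    match = re.search(r"([+-]+)(\d*)\]$", bracket_atom)
    if match is None:
        return 0  # no sign written, so the atom is neutral
    signs, digits = match.groups()
    sign = 1 if signs[0] == "+" else -1
    # "+2" gives 2; a bare "+" gives 1; the repeated form "++" gives 2.
    magnitude = int(digits) if digits else len(signs)
    return sign * magnitude


print(formal_charge("[Fe+2]"))
print(formal_charge("[Cl-]"))
```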

Bonds
Bonds are the verbs of the SMILES grammar. The symbols for bonds are simple:

 * - for a single bond
 * = for a double bond
 * # for a triple bond
 * : for an aromatic bond

To simplify things even further, one may omit the - and : symbols for atoms that are adjacent to one another and have single or aromatic bonds joining them. This is the reason for representing an aromatically bound atom in lowercase instead of in UPPERCASE.

Thus the SMILES for diatomic oxygen is O=O; that for carbon dioxide is O=C=O; for diatomic nitrogen, N#N; for hydrogen cyanide, C#N; for acetylene or ethyne, C#C; for diazene, N=N. (Hydrazine, which has only a single bond between its two atoms of nitrogen, is NN.)
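The bond rules above can be sketched as a small Python routine that walks an unbranched, non-cyclic, non-aromatic SMILES and reports each bond with its order. The names BOND_ORDER, TOKEN and chain_bonds are invented for this example; characters the pattern does not recognize are silently skipped, so this is an illustration and not a validating parser.

```python
import re

BOND_ORDER = {"-": 1, "=": 2, "#": 3}
# Two-letter symbols come first so that "Cl" is not read as "C" followed by "l".
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#]")


def chain_bonds(smiles):
    """List (atom, atom, order) for each bond in a simple linear SMILES."""
    bonds = []
    previous = None  # symbol of the most recent atom
    pending = 1      # order of the next bond; single when no symbol is written
    for token in TOKEN.findall(smiles):
        if token in BOND_ORDER:
            pending = BOND_ORDER[token]
        else:
            if previous is not None:
                bonds.append((previous, token, pending))
            previous = token
            pending = 1
    return bonds


print(chain_bonds("O=C=O"))
print(chain_bonds("C#N"))
```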

Branches
Branches are the subordinating conjunctions of the SMILES grammar. A structure that branches from the main line is enclosed in parentheses. Nesting and stacking of branches is permitted. A substituent at the end of a chain needs no parentheses, although enclosing it in them is still valid. Thus the SMILES for chloromethane (also called "methyl chloride") is CCl, which C(Cl) describes equally well, and that for tetrachloromethane ("carbon tetrachloride") is ClC(Cl)(Cl)Cl or, equivalently, C(Cl)(Cl)(Cl)(Cl).

Carboxylic acids are a common branching structure. The SMILES for acetic acid, for example, is CC(=O)O.
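One common way to read branches is with a stack: an opening parenthesis remembers the atom to return to, and a closing parenthesis goes back to it. The Python sketch below follows that idea for SMILES without rings or brackets. The names TOKEN, BOND_ORDER and branch_bonds are hypothetical, and atoms are numbered in the order in which they are written.

```python
import re

TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#()]")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def branch_bonds(smiles):
    """Return (atoms, bonds) for a branched SMILES; bonds are (i, j, order)."""
    atoms, bonds = [], []
    stack = []       # atoms to return to when a branch closes
    previous = None  # index of the atom the next atom will bond to
    pending = 1      # order of the next bond; single unless a symbol says otherwise
    for token in TOKEN.findall(smiles):
        if token == "(":
            stack.append(previous)
        elif token == ")":
            previous = stack.pop()
        elif token in BOND_ORDER:
            pending = BOND_ORDER[token]
        else:
            atoms.append(token)
            index = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, index, pending))
            previous = index
            pending = 1
    return atoms, bonds


print(branch_bonds("CC(=O)O"))
```

For acetic acid this gives the atoms C, C, O, O, with the second carbon (index 1) single-bonded to the first carbon, double-bonded to one oxygen, and single-bonded to the other.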

Rings
To write a cyclic or ring structure, you "break" one of the bonds and write the structure as a line having digits following the atoms in the broken bond. Thus the SMILES for cyclohexane is C1CCCCC1. If a given atom is part of more than one ring structure, and you have to break more than one bond, you then use a different digit for each broken bond, in order to convey how to re-join the atoms.

By convention, aromatic ring vertices are written in lowercase. Thus the SMILES for benzene is c1ccccc1 and that for pyridine is n1ccccc1.
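The ring-closure digits can be resolved with a small lookup table: the first time a digit appears it is remembered with its atom, and the second time it closes a bond back to that atom. The Python sketch below assumes single-digit ring labels and plain (unbracketed, unbranched) atoms; the names TOKEN and ring_bonds are made up for the example, and bond orders are left out for brevity.

```python
import re

TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]|\d")


def ring_bonds(smiles):
    """Return (atom count, bonds) for a SMILES made of plain atoms and ring digits."""
    bonds = []
    open_rings = {}  # ring digit -> index of the atom that opened it
    count = 0        # number of atoms read so far
    for token in TOKEN.findall(smiles):
        if token.isdigit():
            current = count - 1  # a digit belongs to the atom just before it
            if token in open_rings:
                bonds.append((open_rings.pop(token), current))
            else:
                open_rings[token] = current
        else:
            if count > 0:
                bonds.append((count - 1, count))  # bond to the preceding atom
            count += 1
    if open_rings:
        raise ValueError("unclosed ring label(s): " + ", ".join(sorted(open_rings)))
    return count, bonds


print(ring_bonds("C1CCCCC1"))
```

For cyclohexane the sketch reports six atoms and six bonds, the last of which, (0, 5), is the "broken" bond restored by the digit 1.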

Disconnected Structures
A simple dot (.) serves as the most common example of a coordinating conjunction in SMILES. Two structures not having a covalent bond of any kind to join them are considered disconnected, and are joined with a dot. This is the proper method for representing ionic compounds. For example, the SMILES for sodium chloride is [Na+].[Cl-]. The SMILES for sodium acetate is [Na+].CC(=O)[O-]; the charge is written on the oxygen atom that carries it, because a pair of square brackets encloses exactly one atom.
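Because the dot never appears inside a structure, splitting on it recovers the disconnected parts. The Python sketch below does that and also adds up the bracketed charges, which should come to zero for a neutral salt. The function names components and net_charge are hypothetical, and only charges written as a sign with an optional digit count (or as a repeated sign) are recognized.

```python
import re


def components(smiles):
    """Split a SMILES into its disconnected structures."""
    return smiles.split(".")


def net_charge(smiles):
    """Sum the formal charges written inside square brackets."""
    total = 0
    # Each match is the sign run and optional digits that end a bracket atom.
    for signs, digits in re.findall(r"\[[^\]]*?([+-]+)(\d*)\]", smiles):
        magnitude = int(digits) if digits else len(signs)
        total += magnitude if signs[0] == "+" else -magnitude
    return total


print(components("[Na+].[Cl-]"))
print(net_charge("[Ca+2].[Cl-].[Cl-]"))
```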

Isomeric SMILES
This refers to an extended SMILES grammar designed to express isotopes and stereoisomers, that is, the differing spatial arrangements around a double bond or about a tetrahedral center. (Triple bonds have no varying orientation, and rotation about single bonds is free, so the branching rules described above suffice for them.)

Isotopes
A number that precedes an atomic symbol always refers to an isotope of the relevant element. The number is the atomic mass number of that isotope. One must always enclose an atomic symbol having an isotope specifier in square brackets, even if it belongs to the organic subset. Thus the SMILES for 13C methane is [13CH4].
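Since an isotope label is always a number directly after an opening bracket, it is easy to pick out with a regular expression. The following Python sketch lists the labeled atoms in a SMILES; the function name isotope_labels is invented for this illustration.

```python
import re


def isotope_labels(smiles):
    """Return (mass number, symbol) for every isotope-labeled bracket atom."""
    # Digits immediately after "[" are the mass number; the symbol follows.
    found = re.findall(r"\[(\d+)([A-Z][a-z]?|[a-z])", smiles)
    return [(int(mass), symbol) for mass, symbol in found]


print(isotope_labels("[13CH4]"))    # carbon-13 methane
print(isotope_labels("[2H]O[2H]"))  # heavy water
```

Bracket atoms without a leading number, such as [Na+], are not reported, because they carry no isotope label.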

Double bond configuration
The symbols / and \ represent directional single bonds attached to doubly bound atoms, most commonly atoms of carbon. (Doubly bound nitrogen can show the same kind of isomerism, as in oximes and azo compounds.) Each symbol has meaning only in relation to the symbol on the other side of the same double bond. This is a substitute for the prefixes cis- (on this side of) and trans- (across) to describe orientation around a double bond. For example, the SMILES for trans-dichloroethene is Cl/C=C/Cl, but that for cis-dichloroethene would be Cl/C=C\Cl. One can give a complete specification of all the double-bond orientations, or a partial specification that specifies one set of directions but not another.
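For the simple pattern used in these examples, with one substituent written before the first doubly bound atom and one after the second, the rule reduces to comparing the two direction symbols: matching symbols mean trans, differing symbols mean cis. The Python sketch below applies only to that pattern; the function name double_bond_geometry is hypothetical, and anything else is reported as "unspecified". Note that a backslash must be doubled inside an ordinary Python string literal.

```python
def double_bond_geometry(smiles):
    """Classify a SMILES of the form X/C=C/Y as "cis", "trans" or "unspecified"."""
    equals = smiles.find("=")
    marks = [(i, ch) for i, ch in enumerate(smiles) if ch in "/\\"]
    if equals == -1 or len(marks) != 2:
        return "unspecified"
    (first_pos, first), (second_pos, second) = marks
    # One direction symbol must sit on each side of the double bond.
    if not first_pos < equals < second_pos:
        return "unspecified"
    return "trans" if first == second else "cis"


print(double_bond_geometry("Cl/C=C/Cl"))
print(double_bond_geometry("Cl/C=C\\Cl"))
```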

Configuration About a Tetrahedral Center
The symbols @ and @@ represent two special "adverbs" in SMILES grammar: they describe the orientation of the different atoms or branch structures about a single central atom, usually an atom of carbon. Looking from the first neighbor toward the central atom, @ means that the remaining neighbors appear anticlockwise in the order written, and @@ means that they appear clockwise. For example, valid SMILES listings for alanine are N[C@@H](C)C(=O)O and N[C@H](C)C(=O)O. Each of these two specifications names a different enantiomer of alanine: the first is L-alanine and the second is D-alanine.
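Because @ and @@ depend on the order in which the neighbors are written, rewriting the same molecule with its neighbors in a different order may require the other symbol. Exchanging two neighbors swaps @ and @@, so an odd permutation flips the symbol and an even one keeps it. The Python sketch below applies that parity rule; the function name chirality_after_reorder and the neighbor labels are made up for the example, and the labels must all be distinct.

```python
def chirality_after_reorder(mark, old_order, new_order):
    """Return the @ or @@ mark that keeps the same center after re-listing neighbors."""
    # Position of each neighbor in the old listing, taken in the new order.
    positions = [old_order.index(neighbor) for neighbor in new_order]
    # Count inversions to get the parity of the permutation.
    inversions = sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if positions[i] > positions[j]
    )
    if inversions % 2 == 0:
        return mark
    return "@" if mark == "@@" else "@@"


# N[C@@H](C)C(=O)O lists the neighbors as N, H, CH3, COOH.
old = ["N", "H", "CH3", "COOH"]
# Writing the carboxyl group as the branch lists them as N, H, COOH, CH3.
new = ["N", "H", "COOH", "CH3"]
print(chirality_after_reorder("@@", old, new))
```

Under this rule N[C@@H](C)C(=O)O and N[C@H](C(=O)O)C describe the same enantiomer, since the two listings differ by a single exchange.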

General Chiral Configuration
SMILES has highly specific rules for specifying tetrahedral, allene-like, square-planar, trigonal-bipyramidal, octahedral, and other orientations. For details, see the Daylight reference listed under Related References.

SMILES Specifications for Reactions
SMILES can handle certain reactions that involve the displacement of one atom by another, or a change in the molecule's structure. The special conjunctions > and >> separate the reactant(s) and the product(s). The complete syntax is as follows:


 * 1) To specify reactants and products only, write [A].[B]>>[C].[D], where A, B, C, and D are separate and disconnected participants in the reaction. Thus to specify the neutralization of acetic acid by sodium hydroxide or "caustic soda," the SMILES would be CC(=O)O.[Na+].[OH-]>>CC(=O)[O-].[Na+].O. In English, that reads "Acetic acid reacts with caustic soda to yield sodium acetate and water." Note that the SMILES representation of this reaction clearly shows that the sodium ion itself does not change in any way. Rather, acetic acid loses a hydrogen ion to become acetate ion, and hydroxide ion picks up that hydrogen ion to become water.
 * 2) If an agent participates in the reaction without undergoing any chemical change, the syntax is [A]>[B]>[C]. Here, A, B, and C represent reactant(s), agent(s), and product(s). (Where a category holds more than one molecular or ionic species, they are connected by dots; see above for "Disconnected structures.") An agent could be a solvent or a catalyst.

Note that the > symbol never appears in a molecular SMILES, and that a reaction SMILES always contains exactly two > characters; in the >> form the agent field between them is simply empty.
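The reaction syntax above lends itself to a two-step split: first on > into reactants, agents and products, then on the dot into individual species. The Python sketch below illustrates this; the function name parse_reaction and the dictionary keys are choices made for the example. The hydrogenation of ethene over platinum is used as the agent example, and the neutralization is written with the acetate charge on its oxygen atom.

```python
def parse_reaction(reaction_smiles):
    """Split a reaction SMILES into lists of reactants, agents and products."""
    fields = reaction_smiles.split(">")
    if len(fields) != 3:
        raise ValueError("a reaction SMILES needs exactly two '>' characters")

    def species(field):
        # An empty field (as between the two signs of ">>") yields no species.
        return [part for part in field.split(".") if part]

    reactants, agents, products = (species(field) for field in fields)
    return {"reactants": reactants, "agents": agents, "products": products}


print(parse_reaction("CC(=O)O.[Na+].[OH-]>>CC(=O)[O-].[Na+].O"))
print(parse_reaction("C=C.[H][H]>[Pt]>CC"))
```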

Related References

 * Weininger, D. "SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules." J. Chem. Inf. Comput. Sci. 1988, 28, 31-36. This is the classic paper.
 * SMILES - A Simplified Chemical Language from Daylight Chemical Information Systems, Inc.
 * SMILES by Wikipedia.
