Wait, that's not a double helix!


A DNA-based app idea

Like many a coder, I've been looking for an excuse to write a little iPhone app for some time. Recently I got interested in genetics and discovered that thanks to 23andMe, there was a way I could put some of the genetics I had learned to use in a simple app.

23andMe offer three products, bundled as one:

  1. Genotyping: Using a saliva sample, they determine your mutations at nearly a million locations in your DNA and make these digitally available to you.
  2. Ancestral interpretation of your genotype: Using the output from step 1, they provide you with extremely detailed information about your ancestry (and, if both of you choose, the existence of any relatives who are also customers).
  3. Health implications of your genotype: Using the output from step 1, they provide you with health information like decreased / increased disease risk or drug sensitivity as well as phenotype information such as what type of muscle fibres you have etc. (Unfortunately they are currently not allowed to offer this service to new customers due to an unresolved negotiation with the FDA.)

Each of these three products is valuable but IMHO the output from the first, your genotype, is a goldmine of potential applications. Furthermore, 23andMe actively encourage third parties to offer further services to their customers by providing an API, which means customers can cleanly avail of such services (or choose not to do so). The API uses secure OAuth 2.0 authentication and is easy to use for all involved.

Look mom, I'm in the App Store!

After mulling a few ideas, eventually I decided that I would create a game that uses the user's DNA to provide an experience that would be unique to each person. I decided to impose the following constraints on this idea:

An obvious style of game satisfying the above constraints is a simulated card game in the style of the classic Top Trumps® school-yard game. I decided to provide the player with one card for each type of chromosome (thus 24 cards total) and that each card should carry five scores expressing properties of the DNA in that chromosome. Without further ado then, here are some screenshots of what I ended up with:

Click to view in App Store / iTunes

Evidently I needed to decide what the five scores for each chromosome type were to be. I imposed the following constraints:
  1. A score must not permit an interpretation that can allow the conclusion that any person has "better" or "worse" DNA than anybody else.
  2. It must be possible to express a score meaningfully as a number between 1 and 100.
  3. Ideally, a score should contain meaningful (and hopefully interesting) information about the chromosome's DNA.
  4. A score must be calculable efficiently using data in the public domain.
  5. The values of each score must vary significantly amongst chromosome types (so that cards carry different values).
  6. Scores should express independent information (e.g., they should not correlate strongly with each other).
In order to describe the five scores on which I settled, it is necessary to recall some basic facts about DNA.

A primer on DNA

If, like me a couple of months ago, you know nothing about DNA and you have time and no objection to reading A-level material, then I would recommend the following notes as a starting point, followed by Brian Hayes's superb American Scientist article on the history of the genetic code. In any case I will try to sketch out the fragments we need below.

Your body contains many cells. Almost all of these cells contain a compartment called the nucleus. Inside the nucleus are 46 molecules known as your nuclear chromosomes. These 46 molecules come in 23 pairs that have the same structure (except for the sex chromosome pair in males). Finally there is a 47th, much smaller, molecule stored outside the nucleus of your cells in compartments called the mitochondria. This is your mitochondrial chromosome. Your nuclear chromosomes are huge molecules, so big in fact, that they can be seen through an optical microscope (when stained and adequately prepared):

Optical micrographs of human chromosome pairs (karyogram)

One of the reasons chromosomes are so important is that they are the molecules of heredity. Apart from much less significant epigenetic mechanisms, the reason you have many traits in common with your parents is that large chunks of your chromosomes (genes) are just copies of corresponding regions in your parents' chromosomes. Indeed one member of each of your 23 nuclear chromosome pairs comes from your mother and one comes from your father. Each of the 23 nuclear chromosomes that your mother provided is a mixture of the two members of her corresponding chromosome pair and likewise with your father. Your final chromosome, the mitochondrial chromosome, is an exact copy of your mother's mitochondrial chromosome (assuming this or something else very unlikely did not occur).

A wonderful thing about chromosomes is that they are linear molecules: their primary structure is determined entirely by the sequence of bases that occur. These bases are relatively simple molecules and only four occur in your chromosomes: adenine (A), cytosine (C), guanine (G) and thymine (T).

The primary structure of your chromosomes is thus determined entirely by a long string over a four-character alphabet. Except for mutations (and ageing effects etc), these 47 strings are the same for all of the cells in your body and are what people are talking about when they speak of "somebody's DNA". In fact each chromosome contains two such long strings called the positive and negative strands but each strand determines the other according to the rule: "interchange A and T as well as C and G". This is why people often speak of "base pairs". Each strand provides one half of the rungs in DNA's double helix. We focus on one strand and regard a chromosome as a long string of characters, each of which is drawn from the alphabet A, C, G, T. Here is a picture that gives the idea:

Schematic diagram of DNA base sequences in a chromosome
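
The pairing rule quoted above ("interchange A and T as well as C and G") is easy to express in code. Here is a minimal Python sketch; the function name is mine, and it ignores the direction in which each strand is conventionally read:

```python
# Pairing rule from the text: A <-> T and C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the opposite strand, base by base, under the pairing rule."""
    return strand.translate(COMPLEMENT)


print(complement("GATTACA"))  # CTAATGT
```

Applying the rule twice returns the original string, which is another way of saying that each strand determines the other.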

The different types of chromosomes can be distinguished by their string length and an ordering (which roughly corresponds to decreasing order of length) has been agreed upon so that we can meaningfully speak of chromosome 1, ..., chromosome 22. There is a minor complication with the 23rd chromosomes, the sex chromosomes, which does not matter for us.

The longest, chromosome 1, has a length of about 247 million characters (i.e., bases) and the shortest nuclear chromosome, number 21, has a length of about 47 million characters. The length of the mitochondrial chromosome is a mere 16 thousand or so characters and the total length of your 47 chromosomes is about six billion characters. Since each character is drawn from an alphabet of size four, it takes two bits to record each character, so about 750 MB would store one copy of each chromosome type, roughly three billion characters (just over an 80-minute CD's worth of information, as a friend observed), and about 1.5 GB would store all 47 strings.
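
As a quick check of the storage arithmetic (a sketch only; the function name is mine): at two bits per base, about three billion bases, i.e. one copy of each chromosome type, come to 750 MB, and all six billion come to about 1.5 GB.

```python
BITS_PER_BASE = 2  # a four-letter alphabet needs log2(4) = 2 bits per letter


def storage_megabytes(num_bases: int) -> float:
    """Size in MB (10**6 bytes) of a two-bits-per-base encoding."""
    return num_bases * BITS_PER_BASE / 8 / 1e6


print(storage_megabytes(3_000_000_000))  # 750.0
print(storage_megabytes(6_000_000_000))  # 1500.0
```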

The five scores

Below are descriptions of the five scores I assign to each chromosome in DNA-Trumps. I turn each of these into a score between 1 and 100 by calculating a raw value for each of the 23 chromosome pairs and applying the unique affine transformation that maps the lowest raw value to 1 and the highest to 100. This is an easy way to satisfy conditions 1 and 2 of the constraints I decided each score must satisfy (listed above). I give the card corresponding to the mitochondrial chromosome a maximum score of 100 in each category so that the user has a 'trump' card that they can use to turn the initiative in their favour at some point during gameplay.

For emphasis I want to say the following explicitly. As a result of the affine transformation mentioned above, it is completely meaningless to claim that anybody with higher or lower scores for their chromosomes in any category has "better" or "worse" DNA.
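
The rescaling described above can be sketched in a few lines of Python. This is an illustration rather than the app's actual code: the function name is mine, and the handling of the case where all raw values are equal is an arbitrary choice.

```python
def rescale(raw_values: list[float], low: float = 1.0, high: float = 100.0) -> list[float]:
    """Affine map sending min(raw_values) to `low` and max(raw_values) to `high`."""
    smallest, largest = min(raw_values), max(raw_values)
    if largest == smallest:
        # Degenerate case: no spread to rescale (arbitrary choice, not from the app).
        return [high for _ in raw_values]
    return [low + (v - smallest) * (high - low) / (largest - smallest) for v in raw_values]


print(rescale([10, 20, 30]))  # [1.0, 50.5, 100.0]
```

Because only a person's own lowest and highest raw values fix the scale, the resulting numbers say nothing about how one person's raw values compare with another's.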

Strength (how strong is your DNA)

We have mentioned that DNA has two strands and that if one of the bases in a strand of DNA is A or T, then the corresponding base in the other strand will be T or A (respectively). Furthermore in this case there will be a double hydrogen bond holding the chromosome together at that location. Similarly, if one of the bases in a strand is G or C, then the corresponding base in the other strand will be C or G (respectively) and there will be a triple hydrogen bond. For this reason, the higher the number of GC bonds versus AT bonds in a given region, the stronger is the force holding together that part of the chromosome. The following image (which I have, er, borrowed) should give the idea:

Hydrogen bonds holding two strands of DNA in a chromosome together

Depending on a person's genotype, they can have more AT bonds or more GC bonds. In fact nobody really has "stronger" or "weaker" chromosomes for several reasons, including the fact that over 98% of any two people's DNA is identical. Nevertheless, for the purposes of the game we count how many GC bonds a chromosome has and compare it to the number of AT bonds and so get a measure of each chromosome's "strength".
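
One plausible way to turn this comparison into a raw value is the fraction of bases that are G or C. The sketch below assumes that choice; the exact formula is not spelled out here, and the function name is mine.

```python
from collections import Counter


def gc_fraction(strand: str) -> float:
    """Fraction of A/C/G/T bases that are G or C (the triple-bonded pairs)."""
    counts = Counter(strand.upper())
    gc = counts["G"] + counts["C"]
    at = counts["A"] + counts["T"]
    if gc + at == 0:
        raise ValueError("no A/C/G/T bases found")
    return gc / (gc + at)


print(gc_fraction("GGCATA"))  # 0.5
```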

Weight (how heavy is your DNA)

The four bases that occur in DNA have different molecular masses, as can be seen in the table below:

Base      Relative molecular mass (to nearest amu)
Adenine   135
Cytosine  111
Guanine   151
Thymine   126

Depending on a person's genotype they may have more or less of each base in a given strand of DNA and so by counting these we can get a toy measure of how heavy that strand of their DNA is.

In fact nobody really has "heavier" or "lighter" chromosomes for several reasons. Firstly, as we know, all chromosomes have two strands in which the bases pair up, A with T and C with G, and the total masses of the two kinds of pair are almost identical (each is roughly 261 amu). In addition, over 98% of any two people's DNA is identical, and finally there are many other molecules that also contribute to the DNA mass.

Nevertheless for the purposes of the game, the more of the heavier bases in the positive strand of one of your chromosomes, the higher the score it gets.
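
As a rough sketch of such a toy measure, the snippet below averages the base masses along a strand. Using the mean (rather than the total, which would mostly track chromosome length) is my assumption, as is the function name. The masses are rounded molecular masses of the free bases; cytosine's is 111.10, so 111 to the nearest amu.

```python
# Rounded molecular masses of the free bases, in amu.
BASE_MASS = {"A": 135, "C": 111, "G": 151, "T": 126}


def mean_base_mass(strand: str) -> float:
    """Average mass per base along one strand, ignoring non-ACGT characters."""
    masses = [BASE_MASS[base] for base in strand.upper() if base in BASE_MASS]
    if not masses:
        raise ValueError("no A/C/G/T bases found")
    return sum(masses) / len(masses)


print(mean_base_mass("ACGT"))  # 130.75
```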

IC diversity (how similar/different are your parents' DNAs)

IC diversity stands for inter-chromosomal diversity.

As we know, except for the sex chromosome pair in males, all of the nuclear chromosomes come in pairs with corresponding structure and one member of each pair comes from the mother and one from the father. Furthermore, there is also a region of structural similarity for the sex chromosomes in males even though they do not have the same overall structure. It is thus possible to compare the DNA that a person received from their father to that which they received from their mother for each of their 23 nuclear chromosome pairs.

For the purposes of the game, chromosomes with a bigger difference between maternal and paternal DNA get higher scores. Of course this is not "better" or "worse" in reality, it's just another way to use your DNA in the game.
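
A simple way to measure this difference, sketched below under my own assumptions, is the fraction of tested locations where the two inherited letters differ. The sketch assumes each genotype call arrives as a two-letter string such as "AG", with anything else (for example "--") treated as a missing call; the function name is hypothetical.

```python
def heterozygous_fraction(genotypes: list[str]) -> float:
    """Fraction of valid two-letter calls whose two letters differ."""
    called = [g for g in genotypes if len(g) == 2 and set(g) <= set("ACGT")]
    if not called:
        raise ValueError("no valid genotype calls")
    differing = sum(1 for g in called if g[0] != g[1])
    return differing / len(called)


print(heterozygous_fraction(["AA", "AG", "CT", "GG", "--"]))  # 0.5
```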

Scarcity (how rare is your DNA)

Except for identical twins (or triplets etc), no two people will have the same DNA. This is because there are millions of locations in their DNA where there may be mutations and the chance of anybody else having precisely the same set of mutations is vanishingly small.

Certain combinations of mutations are rarer than others. Some independent mutations will occur in only 1% of people and so if a person has enough of these, then their DNA is very rare indeed. Furthermore, data about which mutations are rare and which are common is publicly available thanks to the HapMap project.

For the purposes of the game, hundreds of thousands of the user's mutations are examined and the HapMap data is used to determine which chromosomes are rarer than others. Rarer chromosomes get higher scores.
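
One way to make "rarer" concrete, offered only as a sketch, is to average the surprisal (negative log-probability) of the user's genotype at each examined location. The input here is a hypothetical list of population frequencies, one per location, for the genotype the user actually carries; treating locations as independent and averaging per location are my assumptions, not a description of the app.

```python
import math


def mean_surprisal(genotype_frequencies: list[float]) -> float:
    """Average of -log2(p) over locations; rarer genotypes give larger values."""
    if not genotype_frequencies:
        raise ValueError("no frequencies supplied")
    if any(not 0 < p <= 1 for p in genotype_frequencies):
        raise ValueError("frequencies must lie in (0, 1]")
    return sum(-math.log2(p) for p in genotype_frequencies) / len(genotype_frequencies)


print(mean_surprisal([0.5, 0.25, 0.125]))  # 2.0
```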

IP diversity (how geographically diverse is your DNA)

IP diversity stands for inter-populational diversity.

The likelihood of many DNA mutations varies according to a person's geographic location (or that of their ancestors). Data that quantifies this variation in likelihood was collected and made publicly available by the HapMap project which includes data for 11 different human populations, corresponding to parts of Africa, America, Asia and Europe. This makes it possible to estimate the probability that the person belongs to each of these 11 populations.

For the purposes of the game, a probability distribution that expresses the likelihood that the user belongs to each of the 11 populations is calculated independently for each nuclear chromosome pair. The raw score given to a chromosome pair is then the entropy of this distribution. (Thus more geographically diverse chromosomes get higher scores.)

In fact the game only uses a tiny subset of the data provided by 23andMe to estimate the probability distributions because otherwise there is not enough variation in score across chromosomes for many people. This is because even the DNA from just one chromosome is often enough to determine a person's heritage very accurately.
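
The entropy calculation described above can be sketched as follows. The inputs are hypothetical per-population log-likelihoods of the user's genotype on one chromosome pair; the uniform prior, the placeholder population labels and the function names are all my assumptions.

```python
import math


def population_posterior(log_likelihoods: dict[str, float]) -> dict[str, float]:
    """Turn per-population log-likelihoods into probabilities (uniform prior)."""
    top = max(log_likelihoods.values())  # subtract the maximum to avoid underflow
    weights = {pop: math.exp(ll - top) for pop, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {pop: w / total for pop, w in weights.items()}


def entropy_bits(probabilities) -> float:
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


# Four equally likely (placeholder) populations give the maximum entropy, log2(4).
equally_likely = {pop: 0.0 for pop in ("pop1", "pop2", "pop3", "pop4")}
print(entropy_bits(population_posterior(equally_likely).values()))  # 2.0
```

A distribution concentrated on a single population has entropy 0, which is why chromosomes that pin down a person's heritage very precisely would all receive similar, low raw values.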

Step three: profit!

If you clicked on the link to the DNA-Trumps app itself, you will probably have noticed I am currently charging a whopping $0.99 (or €0.89 etc.) for it.

It is perhaps optimistic to hope that I will sell more than a handful of copies given that it is available only to customers of 23andMe who own iThings running iOS 7 and who happen to like this quirky idea, but I can dream! Anyway, maybe I shouldn't admit it publicly, but I plan to make the app free eventually (probably in a few months' time). It would be nice if I could at least cover my developer license fee and the cost of the artwork that I licensed from Stockfresh.

Lastly I wanted to send a thank you to Nick Lockwood. As this was my first experience developing for iOS, I was surprised to discover that Apple do not provide a native carousel object and so I was delighted when I came across Lockwood's superb iCarousel library.