Sunday, May 4, 2014

Causality versus correlation

One of the challenges in data analysis is to determine whether correlated data implies causation.  For instance, sales of sunscreen and ice cream may be correlated during the year, but that does not mean that purchase of sunscreen leads to purchase of ice cream or vice versa.

Correlated data means that when one type of data occurs, so does the other, and when one type of data is missing, so is the other. From a logic point of view, consider two statements $P$ and $Q$, and look at when they are true or false in observational data.  Whether each of these 2 statements is true or false can be collected in a truth table.  Let us assume that correlated data means the truth table has the following form (although the case of tautology below shows that this is not always the best assumption):

$P$  $Q$  $P$ and $Q$ are Correlated
0    0    1
0    1    ?
1    0    ?
1    1    1

i.e. we observe that $P$ and $Q$ are either both false or both true.  There are 4 ways to fill in the 2 question marks above.
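
As a quick check of that count, here is a minimal Python sketch that enumerates every way of filling in the two question marks. The names `partial` and `completions` are illustrative only, and `None` is used here as a stand-in for "?".

```python
from itertools import product

# Partially specified truth table: (P, Q) -> value, with None marking a "?".
partial = {(0, 0): 1, (0, 1): None, (1, 0): None, (1, 1): 1}

def completions(table):
    """Yield every table obtained by replacing each None with 0 or 1."""
    unknown = [row for row, value in table.items() if value is None]
    for values in product((0, 1), repeat=len(unknown)):
        filled = dict(table)
        filled.update(zip(unknown, values))
        yield filled

all_tables = list(completions(partial))
rows = list(product((0, 1), repeat=2))  # (0,0), (0,1), (1,0), (1,1)

for t in all_tables:
    # Print the last column of each completed table, top row first.
    print([t[row] for row in rows])
```

The four columns printed are `[1, 0, 0, 1]`, `[1, 0, 1, 1]`, `[1, 1, 0, 1]` and `[1, 1, 1, 1]`, which are the four cases considered in the rest of this post.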

In mathematical logic, the notation $P\rightarrow Q$ is interpreted as $P$ implies $Q$.  The only time this statement is false is when $P$ is true and $Q$ is false.  Logically it is equivalent to $\neg P \vee Q$ and to its contrapositive $\neg Q \rightarrow \neg P$, and it has the truth table:

$P$  $Q$  $P\rightarrow Q$
0    0    1
0    1    1
1    0    0
1    1    1
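
The two equivalences just mentioned can be checked by brute force over all four rows. The following is a minimal Python sketch; the helper `implies` is a name chosen for illustration and simply encodes "false only when $P$ is true and $Q$ is false".

```python
from itertools import product

def implies(p, q):
    """Material implication: false only when p is true and q is false."""
    return not (p and not q)

rows = list(product((False, True), repeat=2))

implication = [implies(p, q) for p, q in rows]
disjunction = [(not p) or q for p, q in rows]               # not-P or Q
contrapositive = [implies(not q, not p) for p, q in rows]   # not-Q implies not-P

# All three columns agree on every row of the truth table.
print(implication == disjunction == contrapositive)
```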

However, this must not be interpreted as $P$ causing $Q$, as $P\rightarrow Q$ is still true if $P$ is false and $Q$ is true.  Just because $P$ being true results in $Q$ being true does not mean that $Q$ cannot be true independently of $P$.  The lesson here is that the definition of "$P$ implies $Q$" is different from our definition of "$P$ causes $Q$".

The following truth table of $\neg(P\oplus Q)$, the exclusive-nor of P and Q, is a more appropriate way to capture the meaning of "$P$ causes $Q$".  This is also equivalent to the statement $P$ if and only if $Q$, i.e. $(P\rightarrow Q) \wedge (Q\rightarrow P)$, also written as $P\equiv Q$.

$P$  $Q$  $\neg(P\oplus Q)$
0    0    1
0    1    0
1    0    0
1    1    1
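
The claim that the exclusive-nor coincides with "$P$ if and only if $Q$" can also be verified row by row. This is a small sketch in Python, with `implies` again being an illustrative helper rather than a library function.

```python
from itertools import product

def implies(p, q):
    """Material implication, written as not-p or q."""
    return (not p) or q

rows = list(product((False, True), repeat=2))

xnor = [not (p ^ q) for p, q in rows]                            # negated exclusive-or
biconditional = [implies(p, q) and implies(q, p) for p, q in rows]

# Both columns are true exactly when P and Q have the same truth value.
print(xnor == biconditional)
```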

However, we have a dilemma here.  The same argument can be used to say that this truth table is appropriate to indicate that $Q$ causes $P$. This shows the difficulty of deriving causation solely from observational data.  To determine causality we need a method to apply control, for example by changing the truth value of $P$ and seeing how that affects the truth value of $Q$.
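
To make the idea of control concrete, here is a toy simulation in Python. Everything in it is a hypothetical construction for illustration: world A is assumed to be one where $P$ is set at random and $Q$ copies it, world B one where $Q$ is set at random and $P$ copies it, and the `force_p` argument plays the role of the experimenter setting $P$ by hand.

```python
import random

def world_p_causes_q(rng, force_p=None):
    """Hypothetical world A: P is random (or forced), and Q copies P."""
    p = rng.random() < 0.5 if force_p is None else force_p
    q = p
    return p, q

def world_q_causes_p(rng, force_p=None):
    """Hypothetical world B: Q is random, and P copies Q unless P is forced."""
    q = rng.random() < 0.5
    p = q if force_p is None else force_p
    return p, q

def observed_rows(world, n, seed, force_p=None):
    """Set of distinct (P, Q) pairs seen in n samples from the given world."""
    rng = random.Random(seed)
    return {world(rng, force_p) for _ in range(n)}

# Pure observation: both worlds only ever show P and Q agreeing.
print(observed_rows(world_p_causes_q, 200, seed=0))
print(observed_rows(world_q_causes_p, 200, seed=0))

# Intervention: force P to be true and look at Q.
print(observed_rows(world_p_causes_q, 200, seed=0, force_p=True))
print(observed_rows(world_q_causes_p, 200, seed=0, force_p=True))
```

Under these assumptions the two worlds are indistinguishable from observation alone, since both produce only the rows where $P$ and $Q$ agree. Once $P$ is forced to be true, world A still shows only $(1, 1)$, whereas world B also produces $(1, 0)$, because there $Q$ does not depend on $P$.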

What about the other 2 ways to fill in the question marks? The following truth table corresponds to $Q\rightarrow P$:

$P$  $Q$  $Q\rightarrow P$
0    0    1
0    1    0
1    0    1
1    1    1

This shows that $Q$ implies $P$, but again not necessarily that $Q$ causes $P$.

The following truth table corresponds to a tautology and indicates that $P$ and $Q$ are uncorrelated, since any combination of the truth values of $P$ and $Q$ is possible.

$P$  $Q$  1
0    0    1
0    1    1
1    0    1
1    1    1

Thus the mere fact that we always observe $P$ and $Q$ either both being true or both being false does not mean that they are correlated; it could be that we haven't observed the other combinations yet.
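
To get a feel for how easily this can happen with few observations, consider a purely illustrative assumption: $P$ and $Q$ are independent fair coin flips, so each observation has $P$ and $Q$ agreeing with probability $1/2$. The sketch below (the function name `prob_all_agree` is made up for this example) computes the chance that $n$ independent observations all happen to agree.

```python
from fractions import Fraction

def prob_all_agree(n):
    """Probability that n independent observations all have P == Q,
    assuming P and Q are independent fair coin flips (illustrative assumption)."""
    return Fraction(1, 2) ** n

for n in (1, 5, 10, 20):
    print(n, prob_all_agree(n))
```

With 5 observations the chance is 1/32, and with 10 it is 1/1024, so under this assumption a small sample can look perfectly correlated by accident, while a larger one almost never will.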