Let's start with some basics:

1. The transition probabilities are given as input data. We can write this compactly as a matrix. Note that the columns of this matrix sum to 1 (why?), but that the rows do not (why not?).

2. What about the marginal probabilities? (These can be found several ways.) Let's consider one case in detail; the others follow similarly.

3. What about the joint distribution, $P(X,Y)$? Working entry by entry, and noting that rows correspond to $X$ and columns to $Y$, we obtain the joint matrix. Also, note that the marginal $P(X)$ can be computed by summing the rows of $P(X,Y)$. Similarly, $P(Y)$ can be computed by summing the columns of $P(X,Y)$.

4. Next, we would like to compute the conditional distribution $P(Y \mid X)$. Note that its rows sum to 1, but its columns don't (why?).

Now we can begin our entropy-related calculations:

5. First, the marginal entropies, using $H(X) = -\sum_x p(x)\log_2 p(x)$ (and similarly for $H(Y)$).

6. Why is that?

7. Next, let's calculate the conditional entropy:
$$H(Y \mid X) = \sum_x p(x)\, H(Y \mid X = x).$$

8. What about the joint entropy,
$$H(X,Y) = -\sum_{x,y} p(x,y)\log_2 p(x,y)?$$
But we could also calculate this using the chain rule:
$$H(X,Y) = H(X) + H(Y \mid X).$$

9. Now, we want to compute the less intuitive $H(X \mid Y)$:
Method 1: Straightforward application of the basic equation,
$$H(X \mid Y) = -\sum_{x,y} p(x,y)\log_2 p(x \mid y).$$
Method 2: Using intuition :) First compute $H(X \mid Y = y)$ for each value $y$; then
$$H(X \mid Y) = \sum_y p(y)\, H(X \mid Y = y)$$
is just the (weighted) average of these.

10. Finally, from Fano's inequality,
$$H(P_e) + P_e \log_2(|\mathcal{X}| - 1) \ge H(X \mid Y).$$
An error occurs when the estimate $\hat{X} \ne X$. Substituting the values of $H(X \mid Y)$ and $P_e$ computed above verifies that the bound holds.

11. While we are at it, let's check a few more relationships involving mutual information:
$$I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X,Y).$$

Example: Discrete Memoryless Channel (Binary Erasure Channel)
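Steps 2-4 above (marginals by summing rows and columns of the joint matrix, then conditionals by normalizing) can be sketched in Python. The joint distribution below is a made-up placeholder, since the example's actual numbers are not reproduced here:

```python
from fractions import Fraction as F

# Hypothetical joint distribution p(x, y); rows index X, columns index Y.
# (Made-up numbers -- not the values from the worked example above.)
P_xy = [[F(1, 8),  F(1, 16), F(1, 32), F(1, 32)],
        [F(1, 16), F(1, 8),  F(1, 32), F(1, 32)],
        [F(1, 16), F(1, 16), F(1, 16), F(1, 16)],
        [F(1, 4),  F(0),     F(0),     F(0)]]

# Marginals: P(X) by summing each row, P(Y) by summing each column.
p_x = [sum(row) for row in P_xy]
p_y = [sum(col) for col in zip(*P_xy)]

# Conditionals: p(y|x) = p(x,y)/p(x), so the rows of P(Y|X) sum to 1;
#               p(x|y) = p(x,y)/p(y), so the columns of P(X|Y) sum to 1.
P_y_given_x = [[pxy / px for pxy in row] for row, px in zip(P_xy, p_x)]
P_x_given_y = [[pxy / py for pxy, py in zip(row, p_y)] for row in P_xy]
```

Using exact rationals (`fractions.Fraction`) makes the row and column sums come out exactly 1, which is handy for the "(why?)" sanity checks.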
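Likewise, the entropy calculations of steps 5-9 (marginal entropies, the chain-rule check on the joint entropy, the two methods for $H(X \mid Y)$) and the mutual-information identities of step 11 can be sketched, again with a hypothetical joint distribution rather than the example's actual numbers:

```python
from math import log2

# Hypothetical joint p(x, y) (rows = X, columns = Y); made-up numbers,
# not the values from the worked example above.
P_xy = [[1/8,  1/16, 1/32, 1/32],
        [1/16, 1/8,  1/32, 1/32],
        [1/16, 1/16, 1/16, 1/16],
        [1/4,  0,    0,    0]]

p_x = [sum(row) for row in P_xy]
p_y = [sum(col) for col in zip(*P_xy)]

def H(dist):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(p * log2(p) for p in dist if p > 0)

H_x, H_y = H(p_x), H(p_y)
H_xy = H([p for row in P_xy for p in row])  # joint entropy H(X,Y)

# H(Y|X) = sum_x p(x) H(Y | X = x): the p(x)-weighted row entropies.
H_y_given_x = sum(px * H([pxy / px for pxy in row])
                  for row, px in zip(P_xy, p_x) if px > 0)

# Chain rule: H(X,Y) = H(X) + H(Y|X).
assert abs(H_xy - (H_x + H_y_given_x)) < 1e-12

# H(X|Y), Method 1: the basic equation -sum p(x,y) log2 p(x|y).
H_x_given_y = -sum(pxy * log2(pxy / py)
                   for row in P_xy
                   for pxy, py in zip(row, p_y) if pxy > 0)
# Method 2: the p(y)-weighted average of the column entropies H(X|Y=y).
H_x_given_y_avg = sum(py * H([row[j] / py for row in P_xy])
                      for j, py in enumerate(p_y) if py > 0)
assert abs(H_x_given_y - H_x_given_y_avg) < 1e-12

# Mutual information: the three equivalent expressions from step 11.
I_xy = H_x - H_x_given_y
assert abs(I_xy - (H_y - H_y_given_x)) < 1e-12
assert abs(I_xy - (H_x + H_y - H_xy)) < 1e-12
```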
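Fano's inequality from step 10 can also be checked numerically. The joint distribution below is again hypothetical, and the decision rule is assumed to be the usual MAP guess (the worked example's actual estimator is not reproduced here):

```python
from math import log2

# Hypothetical joint p(x, y) (rows = X, columns = Y); made-up numbers.
P_xy = [[1/8,  1/16, 1/32, 1/32],
        [1/16, 1/8,  1/32, 1/32],
        [1/16, 1/16, 1/16, 1/16],
        [1/4,  0,    0,    0]]
p_y = [sum(col) for col in zip(*P_xy)]

def h2(p):
    """Binary entropy H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# Assumed estimator: the MAP guess Xhat(y) = argmax_x p(x|y).
# An error occurs when Xhat != X, so P_e = 1 - sum_y max_x p(x, y).
P_e = 1 - sum(max(col) for col in zip(*P_xy))

# H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y)
H_x_given_y = -sum(pxy * log2(pxy / py)
                   for row in P_xy
                   for pxy, py in zip(row, p_y) if pxy > 0)

# Fano's inequality: H(P_e) + P_e log2(|X| - 1) >= H(X|Y).
fano_bound = h2(P_e) + P_e * log2(len(P_xy) - 1)
assert fano_bound >= H_x_given_y
```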
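For the Binary Erasure Channel example, here is a small sketch of the mutual-information calculation (the erasure probability is a made-up value). The channel delivers $X$ intact with probability $1-\varepsilon$ and outputs an erasure symbol otherwise, so $H(X \mid Y) = \varepsilon H(X)$ and $I(X;Y) = (1-\varepsilon)H(X)$; the uniform input maximizes this, giving capacity $C = 1-\varepsilon$:

```python
from math import log2

def binary_entropy(p):
    """H(p) in bits, with 0 log 0 = 0."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def bec_mutual_info(eps, p1):
    """I(X;Y) for a binary erasure channel with erasure probability eps
    and input distribution P(X=1) = p1.  The output reveals X exactly
    unless it is erased, so H(X|Y) = eps * H(X) and
    I(X;Y) = H(X) - H(X|Y) = (1 - eps) * H(X)."""
    return (1 - eps) * binary_entropy(p1)

eps = 0.25  # made-up erasure probability for illustration
# The uniform input maximizes I(X;Y), giving the capacity C = 1 - eps.
assert abs(bec_mutual_info(eps, 0.5) - (1 - eps)) < 1e-12
assert bec_mutual_info(eps, 0.3) < bec_mutual_info(eps, 0.5)
```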