Let's start with some basics:

1. The transition probabilities are given as input data. We can write this compactly as a matrix. Note that the columns of this matrix sum to 1 (why?), but that the rows do not (why not?).

2. What about the marginal probabilities? (These can be found several ways.) Let's consider one case in detail; the others follow similarly.

3. What about the joint distribution, $P(X,Y)$? Working entry by entry, and noting that rows correspond to $X$ and columns to $Y$, we obtain the joint matrix. Also, note that the marginal $P(X)$ can be computed by summing the rows of $P(X,Y)$. Similarly, $P(Y)$ can be computed by summing the columns of $P(X,Y)$.

4. Next, we would like to compute the conditional distribution $P(Y \mid X)$. Note that its rows sum to 1, but its columns don't (why?).

Now we can begin our entropy-related calculations:

5. First, the marginal entropies, using $H(X) = -\sum_x p(x)\log_2 p(x)$ (and similarly for $H(Y)$).

6. Why is that?

7. Next, let's calculate the conditional entropy:
$$H(Y \mid X) = \sum_x p(x)\, H(Y \mid X = x).$$

8. What about the joint entropy,
$$H(X,Y) = -\sum_{x,y} p(x,y)\log_2 p(x,y)?$$
But we could also calculate this using the chain rule:
$$H(X,Y) = H(X) + H(Y \mid X).$$

9. Now, we want to compute the less intuitive $H(X \mid Y)$:
Method 1: Straightforward application of the basic equation,
$$H(X \mid Y) = -\sum_{x,y} p(x,y)\log_2 p(x \mid y).$$
Method 2: Using intuition :) First compute $H(X \mid Y = y)$ for each value $y$; then
$$H(X \mid Y) = \sum_y p(y)\, H(X \mid Y = y)$$
is just the (weighted) average of these.

10. Finally, from Fano's inequality,
$$H(P_e) + P_e \log_2(|\mathcal{X}| - 1) \ge H(X \mid Y).$$
An error occurs when the estimate $\hat{X} \ne X$. Substituting the values of $H(X \mid Y)$ and $P_e$ computed above verifies that the bound holds.

11. While we are at it, let's check a few more relationships involving mutual information:
$$I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X,Y).$$

Example: Discrete Memoryless Channel (Binary Erasure Channel)
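Steps 2-4 above (marginals by summing rows and columns of the joint matrix, then conditionals by normalizing) can be sketched in Python. The joint distribution below is a made-up placeholder, since the example's actual numbers are not reproduced here:

```python
from fractions import Fraction as F

# Hypothetical joint distribution p(x, y); rows index X, columns index Y.
# (Made-up numbers -- not the values from the worked example above.)
P_xy = [[F(1, 8),  F(1, 16), F(1, 32), F(1, 32)],
        [F(1, 16), F(1, 8),  F(1, 32), F(1, 32)],
        [F(1, 16), F(1, 16), F(1, 16), F(1, 16)],
        [F(1, 4),  F(0),     F(0),     F(0)]]

# Marginals: P(X) by summing each row, P(Y) by summing each column.
p_x = [sum(row) for row in P_xy]
p_y = [sum(col) for col in zip(*P_xy)]

# Conditionals: p(y|x) = p(x,y)/p(x), so the rows of P(Y|X) sum to 1;
#               p(x|y) = p(x,y)/p(y), so the columns of P(X|Y) sum to 1.
P_y_given_x = [[pxy / px for pxy in row] for row, px in zip(P_xy, p_x)]
P_x_given_y = [[pxy / py for pxy, py in zip(row, p_y)] for row in P_xy]
```

Using exact rationals (`fractions.Fraction`) makes the row and column sums come out exactly 1, which is handy for the "(why?)" sanity checks.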
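Likewise, the entropy calculations of steps 5-9 (marginal entropies, the chain-rule check on the joint entropy, the two methods for $H(X \mid Y)$) and the mutual-information identities of step 11 can be sketched, again with a hypothetical joint distribution rather than the example's actual numbers:

```python
from math import log2

# Hypothetical joint p(x, y) (rows = X, columns = Y); made-up numbers,
# not the values from the worked example above.
P_xy = [[1/8,  1/16, 1/32, 1/32],
        [1/16, 1/8,  1/32, 1/32],
        [1/16, 1/16, 1/16, 1/16],
        [1/4,  0,    0,    0]]

p_x = [sum(row) for row in P_xy]
p_y = [sum(col) for col in zip(*P_xy)]

def H(dist):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(p * log2(p) for p in dist if p > 0)

H_x, H_y = H(p_x), H(p_y)
H_xy = H([p for row in P_xy for p in row])  # joint entropy H(X,Y)

# H(Y|X) = sum_x p(x) H(Y | X = x): the p(x)-weighted row entropies.
H_y_given_x = sum(px * H([pxy / px for pxy in row])
                  for row, px in zip(P_xy, p_x) if px > 0)

# Chain rule: H(X,Y) = H(X) + H(Y|X).
assert abs(H_xy - (H_x + H_y_given_x)) < 1e-12

# H(X|Y), Method 1: the basic equation -sum p(x,y) log2 p(x|y).
H_x_given_y = -sum(pxy * log2(pxy / py)
                   for row in P_xy
                   for pxy, py in zip(row, p_y) if pxy > 0)
# Method 2: the p(y)-weighted average of the column entropies H(X|Y=y).
H_x_given_y_avg = sum(py * H([row[j] / py for row in P_xy])
                      for j, py in enumerate(p_y) if py > 0)
assert abs(H_x_given_y - H_x_given_y_avg) < 1e-12

# Mutual information: the three equivalent expressions from step 11.
I_xy = H_x - H_x_given_y
assert abs(I_xy - (H_y - H_y_given_x)) < 1e-12
assert abs(I_xy - (H_x + H_y - H_xy)) < 1e-12
```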
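Fano's inequality from step 10 can also be checked numerically. The joint distribution below is again hypothetical, and the decision rule is assumed to be the usual MAP guess (the worked example's actual estimator is not reproduced here):

```python
from math import log2

# Hypothetical joint p(x, y) (rows = X, columns = Y); made-up numbers.
P_xy = [[1/8,  1/16, 1/32, 1/32],
        [1/16, 1/8,  1/32, 1/32],
        [1/16, 1/16, 1/16, 1/16],
        [1/4,  0,    0,    0]]
p_y = [sum(col) for col in zip(*P_xy)]

def h2(p):
    """Binary entropy H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# Assumed estimator: the MAP guess Xhat(y) = argmax_x p(x|y).
# An error occurs when Xhat != X, so P_e = 1 - sum_y max_x p(x, y).
P_e = 1 - sum(max(col) for col in zip(*P_xy))

# H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y)
H_x_given_y = -sum(pxy * log2(pxy / py)
                   for row in P_xy
                   for pxy, py in zip(row, p_y) if pxy > 0)

# Fano's inequality: H(P_e) + P_e log2(|X| - 1) >= H(X|Y).
fano_bound = h2(P_e) + P_e * log2(len(P_xy) - 1)
assert fano_bound >= H_x_given_y
```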
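For the Binary Erasure Channel example, here is a small sketch of the mutual-information calculation (the erasure probability is a made-up value). The channel delivers $X$ intact with probability $1-\varepsilon$ and outputs an erasure symbol otherwise, so $H(X \mid Y) = \varepsilon H(X)$ and $I(X;Y) = (1-\varepsilon)H(X)$; the uniform input maximizes this, giving capacity $C = 1-\varepsilon$:

```python
from math import log2

def binary_entropy(p):
    """H(p) in bits, with 0 log 0 = 0."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def bec_mutual_info(eps, p1):
    """I(X;Y) for a binary erasure channel with erasure probability eps
    and input distribution P(X=1) = p1.  The output reveals X exactly
    unless it is erased, so H(X|Y) = eps * H(X) and
    I(X;Y) = H(X) - H(X|Y) = (1 - eps) * H(X)."""
    return (1 - eps) * binary_entropy(p1)

eps = 0.25  # made-up erasure probability for illustration
# The uniform input maximizes I(X;Y), giving the capacity C = 1 - eps.
assert abs(bec_mutual_info(eps, 0.5) - (1 - eps)) < 1e-12
assert bec_mutual_info(eps, 0.3) < bec_mutual_info(eps, 0.5)
```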