## Dive into the fundamental elements of Bayesian networks and their role in probabilistic graphical models

This article is part one of a two-part series in which we explore the theory behind Bayesian Networks and then build one using the pygm Python library, among others. This part is dedicated to reviewing the essential probabilistic theory that enables us to understand and draw inferences from any network.

The goal of this article is to demonstrate the essential probabilistic ideas behind Bayesian Networks so that in part two, we can begin building one. I am going to assume throughout the article that the audience understands what random variables are, and is at least familiar with, or has been exposed to: joint probability distributions, conditional probability distributions, conditional relationships, independence between random variables, conditional independence between random variables, probability mass functions, cumulative distribution functions, and Bayes' Theorem.

We will be doing a lot of review today, so don't worry if you feel you need to brush up on your understanding of these concepts. I will provide the data, explanations, and custom visualizations for each section of the article so that you can follow along.

**OVERVIEW OF THE DATA**

I am one for understanding what the final destination looks like when working on any project. Therefore, I will do a quick overview of the dataset we will use in part two. I have also provided a visual representation of what the final network will look like.

The dataset that we will be building the Bayesian Network on consists of cab company data and can be pulled, or directly downloaded, from my personal GitHub. You will have a choice between the *.csv* version and a *.pkl* version. A significant amount of data preprocessing and feature engineering has already been performed *(we will go over that in the next article in the series)*. The dataset consists of fifteen random variables, each containing close to half a million observations.

`from pandas import read_csv`

`df0 = read_csv('mdf.csv')`

`df0.head()`

**WHAT ARE BAYESIAN NETWORKS?**

Bayesian Networks are a type of probabilistic graphical model that visualizes the conditional dependency relationships between random variables. They can be discrete or continuous, and primarily consist of nodes *(random variables)* and edges *(lines with an arrow at the tip)* which connect the nodes. The network as a whole encodes a joint probability distribution over all of its nodes, and each node has a conditional probability distribution, or *CPD*, associated with it. The *CPD* represents the probability of the random variable given its parents. Once the Bayesian Network is built, we can then extract information or insight from the network through the use of queries *(probabilistic questions we can ask it)*.

I chose to arrange the network we will be building today into what resembles a tree data structure. I believe this will help the audience build an intuition for the composition and directional relationships inherent within the network. While Bayesian Networks *can* form a tree structure, like the one below, they can be of any shape or complexity.

Looking at the network above, what exactly is a conditional relationship, or a conditional probability distribution? To understand everything that is going on in the diagram, we will review most of the essential concepts, and then zoom into a small example that helped me clarify many of these ideas some time ago. The review and example will highlight the two most important features of a Bayesian Network: its distributions, and the relationships between those distributions.

**REVIEW: JOINT, CONDITIONAL, & MARGINAL PROBABILITY DISTRIBUTIONS**

**Marginal Probability Distributions**

A **marginal distribution** can be thought of as your everyday, garden-variety univariate probability distribution nested inside a joint distribution. So, if we have the joint distribution defined as *P(X, Y)*, the marginals of this joint distribution would be *P(X)* or *P(Y)* in isolation.

**Joint Probability Distributions**

A **joint probability distribution** is a probability distribution over two or more random variables that models the probability of *(X and Y)* happening simultaneously. Joint distributions can be discrete, continuous, or a combination of both. Today we will be focusing on the discrete type. Not only are they easier to construct, but much of the intuition from discrete distributions can be applied to continuous distributions. The probability mass function *(PMF)* and the cumulative distribution function *(CDF)* for a joint probability distribution are stated as follows:
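The article's figure for these formulas is not reproduced here; in standard notation, for two discrete random variables *X* and *Y*, they are:

```latex
p_{X,Y}(x, y) = P(X = x,\, Y = y)

F_{X,Y}(x, y) = P(X \le x,\, Y \le y) = \sum_{x_i \le x} \sum_{y_j \le y} p_{X,Y}(x_i, y_j)
```

The PMF gives the probability of one exact combination of states, while the CDF accumulates the PMF over all states up to *(x, y)*.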

To construct any discrete joint distribution, we need three essential pieces of information.

- We need to know how many random variables the joint distribution will be comprised of.
- We need to know how many states or levels each discrete random variable can take on.
- We need to know the probability of each combination of states that each random variable can take on.

By *'states'* I mean the different categories that a discrete random variable can possess or be fit into. For example, if we were to build a joint distribution comprised of the two random variables *day_of_week* and *payment_type*, to get the probability of each combination of states, we would take the Cartesian product of the sets of states derived from each random variable. Then, we would assign probabilities to each tuple of the Cartesian product. These probabilities reflect the likelihood of observing a specific combination of states, allowing us to analyze the simultaneous behavior of the random variables.
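As a small illustration, the combinations can be enumerated with `itertools.product`; the state names below are invented for the example, not taken from the actual dataset:

```python
from itertools import product

# Illustrative state sets for two discrete random variables
day_of_week = ['monday', 'tuesday', 'wednesday']
payment_type = ['cash', 'credit']

# Every combination of states the two variables can take on together
combinations = list(product(day_of_week, payment_type))
print(combinations)

# Each tuple would then be assigned a probability; together the
# probabilities must sum to 1 to form a valid joint distribution.
```
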

But how do we acquire probability values like *0.1369* or *0.0913*? The best way to visualize this, in my opinion, is to think of a dataframe that contains the *day_of_week* random variable and the other, *payment_type*. To find P(*day_of_week* = monday, *payment_type* = cash), we need only count the number of rows for which this probability statement, or what I like to refer to as a query, is true, then divide by the total number of observations *(the number of rows in the dataframe)*. If we do this for every combination of states in {*day_of_week*} x {*payment_type*}, then arrange them into a table, the result would be the joint probability distribution of *day_of_week* and *payment_type*, as seen above.
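The counting procedure just described can be sketched with `pandas.crosstab`, which counts each combination of states and divides by the total in one step. The toy rows below are invented; the real dataset is the one linked on GitHub:

```python
import pandas as pd

# Toy data standing in for the real cab dataset
df = pd.DataFrame({
    'day_of_week':  ['monday', 'monday', 'tuesday', 'monday', 'tuesday', 'tuesday'],
    'payment_type': ['cash',   'credit', 'cash',    'cash',   'credit',  'cash'],
})

# Count rows for each combination of states, then divide by the total
# number of observations: normalize=True does both at once.
joint = pd.crosstab(df['day_of_week'], df['payment_type'], normalize=True)
print(joint)

# A single joint probability, e.g. P(day_of_week=monday, payment_type=cash)
p = joint.loc['monday', 'cash']
```
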

**Dependence & Independence of Joint Distributions**

**When events are independent, the occurrence or non-occurrence of one event does not affect the probability of the other event.**

Most of us have seen the product rule for independent events, which is a definition derived from the fundamental axioms of probability. It states that the probability of *(X and Y)* happening at the same time is the product of their individual probabilities if and only if *X* and *Y* are independent.

This means that to check any joint distribution for independence, we have to multiply every marginal probability of *X* with every marginal probability of *Y*, and if each product equals its respective joint probability, then our random variables are independent, meaning one does not affect the other. *(Be careful with the Venn-diagram picture here: no overlap between two events means they are mutually exclusive, which is not the same as independence. Two independent events with nonzero probability do overlap; the overlap is exactly the product of their probabilities.)*
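As a sketch, the check described above can be written as a small function over a joint table stored as a dict; the probabilities below are illustrative:

```python
from itertools import product

def is_independent(joint, tol=1e-9):
    """Check whether a discrete joint distribution factorizes into
    the product of its marginals."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    # Marginals: sum the joint over the other variable
    px = {x: sum(joint[(x, y)] for y in ys) for x in xs}
    py = {y: sum(joint[(x, y)] for x in xs) for y in ys}
    # Independent iff every joint probability equals the product of marginals
    return all(abs(joint[(x, y)] - px[x] * py[y]) <= tol
               for x, y in product(xs, ys))

# A joint built as a product of its marginals: independent
indep = {('a', 'c'): 0.12, ('a', 'd'): 0.28,
         ('b', 'c'): 0.18, ('b', 'd'): 0.42}
# Same marginals, but the mass is shifted: dependent
dep = {('a', 'c'): 0.20, ('a', 'd'): 0.20,
      ('b', 'c'): 0.10, ('b', 'd'): 0.50}

print(is_independent(indep))  # True
print(is_independent(dep))    # False
```
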

Moreover, most of us have seen the product rule for dependent events, which is also a definition derived from the fundamental axioms of probability.

It states that the probability of *(X and Y)* happening at the same time is the product of *X*'s marginal probability and the conditional probability of *Y* given that *X* has occurred.

If we perform the test mentioned in the *'independence'* section above, and at least one joint probability does not equal the product of its marginals, then we can conclude our random variables are dependent. In other words, knowing the outcome of one tells us something about the other.

**Conditional Probability Distributions**

When I was still a student, one of the most difficult concepts for me to understand in probability theory was conditional probability distributions. This is part of the reason why I chose to write my first article about the topic. Not only do I get the chance to share my understanding with others, but it is also a great exercise in the Feynman Method.

What finally made it click for me, and hopefully will for you in a moment, was learning that a conditional probability distribution is always a renormalized slice of a joint distribution. When we query a joint distribution *(put restrictions/conditions on the sample space)*, what is returned is a smaller subset of probabilities: the proportion of the joint distribution that makes the query true. So, while, in the context of a dataframe, the joint distribution captures the proportion of rows in a dataset that make a query true, the conditional distribution captures the proportion of the joint distribution that makes a query true.
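The formula referenced just below appears in the article as an image; in standard notation it is:

```latex
P(X \mid Y) = \frac{P(X, Y)}{P(Y)}, \qquad P(Y) > 0
```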

The formula above is the conditional probability formula. It is used to find individual conditional probabilities.

To simplify the idea, we can think of it in terms of percentages: we are asking what percentage of the marginal probability of *Y* the joint occurrence of *(X and Y)* makes up.

**Dependence & Independence of Conditional Probability Distributions**

Now that we've demystified the concept of conditional probability distributions, let's focus a little more deeply on dependence and independence within these distributions. Understanding these relationships is crucial, as they form the backbone of Bayesian Networks and of drawing inferences from them.

**Conditional Dependence & Independence**

Conditional dependence and independence are concepts that emerge when we explore how variables behave in the context of specific conditions *(evidence)*. It is important to remember that these relationships are not inherent properties of variables but rather contextual. This means that variables can be dependent or independent depending on the context from which you are viewing them.

For example, let's say that we are interested in how the number of *hours_studied* prior to a test affects *test_score*. If we first gather data from a single classroom on a campus, we might discover a strong correlation between *hours_studied* and *test_score*. However, if we expand the dataset to include all classrooms on campus, we might find that the relationship between *test_score* and *hours_studied* varies greatly. This could imply independence.

So, while in the context of a single classroom *hours_studied* and *test_score* were indeed dependent, when we stop controlling for classroom and zoom out to a higher level of abstraction that includes all classrooms, the dependence disappears.

When variables are conditionally dependent, it means that their behaviors are interconnected when certain conditions are met. In contrast, conditional independence arises when variables become free of each other's influence under certain conditions.

In essence, it means that a *dependent* relationship between two random variables can be broken by the introduction of a third, or an *independent* relationship between two random variables can be formed through the introduction of a third.
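This can be demonstrated numerically. The sketch below uses invented probabilities for two variables *X* and *Y* that are both driven by a third variable *Z*: marginally they are dependent, but given *Z* their joint distribution factorizes:

```python
# P(Z), P(X|Z), P(Y|Z): Z drives both X and Y (all values illustrative)
p_z = {1: 0.5, 0: 0.5}
p_x_given_z = {1: 0.9, 0: 0.1}   # P(X=1 | Z=z)
p_y_given_z = {1: 0.9, 0: 0.1}   # P(Y=1 | Z=z)

# Marginal joint P(X=1, Y=1): sum over Z of P(Z) * P(X=1|Z) * P(Y=1|Z)
p_xy = sum(p_z[z] * p_x_given_z[z] * p_y_given_z[z] for z in (0, 1))
p_x = sum(p_z[z] * p_x_given_z[z] for z in (0, 1))
p_y = sum(p_z[z] * p_y_given_z[z] for z in (0, 1))

print(p_xy)       # ~0.41
print(p_x * p_y)  # ~0.25 -> marginally dependent, since 0.41 != 0.25

# Conditioned on Z = 1, the joint factorizes (by construction here):
p_xy_given_z1 = p_x_given_z[1] * p_y_given_z[1]
print(p_xy_given_z1)  # ~0.81 = P(X=1|Z=1) * P(Y=1|Z=1)
```

The marginal dependence comes entirely from the shared cause *Z*; once *Z* is known, observing *X* adds nothing about *Y*.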

**Tying Everything Together…**

Joint probability, as we have already seen, is the probability of two or more events happening simultaneously, and it is a distinct idea from marginal and conditional probability. To sum up what we have reviewed so far, let's take a look at another example that simply illustrates these ideas.

**Example 1**: *Say we have two random variables that can either be true or false: the first being 'I mow the lawn', and the second being 'whether it rains'. To form the joint probability distribution of these two random variables, we have to account for the entire sample space, which is the space containing all possible combinations of states the random variables can take on together.*

As we can see, the joint distribution sums to 1; therefore, we can conclude we have stated all possible events that can occur with respect to our two random variables.

To find the marginal probabilities, *P(I mow the lawn)*, and *P(Rain)*, from our joint distribution, we just reference the probabilities for which the statement *‘I mow the lawn’* is true, and sum them together:

The same goes for the marginal probability of rain or any other random variables we may have in our joint distribution:

Now, keeping the conditional probability formula in mind, let's query a single conditional probability.

If we step back and look at exactly what we are doing here, we can see that conditional probability, when thinking in terms of Venn diagrams, is calculating the percentage of *Y* that overlaps with *X*. Therefore, the intersection *(X and Y)*, me mowing the lawn when it rains, makes up *66.7%* of the marginal probability of rain.
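The whole worked example can be sketched in a few lines. The article's joint table lives in a figure, so the numbers below are invented, chosen only so that the conditional works out to roughly *66.7%*:

```python
# Illustrative joint distribution over (mow, rain); these numbers are
# stand-ins for the article's figure, chosen so that
# P(mow=True | rain=True) comes out to ~66.7%.
joint = {
    (True,  True):  0.2,   # I mow and it rains
    (True,  False): 0.4,   # I mow, no rain
    (False, True):  0.1,   # I don't mow, it rains
    (False, False): 0.3,   # I don't mow, no rain
}
assert abs(sum(joint.values()) - 1.0) < 1e-9  # valid distribution

# Marginals: sum the entries where the statement is true
p_mow  = joint[(True, True)] + joint[(True, False)]   # P(I mow the lawn)
p_rain = joint[(True, True)] + joint[(False, True)]   # P(it rains)

# Conditional probability formula: P(mow | rain) = P(mow, rain) / P(rain)
p_mow_given_rain = joint[(True, True)] / p_rain
print(round(p_mow_given_rain, 3))  # ~0.667
```
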

**Bayes' Theorem**

I saved Bayes' Theorem for last because I believe that understanding everything up until now makes it much easier to grasp.

Mathematically, Bayes’ Theorem is expressed as:
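The expression appears in the article as an image; in standard notation:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```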

Bayes' Theorem is named after Reverend Thomas Bayes *(1702–1761)*, an 18th-century statistician. It provides a way to update our beliefs *(probabilities)* about an event or hypothesis based on new data or information.

In practice, we start with an initial belief *(the prior probability)*, collect data, calculate the likelihood of the data under various hypotheses, and then use Bayes' Theorem to update our belief *(the posterior probability)* based on the observed data. This is a fundamental concept in the Bayesian flavor of statistics. Although I will not say anything further about Bayesian statistics here, it is definitely worth learning about.

Now, suppose a researcher was performing a study about mowing lawns and how rain keeps me from doing so. Let's further assume that the only data the researcher has available is the joint distribution we have been working with. What if the researcher wanted to estimate *P(it rains | I mow the lawn)* using our joint distribution? Could they do it using Bayes' Theorem?

Yes! As we can see, there is a subtle but important difference between the two types of conditional queries. The difference lies in what the primary variable of interest is. In the original conditional query, we were primarily interested in probabilities associated with the lawn and the effect rain had on me mowing it, whereas the researcher's query is primarily interested in probabilities associated with rain, and the effect me mowing the lawn has on *its* chances.

Original Conditional Query:

Researcher's Query:

The difference in queries is an important observation, and part of the reason why Bayes' Theorem is so useful. Information about rain, and how me mowing the lawn affects its chances, is not readily apparent at first glance of the original joint distribution. However, to query that information from the joint distribution, we just need to plug the appropriate values into Bayes' Theorem, as we did above. Notice the intermediate calculation *P(X|Y)*, which here is *P(I mow the lawn | it rains)*; this term of Bayes' Theorem is known as the likelihood.
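The researcher's query can be sketched numerically. The article's joint table is in a figure, so the values below are illustrative stand-ins consistent with a valid joint distribution:

```python
# Invented marginals/conditional for the lawn/rain example; the article's
# actual table is in a figure, so these numbers are illustrative only.
p_rain = 0.3               # prior:      P(it rains)
p_mow = 0.6                # evidence:   P(I mow the lawn)
p_mow_given_rain = 2 / 3   # likelihood: P(I mow the lawn | it rains)

# Bayes' Theorem: P(rain | mow) = P(mow | rain) * P(rain) / P(mow)
p_rain_given_mow = p_mow_given_rain * p_rain / p_mow
print(round(p_rain_given_mow, 3))  # ~0.333
```
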

**Bayes' Theorem & the Conditional Probability Formula**

So, what are the differences between Bayes' Theorem and the conditional probability formula? Just looking at the two formulas, we can see:

- *The conditional probability formula uses the joint distribution explicitly, whereas Bayes' Theorem uses it implicitly within the likelihood term.*
- *Bayes' Theorem is more general than the conditional probability formula.* While the two are mathematically equivalent, the choice of which formula to use relies heavily on the context of the problem and how much information we have about it. For example, if we have a fully defined joint distribution in front of us, we may choose the conditional probability formula, whereas if we already have conditional probabilities calculated from different experiments, we might use Bayes' Theorem.
- *Bayes' Theorem uses the likelihood term P(B|A), which attaches to hypotheses of specific experiments, whereas probability attaches to outcomes, or events.* The main takeaway is that the terminology depends on context.
- *Bayes' Theorem incorporates the law of total probability in the denominator*, which is used to capture all possible ways, or scenarios, in which the event of interest can occur within the entire sample space.

**BAYESIAN NETWORK EXAMPLE**

**Example 2**: *Let's say you have a new burglar alarm. It is fairly reliable at detecting burglary, but also sometimes responds to minor earthquakes. You have two neighbors, John and Mary, who have promised to call you when they hear it.*

John always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm. Mary likes loud music and sometimes misses the alarm. Here is what this network looks like:

As we can see, even this small example of a Bayesian Network can become quite complicated pretty quickly. Therefore, in preparation for part two of this article, let's slow down and take the time to get a feel for how to manually analyze these networks. This will help tremendously when interpreting the results we get from the network we code in Python.

**STRUCTURAL & RELATIONSHIP ANALYSIS**

We have a total of five random variables represented by the orange nodes: *burglary*, *earthquake*, *alarm*, *john calls*, and *mary calls*. At this stage of the analysis, it is important to start getting a feel for how the random variables are related to each other in reality, and how they appear to be related to each other in the directed graph.

**Random Variables: Burglary & Earthquake**

Logically, we know that *burglary* and *earthquake* are independent of each other. Meaning, how often burglaries occur says nothing about how often earthquakes occur. Moreover, since *burglary* and *earthquake* are considered root nodes in the context of this network, the only probabilities associated with them are their marginal probabilities:

Makes sense, right? There is nothing that directly affects or causes them *(within the context of the network)*, and they are not child nodes of any other random variable. Therefore, their probabilities are not affected. This implies *burglary* and *earthquake* are marginally independent of each other prior to any observation. However, interestingly enough, once we *do* know the status of the alarm *(whether it has been activated)*, *burglary* and *earthquake* become conditionally dependent on one another. Why is this the case, if prior to observing the alarm, *burglary* and *earthquake* were marginally independent? It is because the alarm provides information about its potential causes *(burglary and earthquake)*, making their probabilities dependent on each other.

This hidden dependency is related to an important principle in probabilistic reasoning called *‘explaining away’*.

**Intuitive Explanation of Explaining Away**:

1. Before Observing the *Alarm (A)*:

- Burglary and Earthquake are marginally independent, meaning they don't affect each other's probabilities, as stated in the previous section.
- You might think of these two events, B and E, as potential causes for the Alarm to go off.

2. Observing the *Alarm (A)*:

- Let's say you observe the Alarm being activated *(A = true)*.
- Now we have evidence suggesting the Alarm could be caused by either burglary *(B)* or earthquake *(E)*, or both.
- Initially, we might lean towards one of them, say burglary, as the primary cause, because burglary might seem to have a higher chance of occurring.

3. Explaining Away:

- However, explaining away comes into play when we consider the possibility of both causes, burglary and earthquake.
- If we receive additional evidence, say, strong evidence of an earthquake *(E = true)*, even though we already observed the alarm, it can make us reconsider our initial belief in burglary as the sole cause.
- The presence of evidence for earthquake *"explains away"* our initial inclination to attribute the alarm solely to burglary.
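The effect can be verified numerically. The sketch below uses the classic textbook CPT values for this alarm network *(from Russell & Norvig's treatment of the same example)*; the figures in this article may use different numbers:

```python
from itertools import product

# Classic textbook CPT values for the alarm network; the article's own
# figures may use different numbers.
p_b = {True: 0.001, False: 0.999}   # P(Burglary)
p_e = {True: 0.002, False: 0.998}   # P(Earthquake)
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A=t | B=b, E=e)

# Joint P(B=b, E=e, A=true) for every (b, e)
joint_a = {(b, e): p_b[b] * p_e[e] * p_a[(b, e)]
           for b, e in product([True, False], repeat=2)}

# P(B=t | A=t): burglary is a fairly likely explanation of the alarm
p_b_given_a = (joint_a[(True, True)] + joint_a[(True, False)]) / sum(joint_a.values())

# P(B=t | A=t, E=t): once an earthquake is also observed, it "explains
# away" the alarm, and the probability of burglary collapses
p_b_given_ae = joint_a[(True, True)] / (joint_a[(True, True)] + joint_a[(False, True)])

print(round(p_b_given_a, 3))   # ~0.374
print(round(p_b_given_ae, 4))  # ~0.0033
```

Observing the alarm alone makes burglary quite plausible, but adding evidence of an earthquake drives the probability of burglary back down toward its tiny prior.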

**Random Variables: Alarm**

*Alarm* is the child node of *burglary* and *earthquake*, and is considered the parent of *john calls* and *mary calls*. It is the random variable that ties the whole network together, and is considered a mediating random variable. This means that it transmits the influence of *burglary* and *earthquake* to both *mary calls* and *john calls*, as illustrated by the directed edges we see in the network. This implies conditional dependencies between alarm and its parents, and between alarm and its children. In other words, the probability of the *alarm* being activated is influenced by whether a burglary or earthquake has occurred, which in turn influences whether John or Mary call.

**Random Variables: john calls & mary calls**

Looking at the network, *john calls* and *mary calls* are the ultimate effects. Since they both have no children, this is where the flow of information terminates. As mentioned in the previous section on their parent node *alarm*, the directed edges that flow from *alarm* to *john calls* and *mary calls* imply a dependent conditional relationship. In other words, *john calls* and *mary calls* are conditionally dependent on *alarm*, just as *alarm* is conditionally dependent on *earthquake* and *burglary*.

**DRAWING INFERENCES USING LAWS OF PROBABILITY**

Now, let's see what types of inferences we can draw from this example network.

**Example Query 1**: *Given that John calls and Mary doesn't call, what is the probability that there was a burglary and an earthquake? In other words, P(B=t, E=t | J=t, M=f)*

Using Bayes' Theorem we have:

This result makes sense, right? Logically, it would be a very rare event for both a burglary and an earthquake to happen, with John calling but Mary not.

**Example Query 2**: *What is the probability that the alarm has sounded but neither a burglary nor an earthquake has occurred, and both John and Mary call?*

We can calculate this query in two different ways: we could use Bayes' Theorem again, or we can read the question very carefully and see that we can just use the multiplication rule for dependent events.

If we remember from our review, the product rule for dependent and conditionally dependent events is:

Therefore, the joint probability is:

Now, does it make sense that this is such a small number? The answer is yes. We would expect the alarm sounding with no burglary and no earthquake, while both John and Mary call, to be a very infrequent event. Meaning, this combination of situations doesn't happen together very often.
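The multiplication along the network can be sketched as follows; the CPT values are the classic textbook ones for this example and may differ from the article's figures:

```python
# Classic textbook CPT values (illustrative; the article's figures may differ)
p_b_false = 0.999               # P(B = false)
p_e_false = 0.998               # P(E = false)
p_a_given_not_b_not_e = 0.001   # P(A = true | B = false, E = false)
p_j_given_a = 0.90              # P(J = true | A = true)
p_m_given_a = 0.70              # P(M = true | A = true)

# Product rule for (conditionally) dependent events, applied along the
# network's structure: each node is conditioned only on its parents.
p_query = (p_b_false * p_e_false * p_a_given_not_b_not_e
           * p_j_given_a * p_m_given_a)
print(p_query)  # ~0.000628
```
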

**Example Query 3**:*What is the marginal probability of alarm?*

To find marginal probabilities within Bayesian Networks, we can use a very simple tool: the law of total probability.
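The formula itself appears in the article as an image; stated generally, for a set of mutually exclusive and exhaustive events $B_1, \dots, B_n$:

```latex
P(A) = \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)
```

In this network, the conditioning events are the joint states of *burglary* and *earthquake*.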

This essentially states that when we want to find a marginal probability, in this case the marginal probability of *alarm*, we take a weighted sum: the appropriate conditional probability multiplied by the marginal probability of each combination of states of the conditioning variables. In other words, we sum up the contributions from each scenario, weighted by the probability of that scenario occurring. The algorithm is as follows:

This is exactly what we use in the denominator of Bayes' Theorem, depending on how many different states we are working with.
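As a sketch, marginalizing out *burglary* and *earthquake* with the classic textbook CPT values *(which may differ from this article's figures)* looks like:

```python
from itertools import product

# Classic textbook CPT values; illustrative, may differ from the figures.
p_b = {True: 0.001, False: 0.999}   # P(Burglary)
p_e = {True: 0.002, False: 0.998}   # P(Earthquake)
p_a_given_be = {(True, True): 0.95, (True, False): 0.94,
                (False, True): 0.29, (False, False): 0.001}

# Law of total probability:
# P(A=t) = sum over (b, e) of P(A=t | b, e) * P(b) * P(e)
p_alarm = sum(p_a_given_be[(b, e)] * p_b[b] * p_e[e]
              for b, e in product([True, False], repeat=2))
print(round(p_alarm, 5))  # ~0.00252
```
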

**Some Final Thoughts**

So, I hope whoever has read this article has either learned something new or has reinforced what they already know. Either way, it has been a great experience writing my first article on Medium. I look forward to learning along with everyone again when I write the next article in this series. I haven't named it yet, so let's just call it *'Part 2'* for now. In *Part 2*, I will focus on the implementation of the Bayesian Network that was shown at the beginning of this article. We will be using the Python library *pygm*, which abstracts away a lot of the details and allows us to build graphical models with relative ease. I have attached the data via a link to my GitHub at the start of the article, in case anyone wants to get a head start!

**Until next time, Sayonara!**