*(If you haven’t already, you should read **part 1 of this article** which was posted on April 13, 2011)*

Simpson’s paradox exposes a flaw in intuitive reasoning that’s hard to accept without credible examples. It occurs when a trend that appears in each of several groups reverses once the groups are combined. The example we’ve been working with deals with batting averages. The first part of the article presented a hypothetical situation in which one batter had a higher batting average against left-handed pitchers and against right-handed pitchers, taken separately. However, when the overall batting average was calculated, that same batter had the lower average.
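The reversal comes from unequal weighting: an overall average is total hits divided by total at-bats, so each split counts in proportion to its at-bats. Here is a minimal MATLAB sketch of that arithmetic. The counts are invented for illustration and are not the numbers from part 1; the variable names are likewise just for this sketch.

```matlab
% Hypothetical hits and at-bats, invented purely for illustration.
% Rows: batter A, batter B.
% Columns: vs. left-handed pitchers, vs. right-handed pitchers.
hits   = [ 4  25;
          35   2];
atBats = [ 10 100;
          100  10];

% Per-split averages: each cell is hits / at-bats for that split.
splitAvg = hits ./ atBats;

% Pooled averages: total hits / total at-bats for each batter.
overallAvg = sum(hits, 2) ./ sum(atBats, 2);

% A leads in both splits, yet trails once the splits are pooled.
aWinsBothSplits = all(splitAvg(1,:) > splitAvg(2,:));
aLosesOverall   = overallAvg(1) < overallAvg(2);

fprintf('A higher in both splits: %d\n', aWinsBothSplits);
fprintf('A lower overall:         %d\n', aLosesOverall);
```

In this sketch batter A hits .400 and .250 against B’s .350 and .200, but most of A’s at-bats fall in the weaker split, so A’s pooled average (29/110) lands below B’s (37/110).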

Here’s the table once more just for reference:

It was certainly an interesting example, but it was entirely made up. I just tweaked the numbers a bit to get the table to work. So I set my sights on finding some examples of my own based on REAL data.

### Attempt 1

Yahoo Sports provides some easily accessible baseball team split stats, e.g., home vs. road, pre-All-Star break vs. post-All-Star break, turf vs. grass, indoor vs. outdoor (http://yhoo.it/hKAiLn). I started by pulling the overall team batting averages and then their averages split by home games versus road games (for the 2010 season). By pasting into Excel and sorting, then manually checking for a good 15 minutes, I was able to establish that there was no example of Simpson’s paradox in this data set. Bummer.

### Attempt 2

This process had to be automated because there was no way I was going to spend 15+ minutes checking every split, every season, until I found an example. While this could be done fairly easily with a VBA script (i.e., a macro) in Excel, I decided to move over to MATLAB simply because I’m better versed in scripting there. Algorithmically speaking, it really is a piece of cake to automate.

NOTE: If the details of the script don’t interest you, you can skip down to the results. And yes, I found some examples. WOOHOOO!

First, you need to get your data into the program. In Excel, I had a table whose first column held the team names, whose second column held the overall batting average, and whose third and fourth columns held the split stats (i.e., team batting averages for road vs. home). To move this into MATLAB, I created a cell array called *teams* to hold the team names.

teams = {}

Then, after opening the teams variable in the variable editor window (double-click it in the Workspace), I pasted in the team names (right-click, “Paste Excel Data”). Next, I created a variable *x* and pasted in the three columns of batting average data.
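For anyone who would rather type the data than paste it, the layout described above can be sketched as follows. The team names and numbers are placeholders, not real 2010 statistics, and the sanity checks are an optional addition for this sketch. The call to strtrim is there because the names in the results further down appear with a leading space, which suggests the pasted names carry one.

```matlab
% Placeholder names and numbers, NOT real 2010 statistics.
% One name per row; the leading space mimics pasted Excel data.
teams = {' Team A'; ' Team B'; ' Team C'};
teams = strtrim(teams);          % strip leading/trailing whitespace

% Column 1: overall average
% Column 2: first split  (e.g., road)
% Column 3: second split (e.g., home)
x = [0.262 0.270 0.254;
     0.258 0.265 0.251;
     0.271 0.280 0.262];

% Optional sanity checks before running any comparison.
assert(size(x, 2) == 3, 'x should have three columns');
assert(numel(teams) == size(x, 1), 'need one team name per row of x');
assert(all(x(:) >= 0 & x(:) <= 1), 'averages should lie between 0 and 1');
```

Keeping the names in a cell array and the numbers in a plain matrix means row i of *x* always belongs to teams{i}, which is all the pairwise test needs.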

At this point a very simple little script (simphunt.m) tests every pair of teams for Simpson’s paradox. The main step is to compare each pair’s overall averages (column 1 of *x*) with their split stats (columns 2 and 3). If one team leads in both splits but trails overall, then we have an example.

if (x(i,1)>x(j,1) && x(i,2)<x(j,2) && x(i,3)<x(j,3)) || ...
   (x(i,1)<x(j,1) && x(i,2)>x(j,2) && x(i,3)>x(j,3))
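Wrapped in a loop over every pair of teams, that test might look like the sketch below. This is a reconstruction under assumptions, not the simphunt.m script itself: the placeholder data, the helper variable names, and the exact output formatting are hypothetical. Only the comparison and the column layout of *x* (overall, then the two splits) come from the description above.

```matlab
% Placeholder data, NOT real statistics. In the post, teams and x
% come from the columns pasted in from Excel.
teams = {'Team A'; 'Team B'; 'Team C'};
x = [0.243 0.260 0.240;    % columns: overall, split 1, split 2
     0.247 0.250 0.230;
     0.230 0.240 0.220];

n = size(x, 1);
pairs = zeros(0, 2);       % each row will hold the indices of one paradox pair

% Visit every unordered pair (i, j) exactly once.
for i = 1:n-1
    for j = i+1:n
        iWinsOverall = x(i,1) > x(j,1);
        jWinsOverall = x(i,1) < x(j,1);
        iWinsSplits  = x(i,2) > x(j,2) && x(i,3) > x(j,3);
        jWinsSplits  = x(i,2) < x(j,2) && x(i,3) < x(j,3);

        % Paradox: one team leads overall, the other leads in both splits.
        if (iWinsOverall && jWinsSplits) || (jWinsOverall && iWinsSplits)
            pairs(end+1, :) = [i j];
        end
    end
end

if isempty(pairs)
    fprintf('There are no cases of Simpson''s Paradox\n');
else
    fprintf('There are %d cases of Simpson''s Paradox\n', size(pairs, 1));
    for k = 1:size(pairs, 1)
        fprintf('%s    %s\n', teams{pairs(k,1)}, teams{pairs(k,2)});
    end
end
```

With the placeholder numbers, only Team A and Team B form a paradox pair: A leads in both splits but trails overall. Because every comparison is strict, a tie in any column disqualifies the pair.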

### Results

The first split I tested was, again, **home vs. road** (to which I already knew the answer):

There are no cases of Simpson’s Paradox

The next was **day vs. night**. Disappointed again.

There are no cases of Simpson’s Paradox

Fortunately, it was only taking a minute or so to fetch the data and move it into MATLAB. Nevertheless, I was still feeling a little disheartened. I was beginning to expect to run through dozens of splits, maybe from multiple seasons, or even moving on to player vs. player stats instead of teams.

Then I went with **indoor vs. outdoor**. JACKPOT!!!

There are 5 cases of Simpson’s Paradox

' Chicago Cubs' ' Oakland Athletics'

' Cleveland Indians' ' Tampa Bay Rays'

' Colorado Rockies' ' Milwaukee Brewers'

' Los Angeles Angels' ' Tampa Bay Rays'

' New York Mets' ' Toronto Blue Jays'

Five Cases!!! Looking back at the data we can confirm:

I could clearly keep hunting but I’ve satisfied my desire for finding my own examples. Next time I bring up Simpson’s paradox in any of my classes, I’ll start with my manufactured example but quickly move on to these REAL examples.

**Read More…** Find out more about Simpson’s Paradox at the following links:

*Simpson’s paradox – Wikipedia, the free encyclopedia*

*http://www.furtadoworld.com/handouts/SimpsonParadox.pdf*

*When Combined Data Reveal the Flaw of Averages*

What algorithm did you use in MATLAB for this (Simpson’s Paradox)? Could you send it to me?

There is a link to the code in the blog post; look for simphunt.m.