Data Science Tutorials, but Sometimes Other Stuff

Featured

About Me

Hi! My name is Garrett DuCharme. I’m currently a student studying data science at the Flatiron school in Chicago. In another life I was a physics graduate student at Penn State University, where I earned my master’s degree. Outside of school I enjoy playing the tuba, powerlifting, and hanging out with my cat, Asher.

This blog will primarily be focused on data science tutorials so that I can practice my own writing skills and give back to the community as a whole as an educator. But as the title implies, I’ll sometimes write about other stuff related to my life outside of data science, and just whatever else I find interesting.

Thanks for reading!

Automating Auto-SARIMA for Large-Scale Time Series Forecasting

For a recent project as part of my bootcamp at The Flatiron School, I wanted to explore the power of SARIMA forecasting for predicting important climate metrics provided by the NOAA, such as average temperature and precipitation. Given the limited time that I had for this project, I was only able to really focus on predictions for the state of California. My original idea was to apply these forecasting methods to all 48 contiguous states for all of the metrics that were available. In this blog post, I’ll briefly discuss the fundamentals for determining a good time series model within the context of our problem. Then, I’ll jump into how we can use this fundamental intuition in order to increase the efficiency of applying auto-ARIMA to our entire dataset.

Problem Statement

The problem we are trying to solve here is how we accurately we can forecast changes in key climate predictors such as temperature and precipitation. In a (perhaps not so hypothetical) scenario where the status quo surrounding climate policy doesn’t change, we need to know which areas of the country will need the biggest allocations of federal relief aid. This problem can be extended to focus on the allocating aid to individual counties from the state level, but this adds another layer of complexity. For now, we’ll just focus on the 48 contiguous states.

Data Visualizations

The metrics available from the climate at a glance database are average temperature, maximum temperature, minimum temperature, cooling degree days, heating degree days, precipitation, and then the pdsi, phdi, and pmdi metrics for drought. Let’s just take California as a first example and see what our data looks like.

The first thing we can see is that for the temperature and precipitation data, there is a clear yearly seasonality. However, besides the small differences in the peaks and troughs, we can’t easily any trends. For the drought indicators, there are many changes in trend with no visible yearly seasonality. In order to get a sense of the trend, we’ll take a 12 month moving average:

This moving average has effectively smoothed out all of the noise in our drought indicators. There is noise remaining in the the temperature and precipitation series, but the trend is much more visible. Let’s now look at the moving averages for Illinois to see if we can spot any differences:

The story is the same for the general changes to the plots, but the trends in Illinois appear different. The average trend from 2010 to 2019 for Illinois temperature data is roughly zero, but for these same years for California it is positive. This already indicates that our SARIMA model are likely going to have very different parameters across all of the states.

I mentioned the years 2010 and onwards because this is likely the range that we’ll need to use for fitting and validating our models. The reason for this is that SARIMA is not robust against sudden changes in trends. In other words, the best way to predict what’s going to happen in the future is to use mostly recent data. However, we still can’t use too few data points. In a sense this choice of when to start the forecasting is arbitrary. I picked 2010 in this case because it seems to be the start of a new trend, and because it’s relatively close to the current date while still providing us with enough monthly predictions to fit a model.

Choosing the Order of Differencing

When applying the SARIMA model, one of the requirements is that the time series must be able to be made stationary through differencing. This is especially important when considering the time range that you’re going to be fitting your model for. If your time series has seasonal changes that dwarf the moving average trends over time, such as in the case for our average temperatures, it is likely that it would appear to be stationary with a Dickey-Fuller test without applying any order of differencing. However, if we constrain ourselves to the last 10 years, this may no longer be the case.

We already know that we’ll need 1 order of seasonal differencing, so let’s apply that and then run Dickey-Fuller tests to see which data may still require monthly differencing in order to be made stationary. The following code accomplishes this task:

The entries in the above dictionary are how many of the time series weren’t stationary after 1 order of seasonal differencing. We see right away that the majority of our temperature and precipitation data ends up being stationary, while most of the drought data does not. This makes sense, as we didn’t observe any obvious seasonality in the drought data when we were visualizing the time series.

In the course of researching more for this project, I actually learned that the drought indicators are already bases on temperature and precipitation data. Because of this, it doesn’t really make sense to try and forecast their behavior. For the rest of the blog, we’ll just be focusing on the temperature and precipitation data.

Implementing Auto-SARIMA

We’ll be implementing Auto-SARIMA with the python pmdarima library. The idea will be to create training and test sets for each of our states. Then, the auto-SARIMA algorithm will be applied to each of the training sets and validated on each of our test sets. We won’t be able to compare across metrics with different units, but this can certainly give us an idea about which temperature metrics are the easiest to predict. The models will be selected based on whichever has the lowest AIC, but in order to compare results for different metrics and across all states we will be using the mean squared error. We’ll also want to make sure that we save all of our forecasts, as well as important information such as the orders, seasonal orders, and AIC and BIC scores. Here is the implementation:

In the above code, we’ve drastically reduced the fitting time by constraining the seasonal order to D = 1. In addition, we say that d can be no greater than 1 in order to avoid issues with overdifferencing. I’ve started all AR and MA terms at zero and providing maximum values for them as well. This is done both to reduce computational time and to avoid overfitting.

First Results

As a first result, let’s just take a look at the MSE for the test data.

From this, we see that the average temperature gives us the most accurate predictions. While the minimum temperature is comparable, the spread in the errors is larger. The maximum temperature has the largest average error overall, and additionally has some particularly bad outliers.

Future Work

The results we currently have are just a starting point for this project. We’ll want to also implement some methods that can make visualizing the forecasts for the large errors easier in order to determine sticking points for these types of models. Additionally, we can apply other time series forecasting methods, such as Facebook’s Prophet, in order to compare things such as run time and accuracy. Finally, it would be beneficial to implement parallelization in order to further reduce the computational time.

What is “Statistically Significant” Anyways?

Anyone who has ever done work in the biological sciences, demography, sociology, psychology, and now, data science, knows what a p-value is. It’s the first thing people usually reference when showing that an observed result is “statistically significant” when performing Null Hypothesis Significance Testing (NHST). The problem, though, is that p-values are widely misunderstood are unfortunately used to draw erroneous conclusions, especially in the era of big data. Because I’m a budding data scientist, and also because my pedantry knows no bounds, I wanted to make sure that I really understood the proper ways to perform hypothesis tests and correctly interpret the all-so holy p-values that are used to draw conclusions about them. I ended up coming across a paper titled “P-Values: Misunderstood and Misused” and it really challenged my very elementary understanding of statistical methods and interpretation. Throughout the rest of this blog post, I’ll be summarizing the key takeaways from this paper and how I interpret and understand them.

Paper Review

1. Introduction

Thankfully, the authors don’t waste any time and get right to the motivation of their manuscript. They quickly identify that some of the key issues surrounding p-values are that (i) they’ve become increasingly prevalent, often among inexperienced practitioners, in the big data era. And (ii) many top journals require p-values which require statistical significance (there are those words again) in order for a study to be accepted for publication. The problem here is that even when p-values are used correctly, they are often an ineffective metric. A p-value below a significance threshold of α = 0.05 can still mean that the study in question has a false detection rate (FDR) of around 30%. This means that of all the results labelled “positive,” the conclusion is incorrect 30% of the time (more on this later) due to various research constraints. This already sounds bad!

They go on to reference a neat paper called “The Earth is Round (p<0.05)” which is highly critical of NHST. I didn’t dig into this too much , but the crux of the argument is that it is very easy for one to define and subsequently disprove a “nil” hypothesis in which there is no effect size difference between two samples. Contrast this with a true “null” hypothesis where the effect size (or perhaps a range of effect sizes) is clearly defined. Not only is the former statement mathematically weaker, it is also lacking in the sense that it does not imply anything about causality. In fact, given enough data and enough hidden, underlying features, one is almost certainly going to always reject a nil hypothesis. This is again a paramount issue with big data where these kinds of scenarios are plausible. How can it ever be interesting to show that there is a purely statistical difference between two samples without knowing what the actual cause is? The authors make a comment here about how one of the problems regarding this issue is how the language is perceived. When one rejects a nil hypothesis, the generally accepted terminology that one gets to invoke is that there result is statistically significant. This sounds good, but as we’ve just argued, it’s really not!

There’s also some background here about how Fisher’s original idea for hypothesis testing differs from that of Neyman and Pearson. I’ll leave the details to the reader, but the main point is that Fisher originally intended hypothesis testing to be a method which concludes that more studies need to be done. This is in line with traditional scientific thinking, where vast amounts of insurmountable evidence are accumulated carefully over time in order to make a compelling case. He most definitely did not intend for NHST to be used as something which allows a researcher to claim evidence based on a single study.

2. The False Discovery Rate

The point of this section is to dispel myths about the meaning of the significance level and how it relates to errors. The significance level α is equivalent to the type I error rate, but it is not the rate at which errors occur. The traditional error rate that people generally think of is actually defined as

$FDR = \frac{FP}{TP + FP}$ .

Or, in terms of “prevalence” and statistical power:

$FDR = \frac{(1-prevalence)*type\, I\, error\, rate}{prevalence*power + (1-prevalence)*type\, I\, error\, rate}$ .

Prevalence here means the number of effects that actually exist in the real world out of all the effect that are tested for. For example, if we look at 20 different tumors and it turns out that 10 of them are actually cancerous, the prevalence would be 0.50. Of course, the prevalence is hard to accurately measure in practice. With all of this in mind, the authors show the following table to highlight just how bad the FDR can be:

Because the prevalence is not something that one generally knows a priori, calculations of the FDR are actually difficult to make in practice. The point here though is that even with some generous estimations, the FDR is generally much higher than what one traditionally thinks of as the true “error rate.” This also assumes that researchers have properly set up their experiments and haven’t manipulated the data in ways which allow them to achieve statistical significance. With this in mind, the FDR is probably higher than one might initially think.

3. Prevalence and Bayes

This section was a bit hard for me to properly understand without any prior knowledge of Bayesian statistics. From my understanding, the point here is that FDR is something which is estimated based on the quantification of a researchers subjective beliefs, and this makes it inherently Bayesian. Traditional statistics, also known as frequentism, focus on calculating the probability of detecting something given the null hypothesis:

$P(D|H_{o})$ .

A Bayesian approach focuses on calculating the alternative hypothesis given a detection:

$P(H_{1}|D)$ .

The Bayesian approach is said to be subjective because it requires the researcher to make a decision about the alternative hypothesis based on how likely they think the data is true. To reiterate, FDR calculations are Bayesian because they assign a prior probability (1-prevalence) before actually obtaining data from the experiment. The authors conclude this section by saying that the criticisms against Bayesian methods like this are valid, but unwarranted because there is really no better way to get a good estimate of the FDR.

4. Publication Bias: Selective Reporting and P-Hacking

This section isn’t too interesting when considered outside of an academic context, but there are still a few good points to consider. Selective reporting is where methodologically robust but statistically insignificant results are not reported because top journals do not consider them to be interesting. P-hacking, which could be a considerable problem in the data science world, is when researchers consciously select data and/or statistical tests which make their results go from being insignificant to significant. This direct manipulation of results leads to false reporting, which is perfectly exemplified in this xkcd comic. Directly manipulating data in this way may not give you a result which is technically false, but it obviously leads to conclusions which are not representative of the underlying phenomena at play. This is usually done by (i) performing statistical testing before an experiment reaches conclusion, (ii) selecting specific response variables to skew the narrative, (iii) excluding or combining treatment and control groups, (iv) including or excluding important covariates, or (v) stopping analysis as soon as a significant p-value is achieved.

The authors also reference an interesting simulation on the Fivethirtyeight website which lets you p-hack an analysis which shows that a good economy can be positively or negatively correlated with whichever political party currently has the majority. Lastly, the authors reference a meta-analysis which shows that a considerable number of studies report p-values just below a significance level of 0.05, suggesting that p-hacking is likely very prolific in academic studies.

5. What to do?

The authors wrap up this paper by going back to Bayesian statistics and saying that while it is useful for specific cases, it should not replace the frequentist interpretation as the dominant paradigm. Being more mindful of Bayesian approaches, however, can allow one to lower FDR in a way that makes findings more credible at various significance levels. Other suggestions for improving FDR are to more rigorously control for the type I error rate and the statistical power. They recommend α = 0.01, or even 0.001. Suggestions from other researchers include to instead report with confidence intervals, or to instead focus on effect size as a metric instead of statistical significance. The argument here is that statistically significant results with small effect sizes should be treated with great skepticism, especially within the context of big data.

The main conclusion, however, is that research processes need to be recorded more transparently. I really like the quote that the authors reference here:

“One of the strongest protections for scientists is to admit everything.”
Regina Nuzzo

It may also be useful to better categorize studies, giving them labels such as “exploratory” or “confirmatory” so that readers have a better sense about how to interpret the results. Also of importance is transparency of data beyond summary statistics, especially when it comes to big data sets, given that exploratory data analysis and p-hacking sometimes ride the same line.

The paper wraps up by going back to Fisher’s original idea for NHST and p-values, and encourages the reader to be mindful of the broader context of study when confirming or rejecting a null hypothesis. This is why meta-analyses can be so valuable. If multiple studies can be brought together and find an effect to be significant, then it is more likely to truly exist.

Takeaways for Data Science

While reading this paper I was mostly thinking about the sad state of the academic world and scientific publishing. It is unfortunate that so many researchers feel the need to implement bias and hack their own results to achieve statistical significance. However, it doesn’t actually surprise me too much with an academic culture which hinges success on a “publish or perish” kind of lifestyle.

With that aside, I think there are some very interesting points to be made here for both aspiring and practicing data scientists. The first thing that comes to mind is that it’s incredibly important to carefully form a hypothesis when solving a business problem. As data scientists, we perhaps aren’t best equipped to formulate this hypothesis all on our own. The intelligent decision (in my opinion) would be to consult with colleagues in other departments to carefully construct meaningful hypotheses. What we do with NHST as data scientists isn’t magical; I don’t think it’s irrational to explain the basics of these concepts to a non-technical audience that may have a deeper business understanding than we do. This also falls directly in line with finding ways to estimate prevalence in order to calculate FDR prior to actually carrying out a study or survey.

Lastly, I think it is especially important to be mindful of the interpretations presented here when trying to solve a business problem with data science and NHST. Meta-analyses should be conducted whenever possible in order to compare internal data with publicly available data. Additionally, any conclusions need to be placed within the broader context of what a team is trying to accomplish. This strikes me as being especially important when conducting tests that naturally have low statistical power, and thus a high FDR. However, the feasibility of this seems questionable given the lack of publicly available data in the business world.

Getting Started with Scikit-learn’s Toy Datasets

Introduction

Getting started with data science can be a little intimidating at first. Luckily, one of the most popular python data science packages, scikit-learn (or sklearn), comes with a couple of "toy" datasets that can help you get your feet wet right out of the gate. In this blog post we’ll be discussing what you need to do take a look at these datasets and hit the ground running. The methods we’ll go over are relatively basic, but serve as a fundamental framework for a working data scientist’s toolkit.

Prerequisites

For this tutorial we’ll be using python and a few of the libraries from the SciPy ecosystem. If you’ve never used python before and/or have no idea what SciPy is, don’t worry! One great way to get everything you need installed all at once is to download the Anaconda distribution. It’s free and provides a platform for working with both python and R while providing the framework for easy package management across Windows, macOS, and Linux. If you don’t have it installed yet, make sure to make your way to the download page and get everything set up before reading on.

Importing Libraries

After getting everything installed with Anaconda, I recommend opening the Anaconda Navigator and pulling up either Jupyter Notebook or Jupyter Lab. Both of these tools provide a notebook workflow which allows you to organize code into cells that can immediately display an output when run. The first thing we’ll need to do when you get there is import the libraries necessary for loading and working with our data.

import seaborn as sns
import pandas as pd
from sklearn import datasets

The first library, seaborn, is what we’ll be using to visualize our data. It provides the capabilities to make a variety of beautiful looking plots in only a few lines of code. The next library, pandas, is what we’ll be using to view our data and get some basic information about the different features within it. Our last library, sklearn, is primarily used for preparing data for machine learning applications. For now we’ll just be using it to obtain the toy dataset that we’ll be working with.

Importing the Dataset

Now that we have all of our packages imported, we can get to the data. You can see all datasets that sklearn comes with by referring to the documentation. For this tutorial we’ll be working with the breast cancer dataset. This dataset contains measurements and characteristics of breast tumors along with an indicator as to whether or not they ended up being malignant. The basic goal when building a model from data like this is to see if you can accurately predict whether or not a new patient’s tumor will end up being malignant or not. This is known as a classification problem in the data science and machine learning model, but we won’t be talking about that too much in this tutorial. To load it into python, all you have to do is execute the following code:

data = datasets.load_breast_cancer()

Sklearn treats these objects just like dictionaries, one of python’s built-in datatypes. To see all of the values associated with it, just run

data.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

The ‘DESCR’ key in this dictionary contains a description about what our data actually is, where it comes from, its different features, etc. Let’s take a look at it now:

print(data['DESCR'])


.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign

    :Summary Statistics:

    ===================================== ====== ======
                                           Min    Max
    ===================================== ====== ======
    radius (mean):                        6.981  28.11
    texture (mean):                       9.71   39.28
    perimeter (mean):                     43.79  188.5
    area (mean):                          143.5  2501.0
    smoothness (mean):                    0.053  0.163
    compactness (mean):                   0.019  0.345
    concavity (mean):                     0.0    0.427
    concave points (mean):                0.0    0.201
    symmetry (mean):                      0.106  0.304
    fractal dimension (mean):             0.05   0.097
    radius (standard error):              0.112  2.873
    texture (standard error):             0.36   4.885
    perimeter (standard error):           0.757  21.98
    area (standard error):                6.802  542.2
    smoothness (standard error):          0.002  0.031
    compactness (standard error):         0.002  0.135
    concavity (standard error):           0.0    0.396
    concave points (standard error):      0.0    0.053
    symmetry (standard error):            0.008  0.079
    fractal dimension (standard error):   0.001  0.03
    radius (worst):                       7.93   36.04
    texture (worst):                      12.02  49.54
    perimeter (worst):                    50.41  251.2
    area (worst):                         185.2  4254.0
    smoothness (worst):                   0.071  0.223
    compactness (worst):                  0.027  1.058
    concavity (worst):                    0.0    1.252
    concave points (worst):               0.0    0.291
    symmetry (worst):                     0.156  0.664
    fractal dimension (worst):            0.055  0.208
    ===================================== ====== ======

    :Missing Attribute Values: None

    :Class Distribution: 212 - Malignant, 357 - Benign
...

We can then access the data itself by running

data['data']

and

data['target']

The ‘data’ key contains all of the various features of the tumors themselves, while target contains a binary value indicating whether or not the tumor was malignant. Working with the data through dictionaries alone can get annoying really quickly. So to make our lives easier, we can throw everything into a Pandas DataFrame. I’ve written a small function that will automate this process for any of the toy sklearn datasets:

def sklearn_to_df(sklearn_dataset):
    df = pd.DataFrame(sklearn_dataset['data'], columns=sklearn_dataset['feature_names'])
    df['target'] = pd.Series(sklearn_dataset['target'])
    return df

To use this function and get our dataframe, simply do the following:

cancer_df = sklearn_to_df(datasets.load_breast_cancer())

A First Look at the Data

Now that we have our data loaded, let’s just start looking at it! Pandas comes equipped with a few methods designed for just this purpose. The ones we’ll be using are .head(), .info(), and .describe().

cancer_df.head()

Basically all this has done is output the first 5 rows of our data. It’s one of the first things you should do when working with any new dataset just to get a feel for what it looks like. We can get information about what the datatypes within the columns themselves and if there are missing values that we should worry about by doing

cancer_df.info()

From this we can see right away that we have 569 total rows in our dataset. We also see that each of the columns contains 569 non-null entries, which means that we don’t have any obvious missing values that we need to worry about. This is usually the case for toy datasets like this, but you should never expect it in the real world! Real data is often messy, and figuring out how to deal with it is an important aspect of a data scientist’s work. The next thing to notice are the datatypes on the far right. Everything except for our target is a float, or a numerical value with decimal places. The target, which we already know is a binary value since this is a classification value, is stored as an integer. The next obvious thing to do is to get some basic statistics about each of these columns. We can do this is in one line by simply running

cancer_df.describe()

Amazing! With only one line of code pandas has given us a wealth of valuable statistics. We immediately know the means, standard deviations, minimums, maximums, and the quartile values for all of the tumor features. This is just one of many examples of how powerful this library can be for working with data. If we want to, we can also get individual statistics for the different features by referencing them directly and using some of the statistics methods that are built into Pandas. For classification problems, it’s important that our target is "balanced," meaning that we should have just as many zeros as we do ones. We won’t get into why this is important right now, but let’s go ahead and check to see how balanced this target data is.

cancer_df['target'].mean()

0.6274165202108963

Because the mean here is greater than 0.5, that means that the majority of the tumors in this dataset are malignant. With a value of around 0.63, the target isn’t terribly unbalanced, but it’s certainly something that a data scientist needs to take into account when creating a model.

Visualization with Seaborn

We now have a pretty good idea of what our data looks like in a numerical sense, but we also need a way to present it to others. As I mentioned earlier, the seaborn library makes this incredibly easy to do, especially since it’s built to be used specifically with pandas dataframes. If we want to get a visualization of our summary statistics for the .describe() method above, we can use a box plot. To get this with seaborn, all you have to do is run

sns.boxplot('mean radius', data = cancer_df)

That’s it! This is the most basic version of a seaborn box plot. There are a variety of other things that you can do to it through the various arguments that the box plot method accepts. I encourage you to take a look at these in the documentation and play around with it to see what you can come up with.

Next, let’s try to make a scatterplot. These can be invaluable when modeling because they can show correlations between different features. If there are correlations between the different features in the dataset, that indicates that one of them can be thrown out in order to make the model more simple. Again, we won’t talk about why this is necessary here, so let’s just take a look at how to do it. We’ll look at correlations between the measured radius of a tumor and its area. This is a rigged example since basic geometry tells us that there should be a squared relationship between these two variables. Let’s confirm this:

sns.scatterplot('mean radius', 'mean area', data = cancer_df)

This looks pretty parabolic to me! We could use some of seaborn’s other built in methods to confirm this, but I’ll leave that to another tutorial.

Summary

In this tutorial, we looked at one of the toy datasets provided through the sklearn library. With a relatively basic function, we were able to load this data into a pandas dataframe and utilize some of the powerful pandas methods to make understanding our data easy. Combining this with seaborn gave us the tools necessary to get quick and nice looking visuals. Everything we’ve done here only took 19 total lines of code. This may all seem very basic, but this is the kind of work that a data scientist depends on every day. In a future post, I’ll talk about how we can further utilize what we’ve done here to aid in the development of a classification model to predict whether or not the tumors in this dataset are malignant or not.