Getting Started with Scikit-learn’s Toy Datasets

Introduction

Getting started with data science can be a little intimidating at first. Luckily, one of the most popular Python data science packages, scikit-learn (or sklearn), comes with a handful of "toy" datasets that can help you get your feet wet right out of the gate. In this blog post we’ll be discussing what you need to do to take a look at these datasets and hit the ground running. The methods we’ll go over are relatively basic, but they form a fundamental part of a working data scientist’s toolkit.

Prerequisites

For this tutorial we’ll be using Python and a few of the libraries from the SciPy ecosystem. If you’ve never used Python before and/or have no idea what SciPy is, don’t worry! One great way to get everything you need installed all at once is to download the Anaconda distribution. It’s free, supports both Python and R, and makes package management easy across Windows, macOS, and Linux. If you don’t have it installed yet, head to the download page and get everything set up before reading on.

Importing Libraries

After getting everything installed with Anaconda, I recommend opening the Anaconda Navigator and pulling up either Jupyter Notebook or Jupyter Lab. Both of these tools provide a notebook workflow which allows you to organize code into cells that can immediately display an output when run. The first thing we’ll need to do when you get there is import the libraries necessary for loading and working with our data.

import seaborn as sns
import pandas as pd
from sklearn import datasets

The first library, seaborn, is what we’ll be using to visualize our data. It lets you make a variety of good-looking plots in only a few lines of code. The next library, pandas, is what we’ll be using to view our data and get some basic information about the different features within it. Our last library, sklearn, is a machine learning library that provides tools for preparing data and for building and evaluating models. For now we’ll just be using it to obtain the toy dataset that we’ll be working with.

Importing the Dataset

Now that we have all of our packages imported, we can get to the data. You can see all of the datasets that sklearn comes with by referring to the documentation. For this tutorial we’ll be working with the breast cancer dataset. This dataset contains measurements and characteristics of breast tumors along with an indicator of whether each one turned out to be malignant. The basic goal when building a model from data like this is to see if you can accurately predict whether a new patient’s tumor is malignant. This is known as a classification problem in the data science and machine learning world, but we won’t be talking about that too much in this tutorial. To load the dataset into Python, all you have to do is execute the following code:

data = datasets.load_breast_cancer()

The object that sklearn returns (a Bunch) behaves much like a dictionary, one of Python’s built-in datatypes. To see all of the keys it contains, just run

data.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
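Because the returned object is dictionary-like, each key can be read with square brackets, and the same entries are also available as attributes. The short sketch below reloads the dataset so it runs on its own and shows that both styles point at the same underlying array. Depending on your sklearn version, the list of keys may include a few more entries than the ones shown above.

```python
from sklearn import datasets

data = datasets.load_breast_cancer()

# Key-style access, as used throughout this post
print(data['target_names'])

# Attribute-style access to the same entry
print(data.target_names)

# Both styles return the very same array object, not a copy
print(data['data'] is data.data)
```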

The ‘DESCR’ key in this dictionary contains a description about what our data actually is, where it comes from, its different features, etc. Let’s take a look at it now:

print(data['DESCR'])


.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign

    :Summary Statistics:

    ===================================== ====== ======
                                           Min    Max
    ===================================== ====== ======
    radius (mean):                        6.981  28.11
    texture (mean):                       9.71   39.28
    perimeter (mean):                     43.79  188.5
    area (mean):                          143.5  2501.0
    smoothness (mean):                    0.053  0.163
    compactness (mean):                   0.019  0.345
    concavity (mean):                     0.0    0.427
    concave points (mean):                0.0    0.201
    symmetry (mean):                      0.106  0.304
    fractal dimension (mean):             0.05   0.097
    radius (standard error):              0.112  2.873
    texture (standard error):             0.36   4.885
    perimeter (standard error):           0.757  21.98
    area (standard error):                6.802  542.2
    smoothness (standard error):          0.002  0.031
    compactness (standard error):         0.002  0.135
    concavity (standard error):           0.0    0.396
    concave points (standard error):      0.0    0.053
    symmetry (standard error):            0.008  0.079
    fractal dimension (standard error):   0.001  0.03
    radius (worst):                       7.93   36.04
    texture (worst):                      12.02  49.54
    perimeter (worst):                    50.41  251.2
    area (worst):                         185.2  4254.0
    smoothness (worst):                   0.071  0.223
    compactness (worst):                  0.027  1.058
    concavity (worst):                    0.0    1.252
    concave points (worst):               0.0    0.291
    symmetry (worst):                     0.156  0.664
    fractal dimension (worst):            0.055  0.208
    ===================================== ====== ======

    :Missing Attribute Values: None

    :Class Distribution: 212 - Malignant, 357 - Benign
...

We can then access the data itself by running

data['data']

and

data['target']
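Printing the raw arrays gives a wall of numbers, so it can help to check their shapes first. This is a minimal sketch that reloads the dataset so it stands alone; the variable names X and y are just a common convention, not something the dataset requires.

```python
from sklearn import datasets

data = datasets.load_breast_cancer()
X, y = data['data'], data['target']

print(X.shape)  # (number of tumors, number of features)
print(y.shape)  # one label per tumor

# Pair each integer label with the class name it stands for
print(dict(enumerate(data['target_names'])))
```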

The ‘data’ key contains all of the various features of the tumors themselves, while ‘target’ contains a binary value indicating the diagnosis. In this dataset the labels follow the order of ‘target_names’, so 0 means malignant and 1 means benign. Working with the data through dictionaries alone can get annoying really quickly. So to make our lives easier, we can throw everything into a pandas DataFrame. I’ve written a small function that will automate this process for any of the toy sklearn datasets:

def sklearn_to_df(sklearn_dataset):
    df = pd.DataFrame(sklearn_dataset['data'], columns=sklearn_dataset['feature_names'])
    df['target'] = pd.Series(sklearn_dataset['target'])
    return df

To use this function and get our dataframe, simply do the following:

cancer_df = sklearn_to_df(datasets.load_breast_cancer())
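If you’re on a reasonably recent version of sklearn (0.23 or later, as far as I know), the loader can also build the DataFrame for you through its as_frame argument. This is an alternative to the helper function above, not a replacement for it, and the variable names here are my own.

```python
from sklearn import datasets

# as_frame=True asks the loader to return pandas objects
bunch = datasets.load_breast_cancer(as_frame=True)

# .frame holds the feature columns plus a 'target' column
cancer_frame = bunch.frame

print(cancer_frame.shape)
print(cancer_frame.columns[-1])
```

The helper function is still handy if you need to support older sklearn versions or want control over how the target column is named.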

A First Look at the Data

Now that we have our data loaded, let’s just start looking at it! Pandas comes equipped with a few methods designed for just this purpose. The ones we’ll be using are .head(), .info(), and .describe().

cancer_df.head()

All this has done is output the first 5 rows of our data. It’s one of the first things you should do when working with any new dataset, just to get a feel for what it looks like. We can find out what datatypes the columns hold, and whether there are missing values that we should worry about, by running

cancer_df.info()

From this we can see right away that we have 569 total rows in our dataset. We also see that each of the columns contains 569 non-null entries, which means that we don’t have any obvious missing values that we need to worry about. This is usually the case for toy datasets like this, but you should never expect it in the real world! Real data is often messy, and figuring out how to deal with it is an important aspect of a data scientist’s work. The next thing to notice is the column of datatypes on the far right. Everything except for our target is a float, or a numerical value with decimal places. The target, which we already know is a binary value since this is a classification problem, is stored as an integer. The next obvious thing to do is to get some basic statistics about each of these columns. We can do this in one line by simply running

cancer_df.describe()

Amazing! With only one line of code pandas has given us a wealth of valuable statistics. We immediately know the means, standard deviations, minimums, maximums, and the quartile values for all of the tumor features. This is just one of many examples of how powerful this library can be for working with data. If we want to, we can also get individual statistics for the different features by referencing them directly and using some of the statistics methods that are built into pandas. For classification problems, it helps if our target is reasonably "balanced," meaning that we have roughly as many zeros as we do ones. We won’t get into why this is important right now, but let’s go ahead and check to see how balanced this target data is.

cancer_df['target'].mean()

0.6274165202108963

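Because the mean here is greater than 0.5, the majority of the labels are 1. In this dataset 1 stands for benign and 0 for malignant, so most of the tumors are benign, which matches the class distribution in the description above (357 benign, 212 malignant). With a value of around 0.63, the target isn’t terribly unbalanced, but it’s certainly something that a data scientist needs to take into account when creating a model.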
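The mean is a handy shortcut, but counting the labels directly is easier to read. The sketch below reloads the dataset so it runs on its own, swaps the integer codes for the class names that ship with the dataset, and counts each class. The variable names are my own.

```python
import pandas as pd
from sklearn import datasets

raw = datasets.load_breast_cancer()
target = pd.Series(raw['target'], name='target')

# Replace each integer code with its class name (0 -> first name, 1 -> second)
labels = target.map(dict(enumerate(raw['target_names'])))

# Absolute counts per class
counts = labels.value_counts()
print(counts)

# The same information as fractions of the whole dataset
print(labels.value_counts(normalize=True).round(3))
```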

Visualization with Seaborn

We now have a pretty good idea of what our data looks like in a numerical sense, but we also need a way to present it to others. As I mentioned earlier, the seaborn library makes this incredibly easy to do, especially since it’s built to work with pandas dataframes. If we want a visualization of some of the summary statistics from the .describe() method above (the median, the quartiles, and the spread), we can use a box plot. To get this with seaborn, all you have to do is run

sns.boxplot(x='mean radius', data=cancer_df)

That’s it! This is the most basic version of a seaborn box plot. There are a variety of other things that you can do to it through the various arguments that the box plot method accepts. I encourage you to take a look at these in the documentation and play around with it to see what you can come up with.
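As one example of what those extra arguments can do, here is a sketch that draws a separate box for each class. It rebuilds the dataframe so that it runs on its own; the ‘diagnosis’ column and the plot title are names I made up for this illustration, not part of the dataset.

```python
import pandas as pd
import seaborn as sns
from sklearn import datasets

raw = datasets.load_breast_cancer()
plot_df = pd.DataFrame(raw['data'], columns=raw['feature_names'])

# 'diagnosis' is a hypothetical column holding readable class names
plot_df['diagnosis'] = pd.Series(raw['target']).map(
    dict(enumerate(raw['target_names']))
)

# One box per class on the x axis, mean radius on the y axis
ax = sns.boxplot(x='diagnosis', y='mean radius', data=plot_df)
ax.set_title('Mean radius by diagnosis')
```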

Next, let’s try to make a scatterplot. These can be invaluable when modeling because they can reveal relationships between different features. If two features are very strongly correlated, they carry largely redundant information, which suggests that one of them could be dropped to make the model simpler. Again, we won’t talk about why this matters here, so let’s just take a look at how to do it. We’ll look at the relationship between the measured radius of a tumor and its area. This is a rigged example, since basic geometry tells us that the area should grow roughly with the square of the radius. Let’s confirm this:

sns.scatterplot(x='mean radius', y='mean area', data=cancer_df)

This looks pretty parabolic to me! We could use some of seaborn’s other built in methods to confirm this, but I’ll leave that to another tutorial.
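If you’d like a quick numerical sanity check in the meantime, one rough approach (my own, using NumPy rather than seaborn) is to compare how strongly the area correlates with the radius versus the radius squared. If the relationship really is quadratic, the second correlation should sit at least as close to 1 as the first.

```python
import numpy as np
from sklearn import datasets

raw = datasets.load_breast_cancer()
features = list(raw['feature_names'])

# Pull the two columns out of the feature matrix by name
radius = raw['data'][:, features.index('mean radius')]
area = raw['data'][:, features.index('mean area')]

# Pearson correlation of area with radius, then with radius squared
r_linear = np.corrcoef(radius, area)[0, 1]
r_squared_term = np.corrcoef(radius ** 2, area)[0, 1]

print(round(r_linear, 3), round(r_squared_term, 3))
```

This is only a sanity check and not a proper model fit, but it gives a number to go along with the picture.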

Summary

In this tutorial, we looked at one of the toy datasets provided through the sklearn library. With a relatively basic function, we were able to load this data into a pandas dataframe and use some of the powerful pandas methods to make understanding our data easy. Combining this with seaborn gave us the tools necessary to get quick and nice-looking visuals. Everything we’ve done here only took 19 total lines of code. This may all seem very basic, but this is the kind of work that a data scientist depends on every day. In a future post, I’ll talk about how we can build on what we’ve done here to develop a classification model that predicts whether the tumors in this dataset are malignant.

Published by garrettlducharme

Data scientist and tuba enthusiast
