Introduction
Getting started with data science can be a little intimidating at first. Luckily, one of the most popular Python data science packages, scikit-learn (or sklearn), comes with a handful of "toy" datasets that can help you get your feet wet right out of the gate. In this blog post we’ll discuss what you need to do to take a look at these datasets and hit the ground running. The methods we’ll go over are relatively basic, but they are a fundamental part of a working data scientist’s toolkit.
Prerequisites
For this tutorial we’ll be using Python and a few of the libraries from the SciPy ecosystem. If you’ve never used Python before and/or have no idea what SciPy is, don’t worry! One great way to get everything you need installed all at once is to download the Anaconda distribution. It’s free, works with both Python and R, and makes package management easy across Windows, macOS, and Linux. If you don’t have it installed yet, head to the download page and get everything set up before reading on.
Importing Libraries
After getting everything installed with Anaconda, I recommend opening the Anaconda Navigator and pulling up either Jupyter Notebook or JupyterLab. Both of these tools provide a notebook workflow, which lets you organize code into cells that display their output immediately when run. The first thing to do once you’re there is import the libraries we need for loading and working with our data.
import seaborn as sns
import pandas as pd
from sklearn import datasets
The first library, seaborn, is what we’ll be using to visualize our data. It lets you make a variety of beautiful-looking plots in only a few lines of code. The next library, pandas, is what we’ll be using to view our data and get some basic information about the different features within it. Our last library, sklearn, is a machine learning library that covers everything from preparing data to training and evaluating models. For now we’ll just be using it to obtain the toy dataset that we’ll be working with.
Importing the Dataset
Now that we have all of our packages imported, we can get to the data. You can see all of the datasets that sklearn comes with by referring to the documentation. For this tutorial we’ll be working with the breast cancer dataset. This dataset contains measurements and characteristics of breast tumors along with an indicator of whether each one turned out to be malignant. The basic goal when building a model from data like this is to see if you can accurately predict whether a new patient’s tumor will turn out to be malignant. This is known as a classification problem in the data science and machine learning world, but we won’t be talking about that too much in this tutorial. To load the dataset into Python, all you have to do is execute the following code:
data = datasets.load_breast_cancer()
Sklearn returns the dataset as a Bunch object, which behaves just like a dictionary, one of Python’s built-in datatypes. To see all of the keys it contains, just run
data.keys()
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
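Since the returned object is a scikit-learn Bunch (a dictionary subclass), each key can also be read as an attribute. Here is a minimal sketch of both access styles; the variable name features is just a label I picked for illustration:

```python
from sklearn import datasets

data = datasets.load_breast_cancer()

# Dictionary-style and attribute-style access point at the same array.
features = data['data']
same_object = features is data.data

print(same_object)                      # True
print(features.shape)                   # (569, 30): 569 tumors, 30 features
print(data['target'].shape)             # (569,): one label per tumor
print(data['target_names'].tolist())    # ['malignant', 'benign']
```

Which style you use is mostly a matter of taste. The bracket form works for every key, so it is the one used in the rest of this post.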
The ‘DESCR’ key in this dictionary contains a description about what our data actually is, where it comes from, its different features, etc. Let’s take a look at it now:
print(data['DESCR'])
.. _breast_cancer_dataset:
Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------
**Data Set Characteristics:**
:Number of Instances: 569
:Number of Attributes: 30 numeric, predictive attributes and the class
:Attribute Information:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)
The mean, standard error, and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
- class:
- WDBC-Malignant
- WDBC-Benign
:Summary Statistics:
===================================== ====== ======
                                      Min    Max
===================================== ====== ======
radius (mean):                        6.981  28.11
texture (mean):                       9.71   39.28
perimeter (mean):                     43.79  188.5
area (mean):                          143.5  2501.0
smoothness (mean):                    0.053  0.163
compactness (mean):                   0.019  0.345
concavity (mean):                     0.0    0.427
concave points (mean):                0.0    0.201
symmetry (mean):                      0.106  0.304
fractal dimension (mean):             0.05   0.097
radius (standard error):              0.112  2.873
texture (standard error):             0.36   4.885
perimeter (standard error):           0.757  21.98
area (standard error):                6.802  542.2
smoothness (standard error):          0.002  0.031
compactness (standard error):         0.002  0.135
concavity (standard error):           0.0    0.396
concave points (standard error):      0.0    0.053
symmetry (standard error):            0.008  0.079
fractal dimension (standard error):   0.001  0.03
radius (worst):                       7.93   36.04
texture (worst):                      12.02  49.54
perimeter (worst):                    50.41  251.2
area (worst):                         185.2  4254.0
smoothness (worst):                   0.071  0.223
compactness (worst):                  0.027  1.058
concavity (worst):                    0.0    1.252
concave points (worst):               0.0    0.291
symmetry (worst):                     0.156  0.664
fractal dimension (worst):            0.055  0.208
===================================== ====== ======
:Missing Attribute Values: None
:Class Distribution: 212 - Malignant, 357 - Benign
...
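The description says that each of the ten measurements shows up three times: as a mean, a standard error, and a "worst" value. A quick way to see this in the actual column names is to group the entries under the 'feature_names' key. This is only a sketch, and the three list names are hypothetical helpers of mine:

```python
from sklearn import datasets

data = datasets.load_breast_cancer()
names = data['feature_names'].tolist()

# Group the column names by the statistic they describe.
mean_cols = [n for n in names if n.startswith('mean ')]
error_cols = [n for n in names if n.endswith(' error')]
worst_cols = [n for n in names if n.startswith('worst ')]

print(len(names))       # 30
print(names[:3])        # ['mean radius', 'mean texture', 'mean perimeter']
print(len(mean_cols), len(error_cols), len(worst_cols))   # 10 10 10
```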
We can then access the data itself by running
data['data']
and
data['target']
The ‘data’ key contains all of the various features of the tumors themselves, while ‘target’ contains a binary label for each tumor. In this dataset the labels follow the order of data['target_names'], so 0 means malignant and 1 means benign. Working with the data through dictionaries alone can get annoying really quickly. So to make our lives easier, we can throw everything into a pandas DataFrame. I’ve written a small function that will automate this process for any of the toy sklearn datasets:
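def sklearn_to_df(sklearn_dataset):
    # The feature matrix becomes the columns, labeled with the dataset's feature names
    df = pd.DataFrame(sklearn_dataset['data'], columns=sklearn_dataset['feature_names'])
    # Append the labels as one more column
    df['target'] = pd.Series(sklearn_dataset['target'])
    return df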
To use this function and get our dataframe, simply do the following:
cancer_df = sklearn_to_df(datasets.load_breast_cancer())
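If you are on scikit-learn 0.23 or newer, the loaders can also build the DataFrame for you through the as_frame argument. A short sketch of that alternative follows. The name cancer_df_alt is my own, and it should hold the same columns as the dataframe built by hand:

```python
from sklearn import datasets

# as_frame=True asks scikit-learn (0.23+) to return pandas objects.
bunch = datasets.load_breast_cancer(as_frame=True)

# .frame holds the 30 feature columns plus a final 'target' column.
cancer_df_alt = bunch.frame

print(cancer_df_alt.shape)          # (569, 31)
print(cancer_df_alt.columns[-1])    # target
```

The hand-written function is still worth keeping around, since it shows exactly what goes into the dataframe and works on older installs.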
A First Look at the Data
Now that we have our data loaded, let’s just start looking at it! Pandas comes equipped with a few methods designed for just this purpose. The ones we’ll be using are .head(), .info(), and .describe().
cancer_df.head()

Basically all this has done is output the first 5 rows of our data. It’s one of the first things you should do when working with any new dataset just to get a feel for what it looks like. We can find out what datatypes the columns hold, and whether there are missing values that we should worry about, by doing
cancer_df.info()

From this we can see right away that we have 569 total rows in our dataset. We also see that each of the columns contains 569 non-null entries, which means that we don’t have any obvious missing values that we need to worry about. This is usually the case for toy datasets like this, but you should never expect it in the real world! Real data is often messy, and figuring out how to deal with it is an important aspect of a data scientist’s work. The next thing to notice is the datatypes on the far right. Everything except for our target is a float, a numerical value with decimal places. The target, which we already know is a binary value since this is a classification problem, is stored as an integer. The next obvious thing to do is to get some basic statistics about each of these columns. We can do this in one line by simply running
cancer_df.describe()

Amazing! With only one line of code pandas has given us a wealth of valuable statistics. We immediately know the means, standard deviations, minimums, maximums, and the quartile values for all of the tumor features. This is just one of many examples of how powerful this library can be for working with data. If we want to, we can also get individual statistics for the different features by referencing them directly and using some of the statistics methods that are built into pandas. For classification problems, it’s important to know whether our target is "balanced," meaning that we have roughly as many zeros as we do ones. We won’t get into why this is important right now, but let’s go ahead and check to see how balanced this target data is.
cancer_df['target'].mean()
0.6274165202108963
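Because the mean here is greater than 0.5, the majority of the samples carry the label 1. In this dataset sklearn encodes malignant as 0 and benign as 1, so most of the tumors are benign, which matches the class distribution of 212 malignant and 357 benign listed in the description. With a value of around 0.63, the target isn’t terribly unbalanced, but it’s certainly something that a data scientist needs to take into account when creating a model.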
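The mean on its own doesn’t say which class a 1 stands for, so it can help to count the classes by name. Here is a small sketch that relies on data['target_names'] being ordered by label value; the names label_names, counts, and shares are mine:

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_breast_cancer()
target = pd.Series(data['target'], name='target')

# target_names is ordered by label value: 0 -> 'malignant', 1 -> 'benign'.
label_names = dict(enumerate(data['target_names'].tolist()))
named_target = target.map(label_names)

# Absolute counts per class, then the same thing as a fraction of all rows.
counts = named_target.value_counts()
shares = named_target.value_counts(normalize=True)

print(counts)
print(shares.round(3))
```

The counts should agree with the class distribution listed in the dataset description, and the share for label 1 is the same number that .mean() returned.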
Visualization with Seaborn
We now have a pretty good idea of what our data looks like in a numerical sense, but we also need a way to present it to others. As I mentioned earlier, the seaborn library makes this incredibly easy to do, especially since it’s built to work with pandas dataframes. If we want a visualization of the summary statistics from the .describe() method above, we can use a box plot. To get this with seaborn, all you have to do is run
sns.boxplot(x='mean radius', data=cancer_df)

That’s it! This is the most basic version of a seaborn box plot. There are a variety of other things that you can do to it through the various arguments that the box plot method accepts. I encourage you to take a look at these in the documentation and play around with it to see what you can come up with.
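To connect the picture back to the numbers, here is a rough sketch of the quantities a box plot is built from, computed directly with pandas. It assumes seaborn’s default whisker setting of 1.5 times the interquartile range, and the variable names are my own:

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_breast_cancer()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
radius = df['mean radius']

# The box spans the first to third quartile, with a line at the median.
q1, median, q3 = radius.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# With the default setting, points beyond 1.5 * IQR from the box
# are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = radius[(radius < lower_fence) | (radius > upper_fence)]

print(f"box: {q1:.2f} to {q3:.2f}, median {median:.2f}")
print(f"points outside the fences: {len(outliers)}")
```

The quartiles printed here are the same 25%, 50%, and 75% rows that .describe() reports for this column.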
Next, let’s try to make a scatterplot. These can be invaluable when modeling because they can show correlations between different features. If two features are strongly correlated, they carry much of the same information, which suggests that one of them could be dropped to make the model simpler. Again, we won’t talk about why this is necessary here, so let’s just take a look at how to do it. We’ll look at the relationship between the measured radius of a tumor and its area. This is a rigged example since basic geometry tells us that there should be a squared relationship between these two variables. Let’s confirm this:
sns.scatterplot(x='mean radius', y='mean area', data=cancer_df)

This looks pretty parabolic to me! We could use some of seaborn’s other built in methods to confirm this, but I’ll leave that to another tutorial.
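One rough numerical check that doesn’t need any extra plotting is to compare correlation coefficients with pandas. The sketch below is only a sanity check under the assumption that area grows with the square of the radius, not a proper curve fit, and the names linear_r and squared_r are mine:

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_breast_cancer()
df = pd.DataFrame(data['data'], columns=data['feature_names'])

# Pearson correlation measures how close the points are to a straight line.
linear_r = df['mean radius'].corr(df['mean area'])

# If area grows with radius squared, squaring the radius first
# should straighten the curve out.
squared_r = (df['mean radius'] ** 2).corr(df['mean area'])

print(f"radius   vs area: r = {linear_r:.3f}")
print(f"radius^2 vs area: r = {squared_r:.3f}")
```

If the squared relationship holds, the second value should sit at least as close to 1 as the first.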
Summary
In this tutorial, we looked at one of the toy datasets provided through the sklearn library. With a relatively basic function, we were able to load this data into a pandas dataframe and use some of the powerful pandas methods to make understanding our data easy. Combining this with seaborn gave us the tools necessary to get quick and nice-looking visuals. Everything we’ve done here only took 19 total lines of code. This may all seem very basic, but this is the kind of work that a data scientist depends on every day. In a future post, I’ll talk about how we can build on what we’ve done here to develop a classification model that predicts whether the tumors in this dataset are malignant.