Loading the dataset
The Ionosphere dataset consists of readings collected by high-frequency antennas. The aim of these antennas is to determine whether there is a structure in the ionosphere, a region of the upper atmosphere. We consider readings that show a structure to be good, while those that do not are deemed bad. The aim of this application is to build a data mining classifier that can determine whether a reading is good or bad.
This dataset can be downloaded from the UCI Machine Learning Repository, which hosts datasets for many different data mining applications. Go to http://archive.ics.uci.edu/ml/datasets/Ionosphere and click on Data Folder. Download the ionosphere.data and ionosphere.names files to a folder on your computer. For this example, I'll assume that you have put the dataset in a directory called Data in your home folder. You can place the data in another folder; just be sure to update your data folder (here, and in all other chapters).
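If you prefer to fetch the files from code rather than through the browser, here is a minimal sketch using Python's standard urllib module. The direct file URLs are an assumption based on the repository's usual layout; check the Data Folder link on the page above if the paths have changed.
import os
from urllib.request import urlretrieve
# Assumed base URL for the Data Folder -- verify against the dataset page
base_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/"
data_folder = os.path.join(os.path.expanduser("~"), "Data")
os.makedirs(data_folder, exist_ok=True)
for filename in ("ionosphere.data", "ionosphere.names"):
    urlretrieve(base_url + filename, os.path.join(data_folder, filename))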
The location of your home folder depends on your operating system. For Windows, it is usually at C:\Documents and Settings\username. For Mac or Linux machines, it is usually at /home/username. You can get your home folder by running this Python code inside a Jupyter Notebook:
import os
print(os.path.expanduser("~"))
For each row in the dataset, there are 35 values. The first 34 are measurements taken from the 17 antennas (two values per antenna). The last is either 'g' or 'b', which stand for good and bad, respectively.
Start the Jupyter Notebook server and create a new notebook called Ionosphere Nearest Neighbors. To start with, we load the NumPy and csv modules that our code needs, along with os to build the path, and set the filename of the data file:
import os
import csv
import numpy as np

data_filename = os.path.join(os.path.expanduser("~"), "Data", "ionosphere.data")
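If you want to confirm the file layout described above, a quick optional check is to read just the first row and inspect its fields:
with open(data_filename, 'r') as input_file:
    first_row = next(csv.reader(input_file))
print(len(first_row))   # 35 values per row
print(first_row[-1])    # the class label, 'g' or 'b'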
We then create the X and y NumPy arrays to store the dataset in. The sizes of these arrays are known from the dataset. Don't worry if you don't know the size of future datasets - we will use other methods to load the dataset in future chapters and you won't need to know this size beforehand:
X = np.zeros((351, 34), dtype='float')
y = np.zeros((351,), dtype='bool')
The dataset is in a Comma-Separated Values (CSV) format, which is a commonly used format for datasets. We are going to use the csv module to load this file. Import it and set up a csv reader object, then loop through the file, setting the appropriate row in X and class value in y for every line in our dataset:
with open(data_filename, 'r') as input_file:
    reader = csv.reader(input_file)
    for i, row in enumerate(reader):
        # Get the data, converting each item to a float
        data = [float(datum) for datum in row[:-1]]
        # Set the appropriate row in our dataset
        X[i] = data
        # 1 if the class is 'g', 0 otherwise
        y[i] = row[-1] == 'g'
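At this point you can optionally verify that the load worked as expected; for example, the array shapes should match the dataset description:
print(X.shape)    # (351, 34)
print(y.shape)    # (351,)
print(y.sum())    # the number of 'g' (good) samples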
We now have a dataset of samples and features in X as well as the corresponding classes in y, as we did in the classification example in Chapter 1, Getting Started with Data Mining.
To begin with, try applying the OneR algorithm from Chapter 1, Getting Started with Data Mining to this dataset. It won't work very well, as the information in this dataset is spread across the correlations between features. OneR only looks at the values of a single feature, so it cannot pick up this kind of information very well. Other algorithms, including Nearest Neighbor, combine information from multiple features, making them applicable in more scenarios. The downside is that they are often more computationally expensive.
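As a preview of where this chapter is heading, below is a minimal sketch of a Nearest Neighbor workflow using scikit-learn. The train/test split, the default parameters, and the random_state value are choices made here for illustration; the experiments later in the chapter may differ.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hold out a portion of the data for testing (random_state chosen for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

# Fit a nearest neighbor classifier with its default settings (5 neighbors)
estimator = KNeighborsClassifier()
estimator.fit(X_train, y_train)

# Predict the test set and report the proportion of correct predictions
y_predicted = estimator.predict(X_test)
accuracy = np.mean(y_test == y_predicted) * 100
print("The accuracy is {0:.1f}%".format(accuracy))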