Python Data Mining Quick Start Guide

Nathan Greeneltch

Last updated: 2021-06-24 15:20:20

Cover Page

Title Page

Python Data Mining Quick Start Guide

Dedication

About Packt

Why subscribe?

Packt.com

Contributors

About the author

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Data Mining and Getting Started with Python Tools

Descriptive, predictive, and prescriptive analytics

What will and will not be covered in this book

Recommended readings for further explanation

Setting up Python environments for data mining

Installing the Anaconda distribution and Conda package manager

Installing on Linux

Installing on Windows

Installing on macOS

Launching the Spyder IDE

Launching a Jupyter Notebook

Installing high-performance Python distribution

Recommended libraries and how to install

Recommended libraries

Summary

Basic Terminology and Our End-to-End Example

Basic data terminology

Sample spaces

Variable types

Data types

Basic summary statistics

An end-to-end example of data mining in Python

Loading data into memory – viewing and managing with ease using pandas

Plotting and exploring data – harnessing the power of Seaborn

Transforming data – PCA and LDA with scikit-learn

Quantifying separations – k-means clustering and the silhouette score

Making decisions or predictions

Summary

Collecting, Exploring, and Visualizing Data

Types of data sources and loading into pandas

Databases

Basic Structured Query Language (SQL) queries

Disks

Web sources

From URLs

From scikit-learn and Seaborn-included sets

Access, search, and sanity checks with pandas

Basic plotting in Seaborn

Popular types of plots for visualizing data

Scatter plots

Histograms

Jointplots

Violin plots

Pairplots

Summary

Cleaning and Readying Data for Analysis

The scikit-learn transformer API

Cleaning input data

Missing values

Finding and removing missing values

Imputing to replace the missing values

Feature scaling

Normalization

Standardization

Handling categorical data

Ordinal encoding

One-hot encoding

Label encoding

High-dimensional data

Dimension reduction

Feature selection

Feature filtering

The variance threshold

The correlation coefficient

Wrapper methods

Sequential feature selection

Transformation

PCA

LDA

Summary

Grouping and Clustering Data

Introducing clustering concepts

Location of the group

Euclidean space (centroids)

Non-Euclidean space (medoids)

Similarity

Euclidean space

The Euclidean distance

The Manhattan distance