CPSC/AMTH 445/545 - Introduction to Data Mining - Fall 2016 Yale
Introduction to Data Mining
Instructor: Guy Wolf (firstname.lastname@example.org)
TA: Nicholas Marshall (email@example.com)
ULA: Yutaro Yamada (firstname.lastname@example.org)
The ability to process and extract insightful information from large amounts of data has become a desired, if not necessary, skill in almost every field of industry and science.
Among other benefits, such information can provide useful knowledge, support decision-making, uncover hidden trends, and enable deeper understanding of observed phenomena.
This course will cover some of the main problems and challenges encountered in data analysis and applications, and provide fundamental tools and techniques for solving them.
We will discuss popular algorithms for data organization & visualization, such as principal component analysis (PCA) and multidimensional scaling (MDS).
Students will become familiar with a variety of machine learning and data mining approaches.
These will include both supervised approaches, such as performing classification with support vector machines (SVM), and unsupervised ones, such as clustering data with k-means.
The lectures and discussions in class will be accompanied by homework exercises. These combine theoretical questions, which emphasize understanding of the underlying data mining principles, with programming tasks (e.g., in MATLAB and/or Python) that demonstrate practical implementations of the techniques studied.
Grades in this course will be based on these exercises, a project, and an exam.
The course assumes basic prior knowledge of probability, linear algebra, data structures, algorithms, and programming.
Lectures: Tuesdays & Thursdays 1:00-2:15 PM, GR109 (Rosenfeld Hall, 109 Grove St)
Wednesdays 7:00 PM, AKW 100
Instructor: Wednesdays 6:00-7:00 PM, AKW 103
TA: Mondays 5:00-6:00 PM (or by appointment), AKW 307
ULA: Fridays 4:00-5:00 PM (or by appointment), AKW 200
No required textbook, but the following books are recommended for the course:
This is a tentative list of topics we intend to cover, which may change as we progress through the course:
- Introduction to data mining tasks
- Data exploration and visualization
- Distances and similarities
- Data preprocessing
- Dimensionality reduction
- Principal component analysis
- Multidimensional scaling
- Decision trees & random forests
- Bayesian classification
- Support vector machines
- Partitional: k-means, k-modes, LDF (shake & bake), k-medoids, & PAM
- Density-based: DBSCAN and density-based clustering
- Hierarchical clustering
- Bisecting k-means
- Agglomerative clustering
- Large-scale methods: BIRCH, CURE, & Chameleon
- Nonlinear dimensionality reduction
- Diffusion maps
Extra topics (slides not prepared specifically for this course):
NOTE: This webpage is outdated since it relates to a past iteration of this course.