CPSC/AMTH 445/545 - Introduction to Data Mining - Fall 2017 Yale
Instructor: Guy Wolf (guy.wolf@yale.edu)
TA: Jay Stanley (jay.stanley@yale.edu)
ULAs: Tyler Dohrn (tyler.dohrn@yale.edu) & Scott Stankey (scott.stankey@yale.edu)
The ability to process and extract insightful information from large amounts of data has become a desired, if not necessary, skill in almost every field of industry and science.
Among other benefits, such information can provide useful knowledge, support decision-making, uncover hidden trends, and enable deeper understanding of observed phenomena.
This course will cover some of the main problems and challenges encountered in data analysis and applications, and provide fundamental tools and techniques for solving them.
We will discuss popular algorithms for data organization & visualization, such as principal component analysis (PCA) and multidimensional scaling (MDS).
Students will become familiar with a variety of machine learning and data mining approaches.
These will include both supervised approaches, such as performing classification (e.g., with decision trees, Bayesian classifiers, and SVM), and unsupervised ones, such as clustering data (e.g., with k-means, density estimators, and linkage-based agglomeration).
The lectures and discussions in class will be accompanied by homework exercises that combine theoretical questions, which emphasize understanding of the underlying data mining principles, with programming tasks (e.g., in MATLAB and/or Python) that demonstrate practical implementations of the techniques studied.
Grades in this course will be based on these exercises, a project, and an exam.
The course assumes basic prior knowledge of probability, linear algebra, data structures, algorithms, and programming.
Meetings
Lectures:
Tuesdays & Thursdays 1:00-2:15
Discussions:
Wednesdays 6:00 PM, AKW 307
Office Hours:
Instructor: Fridays 4:00 PM - 6:00 PM, AKW 103
TA: Wednesdays 10:00 AM - 12:00 PM, AKW 307, or by appointment
ULA: Mondays 4:00 PM - 6:00 PM, HLH17 zoo annex
Textbooks
No required textbook, but the following books are recommended for the course:
Topics
This is a tentative list of topics we intend to cover, which may change as we progress through the course:
- Introduction to data mining tasks
- Data exploration and visualization
- Distances and similarities
- Data preprocessing
- Dimensionality reduction
- Principal component analysis
- Multidimensional scaling
- Classification
- Decision trees & random forests
- Bayesian classification
- Support vector machines
- Clustering
- Partitional: k-means, k-modes, LDF (shake & bake), k-medoids, & PAM
- Density-based: DBSCAN and density-based clustering
- Hierarchical clustering
- Bisecting k-means
- Agglomerative clustering
- Large-scale methods: BIRCH, CURE, & Chameleon
- Nonlinear dimensionality reduction
- Isomap
- Diffusion maps
Slides:
Extra topics (slides not prepared specifically for this course):
Final grade composition:
The final grade in this class will be based on three components:
- 30% -- in-class exam
- 30% -- final project
- 4 x 10% -- the best four exercises out of a total of five (i.e., lowest score is dropped, and each remaining score accounts for 10% of the final grade).
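The weighting above can be sketched in a few lines of Python. This is only an illustration, not an official grade calculator: the function name `final_grade`, the 0-100 score scale, and the choice to count a skipped exercise as 0 are all assumptions made for the example.

```python
def final_grade(exam, project, exercises):
    """Combine scores using the stated weights: 30% exam, 30% project,
    and 10% for each of the best four out of five exercises.

    Assumptions (not stated on this page): every score is on a 0-100
    scale, and a skipped exercise is entered as 0.
    """
    if len(exercises) != 5:
        raise ValueError("expected exactly five exercise scores")
    # Drop the lowest exercise score by keeping the four highest.
    best_four = sorted(exercises, reverse=True)[:4]
    return 0.30 * exam + 0.30 * project + 0.10 * sum(best_four)


# Hypothetical scores: one exercise was skipped and is entered as 0.
print(final_grade(exam=80, project=90, exercises=[100, 70, 0, 90, 80]))
```

In this hypothetical case the skipped exercise is the one dropped, so it does not lower the result.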
In-class exam:
- An in-class exam will be held on the last Tuesday of the semester.
- Exact date & time: Tuesday, Dec. 5, 1:00 PM.
- The exam will include all the materials shown in class, including:
- Course topic slides: Topics 1-10
- Extra topic slides: the first part of the IPAM tutorial, covering Isomap together with linear, nonlinear, and graph-based dimensionality reduction methods.
- No books, notes, or other materials will be allowed near your desk during the exam.
Group projects:
- For the course project, students will form groups of 2-4 members.
- Each group should designate a contact person.
- By Thursday, Oct. 12, 9:00 PM, each group should submit a project proposal via Canvas (instructions will be posted soon).
- Proposals should include: project description & goals, planned contributions of each team member, and the data & data sources to be used.
- Each group should sign up for a 20-minute review meeting with the TA and ULAs prior to the October break to discuss their proposed project.
- Meetings will take place on Monday, Oct. 9th, in the HLH17 zoo annex.
- Sign up for meetings via the Google form.
Exercises:
Note that only the top four exercise grades will be used when computing the final grade in the course, so you can skip at most one exercise during the semester.
- Exercise 01 - due by Thursday, Sep. 21st, 1:00 PM.
- Exercise 02 - due by Thursday, Oct. 5th, 1:00 PM.
- Exercise 03 - due by Friday, Oct. 27th, 5:00 PM.
- Problem 4:
- Problem 5:
- Problem 6:
- Exercise 04 - due by Monday, Nov. 20th, 5:00 PM.
- Problem 3:
- Problem 4:
- Problem 5:
- Problem 6:
- Exercise 05 - due by Friday, Dec. 8th, 5:00 PM.
- Problem 2:
- Problem 3:
- Problem 5:
- Problem 6: