Data stream mining is the process of extracting knowledge from continuous, high-velocity streams of records. It involves predicting the class or value of new instances in the stream based on previously seen instances. Machine learning techniques are often used to automate this prediction, and ideas from incremental learning are applied to handle changes in the stream over time. Challenges in data stream mining include detecting concept drift, dealing with partially labeled data, recovering from concept drift, and handling temporal dependencies. Examples of data streams include network traffic, phone conversations, ATM transactions, web searches, and sensor data.
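As a rough illustration of the incremental-learning idea described above, the sketch below trains a classifier one mini-batch at a time using scikit-learn's `partial_fit`. The synthetic stream, the `SGDClassifier` choice, and the simulated drift point are assumptions made for demonstration only, not part of the original text.

```python
# Minimal sketch (assumes scikit-learn and NumPy are available) of incremental
# learning on a data stream, with an artificial concept drift halfway through.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()          # linear model that supports partial_fit
classes = np.array([0, 1])       # must be declared up front for incremental fitting

for batch in range(200):
    # Synthetic mini-batch: two Gaussian blobs whose means swap after batch 100,
    # simulating an abrupt concept drift in the stream.
    drifted = batch >= 100
    y = rng.integers(0, 2, size=50)
    centers = np.where(y == 1, 2.0, -2.0) * (-1.0 if drifted else 1.0)
    X = centers[:, None] + rng.normal(size=(50, 2))

    if batch % 50 == 0 and batch > 0:
        # Prequential ("test-then-train") evaluation: score the incoming batch
        # before the model is updated on it.
        print(f"batch {batch:3d}  accuracy on incoming batch: {model.score(X, y):.2f}")

    # Incremental update on the new batch only; earlier data is never revisited.
    model.partial_fit(X, y, classes=classes)
```

A drop in the printed accuracy right after the drift point, followed by recovery as the model keeps updating, is the kind of behavior drift-detection and recovery techniques aim to manage.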
Stanford University
Spring 2023
This course focuses on data mining and machine learning algorithms for large-scale data analysis. The emphasis is on parallel algorithms using tools such as MapReduce and Spark. Topics include frequent itemsets, locality-sensitive hashing, clustering, link analysis, and large-scale supervised machine learning. Familiarity with Java, Python, basic probability theory, linear algebra, and algorithmic analysis is required.
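Since the course description emphasizes the MapReduce-style parallelism behind Spark, a small illustrative sketch may help fix ideas. The PySpark word count below is a generic example of the map/reduce pattern, not course material; the local SparkSession setup and the sample sentences are assumptions.

```python
# Minimal sketch (assumes a local PySpark installation) of the classic
# MapReduce word-count pattern expressed with Spark RDD operations.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("wordcount-sketch")
         .getOrCreate())
sc = spark.sparkContext

lines = sc.parallelize([
    "frequent itemsets and locality sensitive hashing",
    "clustering link analysis and large scale learning",
])

counts = (
    lines.flatMap(lambda line: line.split())   # map: emit one record per word
         .map(lambda word: (word, 1))          # map: key each word with a count of 1
         .reduceByKey(lambda a, b: a + b)      # reduce: sum counts per key in parallel
)

print(counts.collect())
spark.stop()
```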