You can download a complete PDF copy of “Mining of Massive Datasets” by Anand Rajaraman and Jeffrey David Ullman from their website. Written by two authorities in database and Web technologies, this book is essential reading for students and practitioners alike. It teaches algorithms that have been used in practice to solve key problems in data mining and includes exercises suitable for students at the advanced undergraduate level and beyond.
At the highest level of description, this book is about data mining. However, it focuses on data mining of very large amounts of data, that is, data so large it does not fit in main memory. Because of the emphasis on size, many of our examples are about the Web or data derived from the Web. Further, the book takes an algorithmic point of view: data mining is about applying algorithms to data, rather than using data to “train” a machine-learning engine of some sort. The principal topics covered are:
1. Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data.
2. Similarity search, including the key techniques of minhashing and locality-sensitive hashing.
3. Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.
4. The technology of search engines, including Google’s PageRank, link-spam detection, and the hubs-and-authorities approach.
5. Frequent-itemset mining, including association rules, market-baskets, the A-Priori Algorithm and its improvements.
6. Algorithms for clustering very large, high-dimensional datasets.
7. Two key problems for Web applications: managing advertising and recommendation systems.
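To give a flavor of the similarity-search material in topic 2, here is a minimal sketch of minhashing in Python. It is an illustration under simplifying assumptions (explicit random permutations over a small universe rather than the hash-function trick the book develops), not the book's own code: the fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying sets.

```python
import random

def minhash_signatures(sets, num_hashes=100, seed=0):
    """Compute a MinHash signature for each input set.

    Illustrative only: uses explicit random permutations of the
    element universe, which is feasible here because the universe
    is tiny.
    """
    rng = random.Random(seed)
    universe = sorted(set().union(*sets))
    # Each "hash function" is one random permutation of the universe.
    perms = [rng.sample(universe, len(universe)) for _ in range(num_hashes)]
    signatures = []
    for s in sets:
        # Signature entry for a permutation: index of the first
        # permuted element that belongs to the set.
        sig = [next(i for i, x in enumerate(perm) if x in s) for perm in perms]
        signatures.append(sig)
    return signatures

def estimated_jaccard(sig_a, sig_b):
    # Matching signature entries estimate the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

if __name__ == "__main__":
    a = {"apple", "banana", "cherry", "date"}
    b = {"banana", "cherry", "date", "fig"}
    sig_a, sig_b = minhash_signatures([a, b])
    # True Jaccard similarity is 3/5 = 0.6; the estimate should be close.
    print(round(estimated_jaccard(sig_a, sig_b), 2))
```

With 100 permutations the estimate is typically within a few percent of the true Jaccard similarity; the book's treatment replaces the permutations with cheap hash functions so the idea scales to massive sets.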
Table of Contents
- Data Mining
- Large-Scale File Systems and Map-Reduce
- Finding Similar Items
- Mining Data Streams
- Link Analysis
- Frequent Itemsets
- Advertising on the Web
- Recommendation Systems
File size: 2.63 MB
Number of pages: 457