Slide 1: Lecture 17: MapReduce

Principles of Computer Systems

Winter 2020

Stanford University

Computer Science Department

Instructors: Chris Gregg

                            Nick Troccoli

 

PDF of this presentation

Slide 2:  The Genesis of Datacenter Computing: MapReduce

Image:

PDF of "Mapreduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat, Google.

End of Image

Slide 3:  The Genesis of Datacenter Computing: MapReduce

Slide 4:  The Genesis of Datacenter Computing: MapReduce

  1. Map: each worker node applies the map function to the local data, and writes the output to a temporary storage. A master node ensures that only one copy of the redundant input data is processed.
  2. Shuffle: worker nodes redistribute data based on the output keys (produced by the map function), such that all data belonging to one key is located on the same worker node.
  3. Reduce: worker nodes now process each group of output data, per key, in parallel.

 

(https://en.wikipedia.org/wiki/MapReduce)

Slide 5: Core Data Abstraction: Key/Value Pairs

map(k1, v1) -> list(k2, v2)
reduce(k2, list(v2)) -> list(v2)
map(String key, String value):
  // key: document name
  // value: document contents
  for word w in value:
    EmitIntermediate(w,"1")
    
reduce(String key, List values):
  // key: a word
  // values: a list of counts
  int result = 0
  for v in values:
    result += ParseInt(v)
  Emit(AsString(result))

"The number of partitions (R) and the partitioning function are specified by the user."

("the", "1") , ("the", "1"), ("The", "1"), ("of", 1),

(number, "1"), ...

"the", ("1", "1") -> "2"

"The", ("1") -> "1"

"of", ("1") -> "1"

"number", ("1") -> "1"

 

 map output

input

 reduce

Slide 6: Key/Value Pairs: How and Where

map(k1, v1) -> list(k2, v2)
reduce(k2, list(v2)) -> list(v2)

Slide 7: MapReduce System Architecture

Slide 8: Your MapReduce

mapper

reducer

Google

mapper

reducer

Your System

AFS

Slide 9: MapReduce Data Flow

Slide 10: Word Count Example

 12 input partitions

chunks of file

map tasks

1

2

11

12

map(String key, String value):
  // key: document name
  // value: document contents
  for word w in value:
    EmitIntermediate(w,"1")
    
reduce(String key, List values):
  // key: a word
  // values: a list of counts
  int result = 0
  for v in values:
    result += ParseInt(v)
  Emit(AsString(result))

Slide 11: Word Count Example

 12 input partitions

"Propitious Pallas, to secure..."

map tasks

1

2

11

12

 240 intermediate files

each of 12 map tasks produces one file for each worker task

12.19.mapped

1.5.mapped

1.0.mapped

12.18.mapped

This file contains all of the map outputs for map task 7 whose key, modulo 20, is 5

 

in 1

now 1

step 1

now 1

in 1

in 1

....

 

7.5.mapped

Slide 12: Word Count Example

 12 input partitions

"Propitious Pallas, to secure..."

map tasks

1

2

11

12

 240 intermediate files

each of 12 map tasks produces one file for each worker task

12.19.mapped

1.5.mapped

1.0.mapped

12.18.mapped

7.5.mapped

5.grouped

This file contains all of the reduce inputs for every key whose hash % 20 == 5

 

in 1 1 1 1 1 1 1 1 .. (1149 times)

now 1 1 1 1 ... (126 times)

step 1 1 1 1 ... (10 times)

....

 

Slide 13: Word Count Example

 12 input partitions

"Propitious Pallas, to secure..."

map tasks

1

2

11

12

 240 intermediate files

each of 12 map tasks produces one file for each worker task

12.19.mapped

1.5.mapped

1.0.mapped

12.18.mapped

7.5.mapped

5.grouped

0

1

5

19

5.output

Contains the count of words whose hash % 20 == 5

 

in 1149

now 126

step 10

reduce tasks

Slide 14: Importance of Keys

 

 

 

 

 

 

 

 

 

 

 

 

Slide 15: MapReduce Assignment: Example

myth57:$ make filefree
rm -fr files/intermediate/* files/output/*

myth57:$ ./samples/mr_soln --mapper ./samples/mrm_soln --reducer ./samples/mrr_soln --config odyssey-full.cfg

Determining which machines in the myth cluster can be used... [done!!]
Mapper executable: word-count-mapper
Reducer executable: word-count-reducer
Number of Mapping Workers: 8
Number of Reducing Workers: 4
Input Path: /usr/class/cs110/samples/assign8/odyssey-full
Intermediate Path: /afs/.ir.stanford.edu/users/c/g/cgregg/cs110/winter-2020/assignments/assign8/files/intermediate
Output Path: /afs/.ir.stanford.edu/users/c/g/cgregg/cs110/winter-2020/assignments/assign8/files/output
Server running on port 48721

Received a connection request from myth59.stanford.edu.
Incoming communication from myth59.stanford.edu on descriptor 6.
Instructing worker at myth59.stanford.edu to process this pattern: "/usr/class/cs110/samples/assign8/odyssey-full/00001.input"
Conversation with myth59.stanford.edu complete.
Received a connection request from myth61.stanford.edu.
Incoming communication from myth61.stanford.edu on descriptor 7.

... LOTS of lines removed

Remote ssh command on myth56 executed and exited with status code 0.
Reduction of all intermediate chunks now complete.
/afs/.ir.stanford.edu/users/c/g/cgregg/cs110/winter-2020/assignments/assign8/files/output/00000.output hashes to 13585898109251157014
/afs/.ir.stanford.edu/users/c/g/cgregg/cs110/winter-2020/assignments/assign8/files/output/00001.output hashes to 1022930401727915107
/afs/.ir.stanford.edu/users/c/g/cgregg/cs110/winter-2020/assignments/assign8/files/output/00002.output hashes to 9942936493001557706
/afs/.ir.stanford.edu/users/c/g/cgregg/cs110/winter-2020/assignments/assign8/files/output/00003.output hashes to 5127170323801202206

... more lines removed

Server has shut down.

 

 

 

 

 

 

 

 

 

 

 

Slide 16: MapReduce Assignment: Mapped File Contents

myth57:$ ls -lu files/intermediate/
total 858
-rw------- 1 cgregg operator 2279 May 29 09:29 00001.00000.mapped
-rw------- 1 cgregg operator 1448 May 29 09:29 00001.00001.mapped
-rw------- 1 cgregg operator 1927 May 29 09:29 00001.00002.mapped
-rw------- 1 cgregg operator 2776 May 29 09:29 00001.00003.mapped
-rw------- 1 cgregg operator 1071 May 29 09:29 00001.00004.mapped
...lots removed
-rw------- 1 cgregg operator  968 May 29 09:29 00012.00027.mapped
-rw------- 1 cgregg operator 1720 May 29 09:29 00012.00028.mapped
-rw------- 1 cgregg operator 1686 May 29 09:29 00012.00029.mapped
-rw------- 1 cgregg operator 2930 May 29 09:29 00012.00030.mapped
-rw------- 1 cgregg operator 2355 May 29 09:29 00012.00031.mapped
myth57:$ head -10 files/intermediate/00012.00028.mapped
thee 1
rest 1
thee 1
woes 1
knows 1
grieve 1
sire 1
laertes 1
sire 1
power 1

 

 

 

 

 

 

Slide 17: MapReduce Assignment: Hashing

myth57:$ head -10 files/intermediate/00005.00028.mapped
vain 1
must 1
strand 1
cry 1
herself 1
she 1
along 1
head 1
dayreflection 1
thee 1
myth57:$ ./hasher thee 32
28

 

 

 

 

 

 

 

 

Slide 18: MapReduce Assignment: Starter Code

myth57:~$ make directories filefree
// make command listings removed for brevity
myth57:~$ make
// make command listings removed for brevity
myth57:~$ ./mr --mapper ./mrm --reducer ./mrr --config odyssey-full.cfg --map-only --quiet 
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00001.mapped hashes to 2579744460591809953
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00002.mapped hashes to 15803262022774104844
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00003.mapped hashes to 15899354350090661280
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00004.mapped hashes to 15307244185057831752
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00005.mapped hashes to 13459647136135605867
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00006.mapped hashes to 2960163283726752270
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00007.mapped hashes to 3717115895887543972
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00008.mapped hashes to 8824063684278310934
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00009.mapped hashes to 673568360187010420
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00010.mapped hashes to 9867662168026348720
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00011.mapped hashes to 5390329291543335432
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00012.mapped hashes to 13755032733372518054
myth57:~$ 
myth57:~$ $ ls -l files/intermediate/
total 655
-rw------- 1 cgregg operator 76280 May 29 10:26 00001.mapped
-rw------- 1 cgregg operator 54704 May 29 10:26 00002.mapped
-rw------- 1 cgregg operator 53732 May 29 10:26 00003.mapped
-rw------- 1 cgregg operator 53246 May 29 10:26 00004.mapped
-rw------- 1 cgregg operator 53693 May 29 10:26 00005.mapped
-rw------- 1 cgregg operator 53182 May 29 10:26 00006.mapped
-rw------- 1 cgregg operator 54404 May 29 10:26 00007.mapped
-rw------- 1 cgregg operator 53464 May 29 10:26 00008.mapped
-rw------- 1 cgregg operator 53143 May 29 10:26 00009.mapped
-rw------- 1 cgregg operator 53325 May 29 10:26 00010.mapped
-rw------- 1 cgregg operator 53790 May 29 10:26 00011.mapped
-rw------- 1 cgregg operator 52207 May 29 10:26 00012.mapped

 

 

$ grep -l "^thee " files/intermediate/*.mapped \
          | wc -l
11
myth57:$ ls -l files/output/
total 0

Slide 19: MapReduce questions

 

 

 

 

 

 

 

 

Slide 20: Centralized vs. Distributed

Slide 21: Centralized vs. Distributed

"Execution Templates: Caching Control Plane Decisions for
Strong Scaling of Data Analytics"

Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis

In Proceedings of 2017 USENIX Annual Technical Conference (USENIX ATC '17)

Slide 22: Centralized Bottleneck

Slide 23: Decentralized Approach (e.g., Naiad, MPI)

Slide 24: Idea: Cache Control Plane Messages

"Execution Templates: Caching Control Plane Decisions for
Strong Scaling of Data Analytics"

Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis

In Proceedings of 2017 USENIX Annual Technical Conference (USENIX ATC '17)

Slide 25: Results

"Execution Templates: Caching Control Plane Decisions for Strong Scaling of Data Analytics"

Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis

In Proceedings of 2017 USENIX Annual Technical Conference (USENIX ATC '17)

Slide 26: Results

"Execution Templates: Caching Control Plane Decisions for Strong Scaling of Data Analytics"

Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis

In Proceedings of 2017 USENIX Annual Technical Conference (USENIX ATC '17)

Slide 27: Results

"Execution Templates: Caching Control Plane Decisions for Strong Scaling of Data Analytics"

Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis

In Proceedings of 2017 USENIX Annual Technical Conference (USENIX ATC '17)