The group-by-key process's postcondition
- The following Python script is a short but dense program that reads a stream of key-value pairs, sorted by key, from standard input and writes the same content with one change: all lines that share a key are collapsed into a single line, and their values are collapsed into a single space-separated list of values (sorted as strings):
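#!/usr/bin/env python
from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file):
    # Each input line is "key value"; yield it as a [key, value] pair.
    for line in file:
        yield line.strip().split(' ')

def main():
    data = read_mapper_output(sys.stdin)
    # groupby only merges adjacent items, which is why the input must be sorted by key.
    for key, keygroup in groupby(data, itemgetter(0)):
        values = ' '.join(sorted(v for k, v in keygroup))
        # The parenthesized form prints identically under Python 2 and Python 3.
        print("%s %s" % (key, values))

if __name__ == "__main__":
    main()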
- The sorted output of the problem-specific mapper can be piped into the script above, like so:
myth22> more anna-karenina.txt | ./word-count-mapper.py | sort | ./group-by-key.py
a 1 1 1 1 1 // plus 6064 more 1's on this same line
abandon 1 1 1 1 1 1
abandoned 1 1 1 1 1 1 1 1 1
abandonment 1
abashed 1 1
abasing 1
aber 1
abilities 1
...
zaraisky 1 1 1 1
zeal 1 1 1
zealously 1
zest 1
zhivahov 1
zigzag 1
zoological 1 1
zoology 1
zu 1
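
- The `sort` stage in the pipeline matters because `itertools.groupby` only merges adjacent items with equal keys. The sketch below is a minimal in-memory illustration of that point, not part of the original scripts: the helper `group_by_key` and the sample list `unsorted_lines` are hypothetical names introduced here, and the input is assumed to have exactly two space-separated fields per line.

```python
from itertools import groupby
from operator import itemgetter

def group_by_key(lines):
    """Collapse consecutive lines that share a key into 'key v1 v2 ...'.

    Hypothetical helper mirroring the grouping logic of group-by-key.py,
    but operating on a list of strings instead of standard input.
    """
    pairs = (line.strip().split(' ') for line in lines)
    return ["%s %s" % (key, ' '.join(sorted(v for _, v in group)))
            for key, group in groupby(pairs, itemgetter(0))]

# Made-up sample data standing in for unsorted mapper output.
unsorted_lines = ["zeal 1", "a 1", "zeal 1", "a 1"]

# Equal keys are not adjacent here, so each key is emitted more than once.
print(group_by_key(unsorted_lines))          # ['zeal 1', 'a 1', 'zeal 1', 'a 1']

# After sorting, equal keys are adjacent and each key is emitted exactly once.
print(group_by_key(sorted(unsorted_lines)))  # ['a 1 1', 'zeal 1 1']
```

Under these assumptions, only the sorted run satisfies the property that every key appears on exactly one output line, which is the guarantee the later stages of the pipeline rely on.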