Machine Learning/Hadoop
About
- Google had so much data that just reading it from disk took a long time, let alone processing it
- So they needed to parallelize everything, even disk access
- Make the processing local to where the data is, to avoid network issues
- Parallelization is hard/error-prone
- Want to have a "shared-nothing" architecture
- Functional programming
- Map
Applies the function to each item in a list and returns the list of results
def map(func, items):
    return [func(item) for item in items]
Example:
def twice(num):
    return num * 2
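For example, mapping twice over a small list with the definitions above:
map(twice, [1, 2, 3])   # [2, 4, 6]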
- Reduce
Takes a function of two arguments and a list, and folds through the list, accumulating a single result
def reduce(func, items):
    acc = func(items[0], items[1])
    for item in items[2:]:
        acc = func(acc, item)
    return acc
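For example, folding with addition sums a list; add is a small helper defined here just for illustration:
def add(a, b):
    return a + b
reduce(add, [1, 2, 3, 4])   # ((1 + 2) + 3) + 4 = 10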
Examples/Actual
def map(key, value):
    # process the key/value pair
    emit(another_key, another_value)

def reduce(key, values):
    # process the key and all the values associated with it
    emit(something)
- Average
- keys are line numbers, values are the line contents
- file (each line becomes a (line number, value) pair):
- 1 (1, 1)
- 4 (2, 4)
- 5 (3, 5)
- 6 (4, 6)
def map(key, value):
    emit("exist", 1)    # one count per line
    emit("x", value)    # the line's value, for the sum

def reduce(key, values):
    emit(key, sum(values))   # total count under "exist", total sum under "x"
# the average is then sum("x") / sum("exist")
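As a sanity check, here is a minimal, self-contained Python sketch that simulates this job locally; emit and the grouping dictionary are stand-ins for Hadoop's shuffle phase, not actual Hadoop calls:

from collections import defaultdict

groups = defaultdict(list)

def emit(key, value):
    # stand-in for the shuffle: collect values by key
    groups[key].append(value)

def map_fn(key, value):
    emit("exist", 1)
    emit("x", value)

# the example file as (line number, value) pairs
for k, v in [(1, 1), (2, 4), (3, 5), (4, 6)]:
    map_fn(k, v)

totals = dict((key, sum(values)) for key, values in groups.items())
print(totals["x"] / float(totals["exist"]))   # 4.0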
Tutorials
How to Debug
- To debug a streaming Hadoop process, cat your source file, pipe it to the mapper, then to sort, then to the reducer
- Ex: cat princess_bride.txt | scripts/word-count/mapper.py | sort | scripts/word-count/reducer.py
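The word-count scripts themselves are not reproduced here; below is a minimal sketch of what such a streaming mapper.py and reducer.py could look like (the file names are the ones referenced above, but the bodies are assumptions; Hadoop Streaming only requires reading stdin and writing tab-separated key/value lines):

#!/usr/bin/env python
# mapper.py (sketch): emit (word, 1) for every word on stdin
import sys
for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py (sketch): sum the counts for each word; input arrives sorted by key
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

With these in place, the pipeline above prints each word with its count, which mirrors what the full Hadoop job would produce.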
Tools
- Hadoop: the open-source implementation of MapReduce, plus a distributed filesystem (HDFS)
- Hive: SQL-like queries that compile down to MapReduce jobs
- Pig: a high-level language that compiles down to MapReduce programs
- Amazon Elastic MapReduce: hosted Hadoop on AWS