Machine Learning/Hadoop
ThomasLotze (talk | contribs)
===About===
* Google had so much data that merely ''reading'' it from disk took significant time, let alone processing it
** So they needed to parallelize everything, even disk access
** Move the processing to where the data lives, to avoid network bottlenecks
* Parallelization is hard/error-prone
** Want a "shared-nothing" architecture
** Functional programming: side-effect-free map and reduce steps are safe to run in parallel
* Map
Runs a function on each item in a list and returns the list of results:
<pre>
def my_map(func, items):
    # named my_map/items to avoid shadowing Python's built-ins map and list
    return [func(item) for item in items]
</pre>
Example:
<pre>
def twice(num):
    return num * 2

my_map(twice, [1, 2, 3])  # [2, 4, 6]
</pre>
* Reduce
Takes a function of two arguments and a list, and iterates through the list, accumulating a single result:
<pre>
def my_reduce(func, items):
    acc = func(items[0], items[1])
    for item in items[2:]:
        acc = func(acc, item)
    return acc
</pre>
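Putting the two together — a minimal self-contained sketch (the definitions are repeated here so the snippet runs on its own, renamed to avoid shadowing Python's built-in map and reduce):

```python
def my_map(func, items):
    # apply func to each item, collecting the results in a new list
    return [func(item) for item in items]

def my_reduce(func, items):
    # fold the list left-to-right, accumulating with func
    acc = func(items[0], items[1])
    for item in items[2:]:
        acc = func(acc, item)
    return acc

def twice(num):
    return num * 2

doubled = my_map(twice, [1, 2, 3, 4])            # [2, 4, 6, 8]
total = my_reduce(lambda a, b: a + b, doubled)   # 20
print(doubled, total)
```

Because neither step mutates shared state, the map calls could run on different machines and the reduce could combine partial results — which is exactly the property Hadoop exploits.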
===Examples/Actual===
<pre>
def map(key, value):
    # process
    emit(another_key, another_value)

def reduce(key, values):
    # process the key and all values associated with it
    emit(something)
</pre>
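The skeleton above can be exercised with a tiny in-memory simulator — a sketch of the programming model, not Hadoop's actual API. The framework's job between map and reduce is the "shuffle": grouping every emitted value by its key, then calling reduce once per key. Here `emit` is passed in as a parameter rather than being the global function the pseudocode implies:

```python
from collections import defaultdict

def run_mapreduce(mapper, reducer, records):
    """Toy MapReduce: records is a list of (key, value) input pairs."""
    groups = defaultdict(list)
    map_emit = lambda k, v: groups[k].append(v)  # the shuffle: group values by key
    for key, value in records:
        mapper(key, value, map_emit)
    results = []
    for key, values in groups.items():
        reducer(key, values, results.append)     # reduce runs once per key
    return results

# hypothetical job: count the input records
def count_map(key, value, emit):
    emit("count", 1)

def count_reduce(key, values, emit):
    emit((key, sum(values)))

print(run_mapreduce(count_map, count_reduce, [(1, "a"), (2, "b"), (3, "c")]))
# [('count', 3)]
```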
* Average
** keys are line numbers, values are what's on that line
** file:
*** 1 (1,2)
*** 4 (2,4)
*** 5 (3,5)
*** 6 (4,6)
<pre>
def map(key, value):
    emit("exist", 1)   # one per record, so summing these gives the count
    emit("x", value)   # summing these gives the total

def reduce(key, values):
    # for both keys the aggregation needed is a sum;
    # the average is then sum("x") / sum("exist")
    emit(key, sum(values))
</pre>
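Running the average example end-to-end — a self-contained sketch assuming the (line number, value) pairs above and that a driver divides the two per-key sums after the job finishes:

```python
from collections import defaultdict

def avg_map(key, value, emit):
    emit("exist", 1)     # one count per record
    emit("x", value)     # the value itself

def avg_reduce(key, values, emit):
    emit((key, sum(values)))  # per-key sum: total count, or total of x

records = [(1, 2), (2, 4), (3, 5), (4, 6)]  # (line number, value)

groups = defaultdict(list)                   # the shuffle: group by key
for k, v in records:
    avg_map(k, v, lambda key, val: groups[key].append(val))

results = {}
for key, values in groups.items():
    avg_reduce(key, values, lambda pair: results.update([pair]))

print(results["x"] / results["exist"])  # 17 / 4 = 4.25
```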
===Tutorials===
* http://www.cloudera.com/videos/introduction_to_pig
===How to Debug===
* To debug a streaming Hadoop process, cat your source file, pipe it to the mapper, then to sort, then to the reducer
** Ex: <pre>cat princess_bride.txt | scripts/word-count/mapper.py | sort | scripts/word-count/reducer.py</pre>
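The word-count scripts referenced above are not included here; the following is a sketch of what such a streaming mapper/reducer pair might look like (the behavior is an assumption based on the pipeline shown). The mapper prints one `word<TAB>1` line per word, and the reducer relies on the intervening `sort` having placed identical words on adjacent lines:

```python
from itertools import groupby

def mapper(lines):
    # emits one "word<TAB>1" line per word, as a streaming mapper would to stdout
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # identical words are adjacent thanks to the sort step, so one pass suffices
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda p: p[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# simulate: cat file | mapper | sort | reducer
text = ["as you wish", "as you wish as you wish"]
print(list(reducer(sorted(mapper(text)))))
# ['as\t3', 'wish\t3', 'you\t3']
```

Because the reducer only ever compares adjacent lines, the same script works unchanged whether the sort is done by the shell pipeline or by Hadoop's shuffle.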
===Tools===
* Hadoop
* Hive
* Pig: a high-level language that compiles down to MapReduce programs
* MapReduce on Amazon (?)
Latest revision as of 21:05, 9 June 2010