Machine Learning/Hadoop

===About===

* Google had so much data that ''reading'' it from disk took a long time, let alone processing it
** So they needed to parallelize everything, even disk access
** Make the processing local to where the data is, to avoid network bottlenecks
* Parallelization is hard and error-prone
** Want to have a "shared-nothing" architecture
** Functional programming: pure functions with no shared state are safe to run in parallel
* Map

Runs a function on each item in a list and returns the list of results:

<pre>
def map(func, items):
    return [func(item) for item in items]
</pre>

Example:

<pre>
def twice(num):
    return num * 2
</pre>
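
For instance, mapping ''twice'' over a small list (expected output shown in the comment):

<pre>
map(twice, [1, 2, 3])  # => [2, 4, 6]
</pre>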

* Reduce

Takes a function of two arguments and a list, and walks through the list, accumulating a single result:

<pre>
def reduce(func, items):
    acc = items[0]
    for item in items[1:]:
        acc = func(acc, item)
    return acc
</pre>
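
For instance, reducing with addition (''add'' is a hypothetical helper here; any two-argument function works):

<pre>
def add(a, b):
    return a + b

reduce(add, [1, 2, 3, 4])  # => ((1+2)+3)+4 = 10
</pre>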

===Examples/Actual===

In Hadoop, mappers and reducers have a different shape: the mapper emits intermediate (key, value) pairs, the framework groups them by key, and the reducer sees each key together with all of its values:

<pre>
def map(key, value):
    # process
    emit(another_key, another_value)

def reduce(key, values):
    # process the key and all values associated with it
    emit(something)
</pre>
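
As a concrete instance of this shape, a sketch of the classic word count in the same pseudocode (''emit'' is assumed to be supplied by the framework, as above):

<pre>
def map(key, value):
    # key: line number; value: the line's text
    for word in value.split():
        emit(word, 1)

def reduce(key, values):
    # values: all the 1s emitted for this word
    emit(sum(values))
</pre>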

* Average
** keys are line numbers, values are what's on each line
** file:
*** 1  (1,2)
*** 4  (2,4)
*** 5  (3,5)
*** 6  (4,6)

<pre>
def map(key, value):
    emit("exist", 1)   # one count per record
    emit("x", value)   # the value itself, to be summed

def reduce(key, values):
    # summing yields the record count for "exist" and the total for "x"
    emit(sum(values))
</pre>

The average is then the "x" total divided by the "exist" total.
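
A minimal local sketch of this job, assuming the (key, value) pairs in the listing above, to check the arithmetic without Hadoop:

<pre>
from collections import defaultdict

pairs = [(1, 2), (2, 4), (3, 5), (4, 6)]   # (line number, value), per the listing
grouped = defaultdict(list)

def emit(key, value):
    grouped[key].append(value)

for key, value in pairs:   # map phase, inlining the mapper above
    emit("exist", 1)
    emit("x", value)

totals = {key: sum(vals) for key, vals in grouped.items()}   # reduce phase
print(totals["x"] / float(totals["exist"]))   # => 4.25
</pre>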
 
===Tutorials===

* http://www.cloudera.com/videos/introduction_to_pig

===How to Debug===

* To debug a streaming Hadoop process, cat your input file and pipe it to the mapper, then to sort, then to the reducer; this reproduces Hadoop Streaming's map, shuffle, and reduce stages locally (a sketch of such scripts follows the example below)
** Ex: cat princess_bride.txt | scripts/word-count/mapper.py | sort | scripts/word-count/reducer.py
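
For reference, a minimal sketch of what such streaming scripts might look like (the paths come from the example above; the script bodies here are illustrative assumptions). Hadoop Streaming feeds input lines on stdin and expects tab-separated key/value lines on stdout, and the reducer receives its lines sorted by key:

<pre>
#!/usr/bin/env python
# mapper.py (assumed): emit (word, 1) for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
</pre>

<pre>
#!/usr/bin/env python
# reducer.py (assumed): input arrives sorted by key, so all counts
# for a given word are contiguous and can be summed in one pass
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))
</pre>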
  
 
===Tools===
 
* Hadoop
* Hive
* Pig: a high-level language for compiling down to MapReduce programs
* MapReduce on Amazon (?)