Machine Learning/Hadoop

From Noisebridge
Latest revision as of 21:05, 9 June 2010

About

  • Google had so much data that just reading it from disk took a long time, let alone processing it
    • So they needed to parallelize everything, even disk access
    • Make the processing local to where the data is, to avoid network bottlenecks
  • Parallelization is hard/error-prone
    • Want to have a "shared-nothing" architecture
    • Functional programming
  • Map

Runs the function on each item in the list and returns the list of results

def map(func, items):   # shadows Python's builtin map; illustrative only
  return [func(item) for item in items]

Example:

def twice(num):
  return num*2

map(twice, [1, 2, 3])   # returns [2, 4, 6]
  • Reduce

Takes a function (of two arguments) and a list, and iterates through the list, accumulating a result

def reduce(func, items):
  a = func(items[0], items[1])
  for item in items[2:]:
    a = func(a, item)
  return a
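For instance, reducing a list with addition collapses it to a sum. A minimal runnable sketch (my_reduce and add are illustrative names, renamed to avoid shadowing Python's builtins):

```python
def my_reduce(func, items):
  # fold the list left-to-right, accumulating into a
  a = func(items[0], items[1])
  for item in items[2:]:
    a = func(a, item)
  return a

def add(x, y):
  return x + y

my_reduce(add, [1, 2, 3, 4])   # returns 10
```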

Examples/Actual

def map(key, value):
  # process one input record
  emit(another_key, another_value)

def reduce(key, values):
  # process the key and all values associated with it
  emit(something)
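The emit-style pseudocode above can be simulated in a single Python process. This is a toy sketch, not how Hadoop actually runs things; run_mapreduce and the word-count mapper/reducer are illustrative names. The shuffle between map and reduce is just grouping emitted values by key:

```python
from collections import defaultdict

def map_wordcount(key, value):
  # key: line number, value: the text of that line
  for word in value.split():
    yield (word, 1)

def reduce_wordcount(key, values):
  # values: every 1 emitted for this word
  yield (key, sum(values))

def run_mapreduce(mapper, reducer, records):
  # shuffle step: group every mapped value by its key
  groups = defaultdict(list)
  for key, value in records:
    for k, v in mapper(key, value):
      groups[k].append(v)
  results = {}
  for k, vals in groups.items():
    for out_key, out_val in reducer(k, vals):
      results[out_key] = out_val
  return results

counts = run_mapreduce(map_wordcount, reduce_wordcount, [(1, "to be or not to be")])
# counts == {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```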
  • Average
    • keys are line numbers, values are what's on each line
    • file (line contents, with the resulting (key, value) pair):
      • 2 (1,2)
      • 4 (2,4)
      • 5 (3,5)
      • 6 (4,6)
def map(key,value):
  emit("exist",1)
  emit("x",value)
def reduce(key, values):
  # summing "exist" gives the record count; summing "x" gives the total
  emit(key, sum(values))
  # average = total / count, computed from the two reduce outputs
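The average job above can be simulated in plain Python. A sketch, with map_avg/reduce_avg as illustrative names: the input records are the (line, value) pairs from the file listing, and the final division happens outside the reduce, using its two outputs:

```python
from collections import defaultdict

def map_avg(key, value):
  yield ("exist", 1)    # one count per record
  yield ("x", value)    # the value itself

def reduce_avg(key, values):
  yield (key, sum(values))

records = [(1, 2), (2, 4), (3, 5), (4, 6)]
groups = defaultdict(list)
for k, v in records:
  for mk, mv in map_avg(k, v):
    groups[mk].append(mv)
out = dict(pair for key in groups for pair in reduce_avg(key, groups[key]))
average = float(out["x"]) / out["exist"]   # 17 / 4 = 4.25
```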



Tutorials

  • http://www.cloudera.com/videos/introduction_to_pig

How to Debug

  • To debug a streaming Hadoop process, cat your source file, pipe it to the mapper, then to sort, then to the reducer
    • Ex: cat princess_bride.txt | scripts/word-count/mapper.py | sort | scripts/word-count/reducer.py
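The actual scripts/word-count files aren't shown on this page, but a streaming word-count mapper/reducer pair typically looks like the following sketch (Hadoop Streaming passes one record per line on stdin and expects tab-separated key/value pairs on stdout; here the shell pipeline's cat/sort stages are simulated in memory):

```python
def mapper(lines):
  # mapper.py equivalent: emit one "word<TAB>1" record per word
  for line in lines:
    for word in line.split():
      yield "%s\t1" % word

def reducer(records):
  # reducer.py equivalent: input is sorted, so all records for a
  # word are adjacent and can be summed in a single pass
  current, count = None, 0
  for rec in records:
    word, n = rec.rsplit("\t", 1)
    if word != current:
      if current is not None:
        yield "%s\t%d" % (current, count)
      current, count = word, 0
    count += int(n)
  if current is not None:
    yield "%s\t%d" % (current, count)

counts = list(reducer(sorted(mapper(["as you wish", "as you wish as you wish"]))))
# counts == ['as\t3', 'wish\t3', 'you\t3']
```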

Tools

  • Hadoop
  • Hive
  • Pig: A high-level language for compiling down to MapReduce programs
  • MapReduce on Amazon (?)