[ml] Hadoop going forward

Vikram Oberoi voberoi at gmail.com
Sun May 23 15:43:54 PDT 2010


Ah yeah, sharing all this data via USB stick is not ideal. :-/

It appears that S3 has access controls. Take a look at this post:
http://stackoverflow.com/questions/1529869/making-files-uploaded-to-s3-public
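
For what it's worth, here is a rough sketch of what group-only read access
might look like. It uses an S3 bucket policy rather than the per-file ACLs
discussed in that post, which is my substitution, not something the post
covers. The bucket name and the account IDs are made up for illustration, and
the helper shared_read_policy is hypothetical. The script only builds and
prints the policy JSON; it does not talk to AWS, and it says nothing about
what the transfers would cost.

```python
import json


def shared_read_policy(bucket, account_ids):
    """Build a bucket policy that lets the given AWS accounts list and read a bucket.

    `bucket` and `account_ids` are placeholders supplied by the caller;
    nothing here is specific to our group's real accounts.
    """
    # Each member's account is named as a principal by its root ARN.
    principals = ["arn:aws:iam::%s:root" % acct for acct in account_ids]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "GroupListBucket",
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": "s3:ListBucket",
                "Resource": "arn:aws:s3:::%s" % bucket,
            },
            {
                # Reading applies to the objects inside the bucket.
                "Sid": "GroupReadObjects",
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::%s/*" % bucket,
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket name and account IDs, for illustration only.
    policy = shared_read_policy("nb-ml-kdd-data", ["111111111111", "222222222222"])
    print(json.dumps(policy, indent=2))
```

Whoever owns the bucket would attach the printed JSON as the bucket policy;
everyone else keeps their own account, which fits the individual-accounts
setup we settled on.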

Vikram

On Sat, May 22, 2010 at 6:10 PM, Andreas von Hessling <vonhessling at gmail.com> wrote:

> An idea:
> Could we upload our data to a central S3 location that each member of
> our ML group could access from EC2?  Or would that incur costs?  In
> other words, could we re-use the raw and pre-processed data among us
> without incurring cost?  It seems funny that each of us pre-processes
> the data individually and we share this via USB stick :-)
>
> Andy
>
>
> On Fri, May 21, 2010 at 12:24 AM, Andreas von Hessling
> <vonhessling at gmail.com> wrote:
> > Yes, I see what you're saying.  Especially since we are having ML
> > members successfully run your setup instructions -- let's go with
> > individual installations for now.  I'm really looking forward to
> > running ML algorithms on Hadoop.
> >
> > On Thu, May 20, 2010 at 5:37 PM, Vikram Oberoi <voberoi at gmail.com> wrote:
> >> Hey Andreas,
> >> That's a great idea -- I'll work on that for next week.
> >> As for having a common AWS account for all things ML at Noisebridge, I think
> >> it would be better for everyone to have individual accounts for a couple of
> >> reasons:
> >> - You can only run one job at a time on EMR. If you have a few people
> >> testing out hypotheses/running a few different algorithms at the same time,
> >> it'll become a point of contention and kill productivity.
> >> - AWS allows a user to provision 20 machines at most. If we have, say, 10
> >> machines per cluster, that's only 2 clusters we can do things on at any
> >> given time.
> >> - There's the payment issue. I'm not concerned that people/NB won't be
> >> willing to contribute to our we-need-machines fund, but I am concerned about
> >> who foots the bill when things go awry. What if we provision a high-end
> >> cluster that we're all working with one day and all of us forget to kill it
> >> for a week? Or, what if a bug in one of our scripts causes us to use a ton
> >> of incoming/outgoing S3 bandwidth? There are a bunch of things that can go
> >> wrong and cause us to accrue some major AWS costs, and that's when things
> >> get ugly.
> >> Finally, it's actually rather easy to set up your own environment where you
> >> can easily spin up clusters, launch jobs, and fetch results. All it takes is
> >> 20 minutes of (annoying) upfront work and you're good to go. With some
> >> better documentation, I can probably have you guys up and running in 5 - 10
> >> minutes, and I'll work on doing that.
> >> Thoughts?
> >> Vikram
> >> On Thu, May 20, 2010 at 11:56 AM, Andreas von Hessling
> >> <vonhessling at gmail.com> wrote:
> >>>
> >>> Vikram,
> >>>
> >>> From my perspective you could contribute the most in setting up a
> >>> Hadoop + Mahout infrastructure and documenting the setup process and
> >>> the hello-world mapreduce program etc.  While we went through this
> >>> yesterday (thanks) I feel like people will actually get to DO the
> >>> things they learned later; so a written reference (new wiki page)
> >>> would be great, because these questions will be asked over and over.
> >>> Even better, and this is just an idea:  can we set up a shared AWS
> >>> account so each of us doesn't have to install everything by himself?  I
> >>> know there's the question of who pays for it, but that aside, are
> >>> there technical restrictions why we could not share an account?  One
> >>> approach would be each of us throws in $10, or perhaps there's a way to
> >>> split the bills between us according to usage, or, even better we
> >>> could push Noisebridge Inc to give us some allowance.  Getting a
> >>> turnkey cloud Mahout infrastructure for Noisebridge would be H-U-G-E,
> >>> even if it would not be ready in time for KDD submission.  Feel free
> >>> to take the lead on that initiative.  You would go down in the history
> >>> books of NB as a hero :-)
> >>>
> >>> Erin and Mike are already working on transforming the data, so I think
> >>> we already have lots of manpower on that end.
> >>>
> >>> Let's tentatively plan this Sunday night to get together again.  Erin
> >>> also mentioned she'd like to meet again before the next Wednesday.  I
> >>> can give an impromptu talk about classifiers/machine learning problem
> >>> setups.
> >>> Will confirm.
> >>>
> >>> Andy
> >>>
> >>>
> >>>
> >>> On Wed, May 19, 2010 at 11:38 PM, Vikram Oberoi <voberoi at gmail.com> wrote:
> >>> > Hey folks,
> >>> > For those of you that came out tonight, I hope the code I walked through
> >>> > and initial (albeit rough) overview of MapReduce helped. If you guys have
> >>> > any questions or requests, the best way to ask would be to:
> >>> > a) direct an email to me over ml at lists.noisebridge.net or...
> >>> > b) open an issue at the GitHub project:
> >>> > http://github.com/voberoi/hadoop-mrutils
> >>> > Either way, someone else might be able to answer first, and everyone
> >>> > will benefit from the answer, as there's a high probability that everyone
> >>> > will have the same questions.
> >>> > For next week, I'm going to write a script that transforms the KDD dataset
> >>> > in... some useful way. Your guys' input on what exactly I should do here is
> >>> > most welcome. The transformation should be involved enough that the code can
> >>> > serve as an example for scripts you all might implement later.
> >>> > I'll also be taking a look at Apache Mahout (a library containing Hadoop
> >>> > MapReduce implementations of numerous machine learning algorithms) and
> >>> > writing up an example of how to use it. If you have a particular algorithm
> >>> > that you want to apply to the dataset, check if it's in the Mahout library
> >>> > and let me know.
> >>> > Finally, is any brainstorming/discussion about what we're doing happening
> >>> > anywhere other than the meetups? I'd be happy to meet again some time before
> >>> > next Wednesday to hash out some ideas and run with them, as in-person
> >>> > conversation bandwidth is *so* much higher. Alternatively, we could throw out
> >>> > ideas on the list and brainstorm over email threads. It doesn't seem like
> >>> > there's a whole lot of action on the wiki other than links to resources and
> >>> > TODOs. Or is there?
> >>> > Vikram
> >>> > _______________________________________________
> >>> > ml mailing list
> >>> > ml at lists.noisebridge.net
> >>> > https://www.noisebridge.net/mailman/listinfo/ml
> >>> >
> >>> >
> >>
> >>
> >
>