Machine Learning Meetup Notes:2011-4-13

From Noisebridge
Revision as of 20:23, 13 April 2011 by Mschachter

Anthony Goldbloom from Kaggle Visits

  • The winner of the HIV progression competition used random forests. He taught himself machine learning from YouTube videos. The name "Random Forests" is trademarked. Random forests are fairly robust to new data.
    • He used the caret package in R to fit the random forests.
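The idea behind that robustness is easy to sketch: a random forest bags many randomized trees on bootstrap samples and lets them vote, so no single tree's quirks dominate. A toy illustration in Python, with depth-1 "stumps" standing in for full decision trees (in practice you'd reach for caret or randomForest in R, not roll your own):

```python
import random
from collections import Counter

def train_stump(X, y, feature):
    # Threshold at the mean of the chosen feature; each side predicts
    # the majority label of the training rows that fall on it.
    vals = [x[feature] for x in X]
    thresh = sum(vals) / len(vals)
    left = [yi for x, yi in zip(X, y) if x[feature] <= thresh]
    right = [yi for x, yi in zip(X, y) if x[feature] > thresh]
    left_label = Counter(left).most_common(1)[0][0]
    right_label = Counter(right).most_common(1)[0][0] if right else left_label
    return (feature, thresh, left_label, right_label)

def train_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        # Bagging: bootstrap-sample the rows, randomize the feature.
        idx = [rng.randrange(n) for _ in range(n)]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        forest.append(train_stump(Xb, yb, rng.randrange(d)))
    return forest

def predict(forest, x):
    # Majority vote across all stumps.
    votes = Counter()
    for feature, thresh, left, right in forest:
        votes[left if x[feature] <= thresh else right] += 1
    return votes.most_common(1)[0][0]
```

Even with such weak individual trees, the vote over 25 bootstrap replicates smooths out sampling noise, which is the property the speaker was pointing at.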
  • Kaggle splits the test dataset in two and uses one half for the public leaderboard.
  • Often the score difference between the winning model and second place is not statistically significant, so they award prizes to the top few entries. They might impose restrictions on a model's execution time.
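That half-and-half split is simple to sketch: shuffle the hidden test ids, score the public leaderboard on one half, and hold the other half back for the final ranking. This is an illustrative toy, not Kaggle's actual implementation:

```python
import random

def split_leaderboard(test_ids, seed=0):
    # Shuffle the hidden test rows, then split: one half drives the
    # public leaderboard, the other is reserved for final scoring.
    rng = random.Random(seed)
    ids = list(test_ids)
    rng.shuffle(ids)
    mid = len(ids) // 2
    return set(ids[:mid]), set(ids[mid:])

def accuracy(predictions, truth, ids):
    # Score a submission on one leaderboard half only.
    hits = sum(predictions[i] == truth[i] for i in ids)
    return hits / len(ids)
```

Keeping the private half unseen is what lets the final ranking detect models that merely overfit the public leaderboard.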
  • Performance generally plateaus within a few weeks of a competition starting. This seems to happen once all the information has been "squeezed" out of the dataset.
  • Chess rating competition: build a new rating system that predicts game results more accurately. Performance still plateaued, but it took longer.
  • Most Kaggle users come from computer science and statistics, followed by economics, math, and biostatistics.
  • Tools people use:
    • R: lots of American users
    • Matlab
    • SAS
    • Weka
    • SPSS
    • Python: lower on the list, but people are successful with it
  • R packages/functions used: caret, rfe, glm, nnet, forecast
  • Heritage Prize
    • Real shit goes down May 4th, with the release of all the datasets.
    • The competition runs for two years. No rush.
    • Four prizes in total, given out throughout the next two years.