[ml] KDD cup submission status

Andreas von Hessling vonhessling at gmail.com
Tue Jun 8 10:21:53 PDT 2010

Hi Thomas,

How's it going? What's your status? Are you still working on this?
Have you attempted to submit your results on your own? What's your

On my side, I'm finally finishing the discretization of all numeric
features and will then push the data through the incremental Naive
Bayes (NB) classifier.  Initial attempts have yielded only mediocre
performance.  The skills may be the key to good scores; this is also
suggested by the "fact sheet" questionnaire that the organizers have
put up (pasted below), which asks revealing questions.
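The pipeline described above (discretize the numeric features, then
feed rows one at a time to an incremental Naive Bayes classifier) can
be sketched roughly as follows. This is only an illustrative Python
sketch, not the Weka setup actually used: the equal-width binning, the
bin count, the Laplace smoothing, and the names equal_width_bin and
IncrementalNB are all assumptions.

```python
import math
from collections import defaultdict


def equal_width_bin(value, lo, hi, n_bins=10):
    """Map a numeric value in [lo, hi] to a bin index 0..n_bins-1 (assumed scheme)."""
    if hi <= lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)  # clamp, so value == hi lands in the last bin


class IncrementalNB:
    """Naive Bayes over nominal features, updated one row at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                     # Laplace smoothing (assumption)
        self.class_counts = defaultdict(int)   # label -> number of rows seen
        self.value_counts = defaultdict(int)   # (feature index, value, label) -> count
        self.seen_values = defaultdict(set)    # feature index -> distinct values seen

    def update(self, features, label):
        """Fold one training row into the counts; the row itself is not stored."""
        self.class_counts[label] += 1
        for i, v in enumerate(features):
            self.value_counts[(i, v, label)] += 1
            self.seen_values[i].add(v)

    def predict_proba(self, features):
        """Return {label: posterior probability}; needs at least one update() first."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for i, v in enumerate(features):
                k = len(self.seen_values[i] | {v})  # number of possible values
                count = self.value_counts.get((i, v, label), 0)
                score += math.log((count + self.alpha) / (n + self.alpha * k))
            log_scores[label] = score
        top = max(log_scores.values())  # subtract the max for numerical stability
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: e / z for c, e in exp.items()}
```

Because update() only touches counters, memory use grows with the
number of distinct feature values rather than with the number of rows,
which is the point of an incremental classifier on a dataset this size.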

Here are the features; those marked with * have been discretized:
   row INT,
   studentid VARCHAR(30),
   problemhierarchy TEXT,
   problemname TEXT, (many thousands of nominal values; may be
       ignored by the ML algorithms)
*   problemview INT,
   problemstepname TEXT,
   cfa INT,
*   iq REAL,
*   iqstrength REAL,
*   chance REAL,
*   chancestrength REAL,
*   numsub REAL, (number of subskills required for this step)
*   numtraced REAL (number of traced skills)

I can provide this dataset.

Depending on the dataset size, I may try to push it through libsvm.
I'll also try MOA, so I may have a few questions about running both of
them later.
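For reference, libsvm reads a sparse text format in which each line is
"<label> <index>:<value> ..." with 1-based, ascending feature indices.
A minimal sketch of writing one row in that format follows; the helper
name to_libsvm_line and the sample values are made up for illustration,
and using cfa as the label is an assumption.

```python
def to_libsvm_line(label, values):
    """Format one row as '<label> <index>:<value> ...' with 1-based indices.

    Zero-valued features are omitted, as the sparse format allows.
    """
    parts = [str(label)]
    for index, value in enumerate(values, start=1):
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


# Hypothetical row: label (e.g. cfa) = 1, three numeric feature values.
print(to_libsvm_line(1, [0.5, 0, 3]))
```

Nominal columns such as problemhierarchy would first need a numeric
encoding (for example one indicator feature per value) before they can
be written this way.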



Title of the contribution*

Provide a title for your team's contribution that will appear in the results.

Supplementary online material

Provide a URL to a web page, technical memorandum, or a paper.

Provide a general summary with relevant background information: Where
does the method come from? Is it novel? Name the prior art.
Used Weka and bash scripts to preprocess the data extensively;
attempted Weka's incremental classifiers (e.g. Naive Bayes Updateable)
to produce predictions on the large amount of data. No new ML

Summarize the algorithms you used in a way that those skilled in the
art should understand what to do. Profile of your methods as follows:

Data exploration and understanding

Did you use data exploration techniques to

Identify selection biases
Identify temporal effects (e.g. students getting better over time)
Understand the variables
Explore the usefulness of the KC models
Understand the relationships between the different KC types

Please describe your data understanding efforts, and interesting observations:
Student IQ (% correct for each student) was a very valuable variable;
it lifted us into the 50th percentile of submissions. Many features
are not available in the test set, so they have been removed. It seems
that analysis of the KC models (not performed) is necessary to get into the top

Feature generation

Features designed to capture the step type (e.g. enter given, or ... )
Features based on the textual step name
Features designed to capture the KC type
Features based on the textual KC name
Features derived from opportunity counts
Features derived from the problem name
Features based on student ID
Other features

Details on feature generation:
Student IQ = % correct by student.
Step chance = % correct attempts.
IQ/chance strength = total counts of attempts.
% of features required in each step.
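
The IQ and chance features described above can be computed in a single
pass over the training rows. A minimal sketch, assuming each row is a
(studentid, step, cfa) tuple with cfa equal to 1 for a correct first
attempt and 0 otherwise; the function name rate_features and the use of
the step name as the grouping key are assumptions, not the actual
preprocessing.

```python
from collections import defaultdict


def rate_features(rows):
    """Return per-student and per-step (fraction correct, attempt count) pairs."""
    student = defaultdict(lambda: [0, 0])  # studentid -> [correct, attempts]
    step = defaultdict(lambda: [0, 0])     # step key  -> [correct, attempts]
    for studentid, step_key, cfa in rows:
        for counts in (student[studentid], step[step_key]):
            counts[0] += cfa
            counts[1] += 1
    # Each value is (rate, strength): e.g. (iq, iqstrength) for a student.
    iq = {s: (c / n, n) for s, (c, n) in student.items()}
    chance = {k: (c / n, n) for k, (c, n) in step.items()}
    return iq, chance


# Hypothetical toy data: two students, two steps.
rows = [("s1", "a", 1), ("s1", "b", 0), ("s2", "a", 1)]
iq, chance = rate_features(rows)
print(iq["s1"], chance["a"])
```

Students or steps that never appear in the training rows have no entry
in these dictionaries, so a fallback (for instance the overall fraction
correct) would be needed when scoring the test set.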
Feature selection

Feature ranking with correlation or other criterion (specify below)
Filter method (other than feature ranking)
Wrapper with forward or backward selection (nested subset method)
Wrapper with intensive search (subsets not nested)
Embedded method
Other method not listed above (specify below)

Details on feature selection:

Did you attempt to identify latent factors?

Cluster students
Cluster knowledge components
Cluster steps
Latent feature discovery was performed jointly with learning

Details on latent factor discovery (techniques used, useful
student/step features, how were the factors used, etc.):

Other preprocessing

Filling missing values (for KC)
Principal component analysis

More details on preprocessing:

Base classifier

Decision tree, stump, or Random Forest
Linear classifier (Fisher's discriminant, SVM, linear regression)
Non-linear kernel method (SVM, kernel ridge regression, kernel
logistic regression)
Bayesian Network (other than Naïve Bayes)
Neural Network
Bayesian Neural Network
Nearest neighbors
Latent variable models (e.g. matrix factorization)
Neighborhood/correlation based collaborative filtering
Bayesian Knowledge Tracing
Additive Factor Model
Item Response Theory
Other classifier not listed above (specify below)

Loss Function

Hinge loss (like in SVM)
Square loss (like in ridge regression)
Logistic loss or cross-entropy (like in logistic regression)
Exponential loss (like in boosting)
Don't know
Other loss (specify below)

One-norm (sum of weight magnitudes, like in Lasso)
Two-norm (||w||^2, like in ridge regression and regular SVM)
Structured regularizer (like in group lasso)
Don't know
Other (specify below)

Ensemble Method

Bagging (check this if you use Random Forest)
Other ensemble method

Were you able to use information present only in the training set?

Corrects, incorrects, hints
Step start/end times

Did you use post-training calibration to obtain accurate probabilities?

Did you make use of the development data sets for training?


Details on classification:

Model selection/hyperparameter selection

We used the online feedback of the leaderboard.
K-fold or leave-one-out cross-validation (using training data)
Virtual leave-one-out (closed-form estimation of LOO with a single
classifier training)
Out-of-bag estimation (for bagging methods)
Bootstrap estimation (other than out-of-bag)
Other cross-validation method
Bayesian model selection
Penalty-based method (non-Bayesian)
Bi-level optimization
Other method not listed above (specify below)

Details on model selection:

A reader should also know from reading the fact sheet what the
strength of the method is.

Please comment about the following:
Quantitative advantages (e.g., compact feature subset, simplicity,
computational advantages).

Qualitative advantages (e.g. compute posterior probabilities,
theoretically motivated, has some elements of novelty).

Other methods. List other methods you tried.

How helpful did you find the included KC models?

Crucial in getting good predictions
Somewhat helpful in getting good predictions
Not particularly helpful

If you learned latent factors, how helpful were they?

Crucial in getting good predictions
Somewhat helpful in getting good predictions
Not particularly helpful

Details on the relevance of the KC models and latent factors:

Software Implementation


Proprietary in-house software
Commercially available in-house software
Freeware or shareware in-house software
Off-the-shelf third party commercial software
Off-the-shelf third party freeware or shareware

Other (specify below)

Details on software implementation:

Hardware implementation


Linux or other Unix
Mac OS
Other (specify below)

<= 2 GB
<= 8 GB
>= 8 GB
>= 32 GB

Multi-processor machine
Run in parallel different algorithms on different machines
Other (specify below)

Details on hardware implementation. Specify whether you provide a
self-contained application or libraries.

Code URL

Provide a URL for the code (if available):

Competition Setup
