[ml] Edit distance data and code

Adam Bossy adambossy at gmail.com
Mon Jul 12 13:12:56 PDT 2010

Folks, I've pushed the edit distance data, and the code that calculates
it, to our git repository. There are four files in total:

1) src/cluster_rtseq.py - The code that computes the Levenshtein
distance between string pairs and calls the scipy clustering algorithm
2) src/sample.py - Sample code for hierarchical clustering with scipy
3) src/print_matrix.py - Prints the edit distance matrix (per Theo and
Erin's request)
4) data/similarity_matrix.csv - The output of print_matrix.py. Feel
free to tweak print_matrix.py to match your needs
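
For anyone who wants the idea without reading the repo first, here is a
minimal sketch of a pairwise Levenshtein matrix written out as CSV. It
is not the code in cluster_rtseq.py or print_matrix.py. The function
names and the CSV layout (labels in the first row and first column) are
my own assumptions for illustration, and it uses only the standard
library.

```python
import csv
import io


def levenshtein(a, b):
    """Return the Levenshtein (edit) distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


def distance_matrix(strings):
    """Build the symmetric matrix of pairwise edit distances."""
    n = len(strings)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = levenshtein(strings[i], strings[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


def matrix_to_csv(strings, matrix):
    """Render the matrix as CSV text (hypothetical layout, labels on both axes)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([""] + list(strings))
    for label, row in zip(strings, matrix):
        writer.writerow([label] + row)
    return out.getvalue()
```

Calling matrix_to_csv(strings, distance_matrix(strings)) on a short list
of strings gives text you can write to a file and open in a spreadsheet.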

You'll need to install scipy and numpy to run any of this code.

I'm doing this on a remote slice -- if anybody can get this running
with the matplotlib package for visualizations, that would be great.
We could then visualize the dendrogram output. I messed up the Python
install on my Mac so I won't be able to set it up without going
through the painful process of fixing it.
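
For whoever picks this up, a rough sketch of what the dendrogram step
could look like, assuming scipy, numpy and matplotlib are installed. It
selects matplotlib's non-interactive Agg backend and saves the figure to
a file, so it should also work on a machine with no display. The
function name, the output path argument and the choice of average
linkage are my assumptions, not what cluster_rtseq.py does.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend: render to a file, no display needed

import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform


def plot_dendrogram(matrix, labels, path):
    """Cluster a square distance matrix and save its dendrogram to path.

    matrix must be symmetric with a zero diagonal; labels has one entry
    per row. Returns the linkage array produced by scipy.
    """
    # linkage() expects the condensed (upper-triangle) form of the matrix
    condensed = squareform(np.asarray(matrix, dtype=float))
    tree = linkage(condensed, method="average")  # assumption: average linkage

    fig, ax = plt.subplots(figsize=(8, 4))
    dendrogram(tree, labels=list(labels), ax=ax)
    ax.set_ylabel("edit distance")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return tree
```

Passing the edit distance matrix, the matching list of strings and an
output filename such as "dendrogram.png" writes the plot to disk, which
can then be copied off the remote machine.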

