[ml] Kaggle HIV update
mike at mindmech.com
Tue Jun 22 17:27:59 PDT 2010
Unfortunately, I don't think the sequences are amino acid sequences.
For the PR sequences, most of them have a length of 297. If it's a
DNA sequence, then this means it codes for 99 amino acids (297 / 3 = 99).
A quick look shows that HIV-1 Protease (the protein whose sequence we're
dealing with in the first sequence column) has 99 amino acids.
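
To make the length argument concrete, here is a minimal sketch of translating a DNA string into amino acids three bases at a time. It assumes the standard genetic code and that the sequences are plain strings over A/C/G/T; the names `CODON_TABLE` and `translate` are made up for illustration and are not from the repository.

```python
# Standard genetic code, built from the usual TCAG ordering of codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((a, b, c) for a in BASES for b in BASES for c in BASES), AMINO
    )
}


def translate(dna):
    """Translate a DNA string codon by codon ('*' marks a stop codon).

    Codons containing anything other than A/C/G/T become 'X', and
    trailing bases that do not fill a whole codon are ignored.
    """
    dna = dna.strip().upper()
    usable = len(dna) - len(dna) % 3
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


if __name__ == "__main__":
    # A made-up 18-base fragment, just to show the 3:1 length ratio.
    fragment = "CCTCAGATCACTCTTTGG"
    print(translate(fragment), len(fragment) // 3)
```

A 297-base sequence run through `translate` gives a 99-character string, which is what we'd expect if the PR column is DNA coding for the protease.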
Does that make sense? If it does, then the sequences from the data are
just noisy and of poor quality, and we're going to have to throw out some
of the noisy data before running it through a sequence aligner. I'm in the
process of doing this now, and will let everyone know how things are coming
along at the meeting.
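
As a rough sketch of what throwing out the noisy data could look like: the rule below (exactly 297 bases, only A/C/G/T) is an assumption for illustration, not necessarily the filter that will end up being used, and `is_clean` / `filter_clean` are hypothetical names.

```python
VALID_BASES = set("ACGT")


def is_clean(seq, expected_len=297):
    """True if seq has the expected length and only unambiguous bases."""
    seq = seq.strip().upper()
    return len(seq) == expected_len and set(seq) <= VALID_BASES


def filter_clean(seqs, expected_len=297):
    """Keep only sequences that pass is_clean, normalized to upper case."""
    return [s.strip().upper() for s in seqs if is_clean(s, expected_len)]


if __name__ == "__main__":
    # Toy inputs: one good, one too short, one with an ambiguous base.
    samples = ["A" * 297, "A" * 296, "N" + "A" * 296]
    print(len(filter_clean(samples)))
```

A stricter or looser rule (for example, tolerating a few ambiguous bases) would only change `is_clean`.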
See everyone tonight!
On Tue, Jun 22, 2010 at 8:37 AM, David Faden <dfaden at gmail.com> wrote:
> It looks like the sequences are already coded in terms of amino acids
> rather than nucleotide triples?
> On Mon, Jun 21, 2010 at 10:29 PM, Thomas Lotze <thomas.lotze at gmail.com> wrote:
>> I committed some Python for generating base pair triplet count features,
>> and R code for determining frequency and doing a basic GLM including the
>> most frequent triplets.
>> (The Noisebridge machine learning sourceforge git repository is here:
>> https://sourceforge.net/scm/?type=git&group_id=326816 To download the
>> files, run "git clone git://
>> or, better yet, ask Mike to give you read/write access to this project so
>> you can upload code as well)
>> This got me to 53.8462 MCE, 36th out of 49 teams.
>> See you tomorrow night at 9 for fun with Hadoop!
>> ml mailing list
>> ml at lists.noisebridge.net