[ml] Kaggle HIV update

Mike Schachter mike at mindmech.com
Tue Jun 22 19:31:44 PDT 2010


I found a post on the Kaggle forum that explains what the non-standard
letters mean. It links to this page on IUPAC codes:

http://www.dna.affrc.go.jp/misc/MPsrch/InfoIUPAC.html
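
For anyone who wants to flag those letters in code, here is a minimal sketch. It assumes the sequences are nucleotide strings and that the non-standard letters are the standard IUPAC nucleotide ambiguity codes. The names IUPAC_NUCLEOTIDES and ambiguous_positions are made up for illustration and are not from our repository.

```python
# Standard IUPAC nucleotide ambiguity codes: each letter stands for a set of bases.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def ambiguous_positions(seq):
    """Return (index, letter, possible_bases) for every non-ACGT letter in seq.

    Letters that are not in the IUPAC table (gaps, for example) are reported
    with possible_bases set to None so they can be handled separately.
    """
    hits = []
    for i, letter in enumerate(seq.upper()):
        if letter in "ACGT":
            continue
        hits.append((i, letter, IUPAC_NUCLEOTIDES.get(letter)))
    return hits


if __name__ == "__main__":
    # Toy input, not taken from the competition data.
    for hit in ambiguous_positions("ACGTRYN"):
        print(hit)
```

Counting the hits per sequence could be one way to decide which rows are too noisy to keep, though where to put the cutoff is a judgment call.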

   mike


On Tue, Jun 22, 2010 at 5:27 PM, Mike Schachter <mike at mindmech.com> wrote:

> Hey David,
>
> Unfortunately I don't think the sequences are amino acid sequences.
>
> For the PR sequences, most of them have a length of 297. If it's a
> DNA sequence, then at three nucleotides per amino acid it codes for
> 297 / 3 = 99 amino acids. A quick look shows that HIV-1 Protease (the
> protein whose sequence we're dealing with in the first sequence column)
> is 99 amino acids long:
>
> http://www.bioafrica.net/proteomics/POL-PRprot.html
>
> Does that make sense? If it does, then the sequences from the data are
> just noisy and of poor quality, and we're going to have to throw out some
> of the noisy data before running it through a sequence aligner. I'm in the
> process of doing this now, and will let everyone know how things are coming
> along at the meeting.
>
> See everyone tonight!
>
>    mike
>
>
>
>
> On Tue, Jun 22, 2010 at 8:37 AM, David Faden <dfaden at gmail.com> wrote:
>
>> It looks like the sequences are already coded in terms of amino acids
>> rather than nucleotide triples?
>> <http://www.biogem.org/Accelrys/Sequencing/symbols_amino_acids.html>
>>
>> On Mon, Jun 21, 2010 at 10:29 PM, Thomas Lotze <thomas.lotze at gmail.com> wrote:
>>
>>> I committed some Python for generating base pair triplet count features,
>>> and R code for determining frequency and doing a basic GLM including the
>>> most frequent triplets.
>>> (The Noisebridge machine learning sourceforge git repository is here:
>>> https://sourceforge.net/scm/?type=git&group_id=326816  To download the
>>> files, run
>>> "git clone git://ml-noisebridge.git.sourceforge.net/gitroot/ml-noisebridge/ml-noisebridge"
>>> or, better yet, ask Mike to give you read/write access to this project so
>>> you can upload code as well)
>>>
>>> This got me to 53.8462 MCE, 36th out of 49 teams.
>>>
>>> See you tomorrow night at 9 for fun with Hadoop!
>>> -Thomas
>>>
>>> _______________________________________________
>>> ml mailing list
>>> ml at lists.noisebridge.net
>>> https://www.noisebridge.net/mailman/listinfo/ml
>>>
>>>
>>
>