fragment of data extraction

Posted by rick (rick), 14 October 2007
Hi Graham, thanks for the previous help. My problem now is that I want to extract data from one file and write it to another. However, I want to extract only part of each line, not the whole line, and at the moment the whole line gets copied rather than just the portion I want. For example:

>gi|110554|avian influenza virus type and groups

From lines of this type (they occur in the files followed by a gene sequence, and the number after gi varies) I want to extract only the >gi|110554 portion and store these in a separate file. I then want to make files named after these gi numbers, each containing the sequence that follows that gi line. For example:

>gi|70981541|ref|XM_741207.1| Aspergillus fumigatus Af293 amino acid transporter, putative (AFUA_4G01570) mRNA
ATGGGTTCCATGCACGAGGCAGGATCCCGTCCTGCTGCGGGTGCTGATATGGACACAGATCGTGTACATC
>gi|55047815|ref|XM_741207.1| Aspergillus fumigatus Af293 amino acid transporter, putative (AFUA_4G01570) mRNA
CTGAAGCTGTTTCCGATAACGAGCGTGACTTTGAGAAGCAGGACTCGAAACCAGAGTATCAGGATGCATT

So the file named >gi|55047815 will contain only the sequence next to it. I have been through many books over the last 2 days but have not managed to do this. Please help.

Posted by admin (Graham Ellis), 14 October 2007
Here's a program that reads a stream of data and puts each sequence into a separate file named after the sequence number. Code:
and my test output Code:
Hope it helps ... please do post up sample code if you have a follow-up question; we're very much a forum that's set up to help with coding rather than to provide complete utilities!

Posted by rick (rick), 14 October 2007
Hi Graham, thanks for the help. But this program is not working; when I used it, it did not create any file named after a sequence. The program I used before to copy the whole >gi lines is:

open (FH,"FASTA.fa") or die;
open (FHO,">d.txt") or die;
while (<FH>) {
    /^>gi/ and print FHO;
}
close FH;
close FHO;

But it copies the whole line. Please modify it or tell me how to separate out only the gi numbers, i.e. only

>gi|33318110

and not

>gi|33318110|gb|AF508640| /Avian/1(PB2)/H9N2/South Africa/1995/// Influenza A virus (A/Ostrich/South Africa/9508103/95(H9N2)) segment 1 polymerase PB2 (PB2) gene, complete cds.

and then make separate files named after the gi numbers, each containing the sequence next to it. Thanks again for your kind help.

Posted by admin (Graham Ellis), 14 October 2007
Actions inside the loop are repeated each time the loop is run, but actions outside the loop are performed just once. So if you want to create multiple output files, you need to have your open inside the loop for starters. And you need to have some variable in the name of the file you open, otherwise you'll simply keep overwriting the same file. You can see both of these features in my sample program, but not in your code.

This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.
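
The two points made in the last reply (open the output file inside the loop, and put a variable in the file name) can be illustrated with a short Perl sketch. This is not the program originally posted in the thread, only a minimal example under stated assumptions: header lines are taken to start with ">gi|" followed by digits, and the names split_fasta, gi_list.txt and gi_<number>.txt are hypothetical choices made for illustration. The files are named gi_<number>.txt rather than literally ">gi|55047815" because ">" and "|" are awkward in shell commands and are not allowed in Windows file names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a FASTA stream into one file per record and collect the short IDs.
# Assumptions (not from the thread): the output names gi_list.txt and
# gi_<number>.txt are hypothetical; headers look like ">gi|<digits>|...".
sub split_fasta {
    my ($in_fh, $out_dir) = @_;

    open(my $list_fh, '>', "$out_dir/gi_list.txt")
        or die "Cannot write gi_list.txt: $!";

    my $seq_fh;    # handle for the sequence file currently being written
    my @ids;

    while (my $line = <$in_fh>) {
        if ($line =~ /^(>gi\|(\d+))/) {
            # $1 is just the ">gi|110554" portion, $2 is the number alone
            my ($short_id, $number) = ($1, $2);
            print {$list_fh} "$short_id\n";
            push @ids, $short_id;

            # A header starts a new record. The open is inside the loop
            # and the file name contains a variable, so each record gets
            # its own file instead of overwriting a single one.
            close $seq_fh if $seq_fh;
            open($seq_fh, '>', "$out_dir/gi_$number.txt")
                or die "Cannot write gi_$number.txt: $!";
        }
        elsif ($seq_fh) {
            print {$seq_fh} $line;    # sequence line of the current record
        }
    }

    close $seq_fh if $seq_fh;
    close $list_fh;
    return @ids;
}

# When a file name is given on the command line, process it into
# the current directory.
if (@ARGV) {
    my $file = shift @ARGV;
    open(my $in_fh, '<', $file) or die "Cannot read $file: $!";
    split_fasta($in_fh, '.');
    close $in_fh;
}
```

Saved under a hypothetical name such as split_fasta.pl, this could be run as "perl split_fasta.pl FASTA.fa". The capture group in the regular expression is what keeps only the >gi|number portion of the header; everything after the number is ignored.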
PH: 01144 1225 708225 • FAX: 01144 1225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho