fragment of data extraction

Posted by rick (rick), 14 October 2007
Hi Graham, thanks for the previous help.
My problem now is that I want to extract data from one file and write it to another.
However, I want only part of each line, not the whole line. At the moment the whole line is copied, not just the portion I want.
For example:
 >gi|110554|avian influenza virus type and groups

From lines of this type (they occur in the file followed by a gene sequence, and the number after gi varies) I want to extract only the >gi|110554 portion and store these in a separate file.
I then want to create files named after these gi numbers, each containing the sequence that follows its gi number.

Like this:

>gi|70981541|ref|XM_741207.1| Aspergillus fumigatus Af293 amino acid transporter, putative (AFUA_4G01570) mRNA
ATGGGTTCCATGCACGAGGCAGGATCCCGTCCTGCTGCGGGTGCTGATATGGACACAGATCGTGTACATC

>gi|55047815|ref|XM_741207.1| Aspergillus fumigatus Af293 amino acid transporter, putative (AFUA_4G01570) mRNA
CTGAAGCTGTTTCCGATAACGAGCGTGACTTTGAGAAGCAGGACTCGAAACCAGAGTATCAGGATGCATT

So the file named >gi|55047815 will contain only the sequence that follows it.

I have searched through many books for two days but have not managed to do this.
Please help.

       

Posted by admin (Graham Ellis), 14 October 2007
Here's a program that reads a stream of data and puts each sequence into a separate file named after the sequence number.

Code:
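# Extraction of sequence data into separate files

my $out;        # handle of the file currently being written
while (<DATA>) {
       if (/^>(gi\|\d+)/) {
               # Header line: capture "gi|" plus the digits, whatever their length
               open ($out, '>', $1) or die "Cannot write $1: $!";
       } elsif ($out) {
               # Sequence (or blank) line: goes to the most recently opened file
               print $out $_;
       }
}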

__END__
>gi|70981541|ref|XM_741207.1| Aspergillus fumigatus Af293 amino acid transporter, putative (AFUA_4G01570) mRNA
ATGGGTTCCATGCACGAGGCAGGATCCCGTCCTGCTGCGGGTGCTGATATGGACACAGATCGTGTACATC

>gi|55047815|ref|XM_741207.1| Aspergillus fumigatus Af293 amino acid transporter, putative (AFUA_4G01570) mRNA
CTGAAGCTGTTTCCGATAACGAGCGTGACTTTGAGAAGCAGGACTCGAAACCAGAGTATCAGGATGCATT



and my test output

Code:
earth-wind-and-fire:~/oct07/rickdem grahamellis$ ls -l
total 24
-rw-r--r--  1 grahamel  grahamel   73 14 Oct 08:42 gi|55047815
-rw-r--r--  1 grahamel  grahamel   73 14 Oct 08:42 gi|70981541
-rw-r--r--  1 grahamel  grahamel  513 14 Oct 08:43 inda
earth-wind-and-fire:~/oct07/rickdem grahamellis$ cat 'gi|55047815'
CTGAAGCTGTTTCCGATAACGAGCGTGACTTTGAGAAGCAGGACTCGAAACCAGAGTATCAGGATGCATT

earth-wind-and-fire:~/oct07/rickdem grahamellis$
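
The gi numbers quoted in this thread vary in length (gi|110554 versus gi|70981541), so a pattern that matches the digits themselves, rather than a fixed number of characters, is the safer choice. Here is a minimal sketch of pulling out just that portion of a header line. The helper name gi_id is made up for illustration, and it assumes every header starts with >gi| followed by digits.

```perl
use strict;
use warnings;

# Hypothetical helper: return the "gi|NNN" identifier from a FASTA header
# line, or undef if the line is not a header of that form.
sub gi_id {
    my ($line) = @_;
    return $line =~ /^>(gi\|\d+)/ ? $1 : undef;
}

# Header lines taken from this thread
my @headers = (
    '>gi|110554|avian influenza virus type and groups',
    '>gi|70981541|ref|XM_741207.1| Aspergillus fumigatus Af293 amino acid transporter, putative (AFUA_4G01570) mRNA',
);

for my $h (@headers) {
    my $id = gi_id($h);
    print defined $id ? ">$id\n" : "no identifier found\n";
}
```

The \| is needed because a bare | means "or" inside a regular expression, and \d+ takes as many digits as are present.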


Hope it helps ... please do post up sample code if you have a follow up question; we're very much a forum that's set up to help with coding rather than to provide complete utilities!

Posted by rick (rick), 14 October 2007
Hi Graham, thanks for the help. But this program is not working: when I ran it, it did not create any files named after the sequences.

The program I used before, which copies the >gi lines in full, is:

open (FH,"FASTA.fa") or die;
open (FHO,">d.txt") or die;
while (<FH>) {
  /^>gi/ and print FHO;
  }
close FH;
close FHO;


But it copies the whole line. Please modify it, or tell me how to separate out only the gi numbers, that is, only >gi|33318110 and not
>gi|33318110|gb|AF508640| /Avian/1(PB2)/H9N2/South Africa/1995/// Influenza A virus (A/Ostrich/South Africa/9508103/95(H9N2)) segment 1 polymerase PB2 (PB2) gene, complete cds.

I then want to make separate files named after the gi numbers, each containing the sequence that follows.

Thanks again for your kind help.


Posted by admin (Graham Ellis), 14 October 2007
Actions inside the loop are repeated each time the loop is run, but actions outside the loop are performed just the once.  So if you want to create multiple output files, you need to have your open inside the loop for starters.  And you need to have some variable in the name of the file you open - otherwise you'll simply keep overwriting the same file.   You can see both of these features in my sample program, but not in your code.
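
Putting those two points together with the loop from the post above, a minimal sketch might look like this. The helper name split_by_gi and its optional output-directory argument are made up for illustration. FASTA.fa and d.txt are the file names from the earlier code, and the pattern assumes every header line starts with >gi| followed by digits.

```perl
use strict;
use warnings;

# Hypothetical helper: list the gi identifiers of a FASTA file in $list_file
# and write each sequence to its own file named after its identifier.
# Returns the identifiers in the order they were found.
sub split_by_gi {
    my ($fasta, $list_file, $dir) = @_;
    $dir = '.' unless defined $dir;

    open(my $in,   '<', $fasta)     or die "Cannot read $fasta: $!";
    open(my $list, '>', $list_file) or die "Cannot write $list_file: $!";

    my ($out, @ids);
    while (my $line = <$in>) {
        if ($line =~ /^>(gi\|\d+)/) {
            # Header line: record only the identifier, not the whole line ...
            print {$list} ">$1\n";
            push @ids, $1;
            # ... and open a new output file INSIDE the loop, named after it.
            close $out if $out;
            open($out, '>', "$dir/$1") or die "Cannot write $dir/$1: $!";
        }
        elsif ($out) {
            # Sequence line: goes to whichever file was opened last.
            print {$out} $line;
        }
    }
    close $out if $out;
    close $list;
    close $in;
    return @ids;
}

# File names as in the post above; only runs when the input file exists.
split_by_gi('FASTA.fa', 'd.txt') if -e 'FASTA.fa';
```

Each header line writes just the identifier to the list file and switches output to a new file. Every other line goes to whichever file was opened last. Note that | is not a legal character in Windows file names, so on that platform the open would be expected to fail with an error; replacing it with _ in the file name is one way round that.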



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.
