fragment of data extraction

Posted by rick (rick), 14 October 2007
Hi Graham, thanks for the previous help.
My problem now is that I want to extract data from one file and write it to another. However, I want to extract only part of each line, not the whole line, and at the moment the whole line gets copied rather than just the portion I want.
For example:
 >gi|110554|avian influenza virus type and groups

From lines of this type (they occur in the file followed by a gene sequence, and the number after gi varies) I want to extract only the >gi|110554 portion and store those in a separate file.
Then I want to make files named after these gi numbers, each containing the sequence that follows that gi number.

Like this:

>gi|70981541|ref|XM_741207.1| Aspergillus fumigatus Af293 amino acid transporter, putative (AFUA_4G01570) mRNA
ATGGGTTCCATGCACGAGGCAGGATCCCGTCCTGCTGCGGGTGCTGATATGGACACAGATCGTGTACATC

>gi|55047815|ref|XM_741207.1| Aspergillus fumigatus Af293 amino acid transporter, putative (AFUA_4G01570) mRNA
CTGAAGCTGTTTCCGATAACGAGCGTGACTTTGAGAAGCAGGACTCGAAACCAGAGTATCAGGATGCATT

So the file named >gi|55047815 will contain only the sequence that follows it.

I have tried many books for 2 days but really failed to do this.
Please help.

       

Posted by admin (Graham Ellis), 14 October 2007
Here's a program that reads a stream of data and puts each sequence into a separate file named after the sequence number.

Code:
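# Extraction of sequence data into separate files

use strict;
use warnings;

my $out;    # handle for the output file currently being written
while (<DATA>) {
       if (/^>(gi\|\d+)/) {
               # Header line: start a new file named after the gi number
               close $out if $out;
               open ($out, ">", $1) or die "Cannot write $1: $!";
       } elsif ($out) {
               print $out $_;
       }
}
close $out if $out;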

__END__
>gi|70981541|ref|XM_741207.1| Aspergillus fumigatus Af293 amino acid transporter, putative (AFUA_4G01570) mRNA
ATGGGTTCCATGCACGAGGCAGGATCCCGTCCTGCTGCGGGTGCTGATATGGACACAGATCGTGTACATC

>gi|55047815|ref|XM_741207.1| Aspergillus fumigatus Af293 amino acid transporter, putative (AFUA_4G01570) mRNA
CTGAAGCTGTTTCCGATAACGAGCGTGACTTTGAGAAGCAGGACTCGAAACCAGAGTATCAGGATGCATT



And my test output:

Code:
earth-wind-and-fire:~/oct07/rickdem grahamellis$ ls -l
total 24
-rw-r--r--  1 grahamel  grahamel   73 14 Oct 08:42 gi|55047815
-rw-r--r--  1 grahamel  grahamel   73 14 Oct 08:42 gi|70981541
-rw-r--r--  1 grahamel  grahamel  513 14 Oct 08:43 inda
earth-wind-and-fire:~/oct07/rickdem grahamellis$ cat 'gi|55047815'
CTGAAGCTGTTTCCGATAACGAGCGTGACTTTGAGAAGCAGGACTCGAAACCAGAGTATCAGGATGCATT

earth-wind-and-fire:~/oct07/rickdem grahamellis$


Hope it helps ... please do post up sample code if you have a follow up question; we're very much a forum that's set up to help with coding rather than to provide complete utilities!
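
For the first part of the question (keeping only the >gi|number portion of a header line), a regular-expression capture is one way to do it. The sketch below is an illustration rather than part of the original reply: the helper name gi_tag is hypothetical, and it assumes every header line starts with >gi| followed by digits.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# gi_tag is a hypothetical helper name, not from the thread.
# It returns the leading ">gi|NNN" part of a header line,
# or undef when the line is not a gi header.
sub gi_tag {
    my ($line) = @_;
    return $line =~ /^(>gi\|\d+)/ ? $1 : undef;
}

# Sample lines taken from the posts above.
my @lines = (
    '>gi|110554|avian influenza virus type and groups',
    'ATGGGTTCCATGCACGAGGCAGGATCCCGTCCTGCTGCGGGTGCTGATATGGACACAGATCGTGTACATC',
    '>gi|70981541|ref|XM_741207.1| Aspergillus fumigatus Af293 amino acid transporter, putative (AFUA_4G01570) mRNA',
);

for my $line (@lines) {
    my $tag = gi_tag($line);
    print "$tag\n" if defined $tag;   # sequence lines give undef and are skipped
}
```

Run as it stands, this prints >gi|110554 and >gi|70981541, one per line. Using \d+ rather than a fixed character count lets the gi number be any length.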

Posted by rick (rick), 14 October 2007
Hi Graham, thanks for the help. But this program is not working: when I used it, it did not create any file named after such a sequence.

The program I used before to copy the whole >gi lines is:

open (FH,"FASTA.fa") or die;
open (FHO,">d.txt") or die;
while (<FH>) {
  /^>gi/ and print FHO;
  }
close FH;
close FHO;


But it copies the whole line. Please modify it, or tell me how to separate out only the gi numbers, for example only >gi|33318110
and not >gi|33318110|gb|AF508640| /Avian/1(PB2)/H9N2/South Africa/1995/// Influenza A virus (A/Ostrich/South Africa/9508103/95(H9N2)) segment 1 polymerase PB2 (PB2) gene, complete cds.

Then I want to make separate files named after the gi numbers, each containing the sequence that follows.

Thanks again for your kind help.


Posted by admin (Graham Ellis), 14 October 2007
Actions inside the loop are repeated each time the loop is run, but actions outside the loop are performed just the once.  So if you want to create multiple output files, you need to have your open inside the loop for starters.  And you need to have some variable in the name of the file you open - otherwise you'll simply keep overwriting the same file.   You can see both of these features in my sample program, but not in your code.
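
Applied to a loop that reads from a named file, those two changes might look like the sketch below. It is only an illustration under assumptions: the sub name split_by_gi and its output-directory argument are hypothetical, headers are assumed to start with >gi| followed by digits, and the leading > is left out of the file names (as in the test listing earlier in the thread).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# split_by_gi is a hypothetical name, not from the thread.
# Reads a FASTA-style file and writes each sequence to its own file,
# named after the gi number, inside $out_dir. Returns the names used.
sub split_by_gi {
    my ($in_path, $out_dir) = @_;
    my @names;
    my $out;    # handle of the file currently being written, if any

    open(my $in, '<', $in_path) or die "Cannot read $in_path: $!";
    while (my $line = <$in>) {
        if ($line =~ /^>(gi\|\d+)/) {
            # New header: the open is inside the loop, and the captured
            # gi number makes the file name different each time.
            my $name = $1;
            close $out if $out;
            open($out, '>', "$out_dir/$name")
                or die "Cannot write $out_dir/$name: $!";
            push @names, $name;
        } elsif ($out && $line =~ /\S/) {
            # Non-blank line after a header: part of the current sequence.
            print {$out} $line;
        }
    }
    close $out if $out;
    close $in;
    return @names;
}

# Usage: perl split.pl FASTA.fa
if (@ARGV) {
    my @made = split_by_gi($ARGV[0], '.');
    print "wrote $_\n" for @made;
}
```

One thing to watch: the | character is not allowed in file names on Windows, so on that platform the open would fail. The "or die" makes that failure visible instead of silently creating nothing; replacing | with another character such as _ in $name would be one way around it.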



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.

© WELL HOUSE CONSULTANTS LTD., 2024: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho