change file contents to a character array

Posted by yfpeng (yfpeng), 14 September 2002

Hi guru:

I read a text file into an array, so each line of the file is one element of the array. Now I wish to convert this into a character array, where each character in the file is an element, except for the newline character at the end of each line (chop may remove that character). How can I achieve this? Thanks in advance.

Posted by John_Moylan (jfp), 14 September 2002

I'm sure Graham will find holes in my code...but!

Code:

#!/usr/bin/perl -w

use strict;

undef $/;
# slurp mode.
# Changes the input record separator $/ from "\n" to undef.
# Now when you read in a file (or in this case <DATA>) it's all in one scalar rather than an array of lines
# It saves a step in this case.

my $string = <DATA>; # $string holds the lot!

my @charArray = split(//, $string);
# split with null pattern will split on everything.
# so in this case it will be an array of characters

# now loop
foreach (@charArray) {

print "..."; # add spaces just to show it works
print ; # now print the array element
}

__DATA__
This
is
some
text

Use chomp() to get rid of newlines; it's safer, as it only removes a trailing newline, whereas chop() removes the last character whatever it is.
I suppose here you could use $string =~ tr/\n//d; or $string =~ s/\n//g; to get rid of the newlines in $string.
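
As a rough sketch of the difference between the three (the sample strings and variable names below are made up for illustration, and $/ is assumed to still hold its default "\n"):

```perl
use strict;
use warnings;

# Hypothetical sample data, for illustration only.
my $line = "ACGT\n";

# chomp removes a trailing newline (strictly, a trailing $/) and nothing else.
my $chomped = $line;
chomp $chomped;              # "ACGT"
chomp $chomped;              # still "ACGT": no newline left to remove

# chop removes the last character, whatever it is.
my $chopped = $line;
chop $chopped;               # "ACGT"
chop $chopped;               # "ACG": a real character has gone

# tr with the /d modifier deletes every newline, not just a trailing one.
# Without /d, tr/\n// only counts the newlines and deletes nothing.
my $slurped = "AC\nGT\n";
(my $stripped = $slurped) =~ tr/\n//d;   # "ACGT"

print "$chomped $chopped $stripped\n";
```

Note that chomp works on the end of a string only, so on a slurped scalar it cannot reach the newlines in the middle; that is where tr or s///g come in.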

Hope this works, and will the board administrator point out flaws in my code please.

jfp

Posted by admin (Graham Ellis), 14 September 2002

Nope, 'jfp', not going to find holes in your code. Should work just fine ....

You talk about chop and chomp removing end of line characters - agreed, but they don't apply to your example code, because there you're reading the whole thing as a single scalar.
$string =~ s/[\n\r]//g;
will get rid of all the newline and carriage return characters for you, and will work no matter whether your data originated on a PC, a Mac or a Unix / Linux system.
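
Putting that substitution together with the split from the earlier post might look something like this. It is only a sketch: the helper name chars_without_line_ends and the sample string are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a slurped string into a list of characters,
# dropping every line feed and carriage return first.
sub chars_without_line_ends {
    my ($string) = @_;
    $string =~ s/[\n\r]//g;      # removes \n, \r and \r\n line endings alike
    return split //, $string;    # empty pattern: one element per character
}

# Sample text with a mixture of PC, Unix and old Mac line endings.
my @chars = chars_without_line_ends("AC\r\nGT\nTA\r");
print join(",", @chars), "\n";
```

Because the argument is copied into $string, the caller's original scalar is left untouched.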

'Yfpeng', you're welcome. How did you find us? Looks to me like you might be doing some bioinformatics work - analysing DNA sequences in FastA format? For readers who haven't come across these, they're strings of C A G and T characters that can be Megabytes long, but they come in files at 72 per line!

Posted by John_Moylan (jfp), 14 September 2002

??How did you figure this board's latest member was from a bioinformatics background? Just curious really, as Perl seems to be "big" in this area, and I know nothing about it.

In fact, isn't there a book specifically for "Learning Perl for Bioinformatics" (erm, called that!)

Glad my stuff wasn't ripped apart, and hope it helped.

jfp

Posted by admin (Graham Ellis), 15 September 2002

Ah ... it's quite unusual to want to go through a string of characters one by one analysing each, even though you were forced to do so in older fashioned languages such as early "C" ... until higher level functions came along. You'll very often analyse sequences of numbers one by one (e.g. to add them for an average, to see if they're increasing, etc), but it's less common with characters. However, DNA and Amino Acid sequences are often represented as letters to be analysed one by one. Here's the description of FASTA format from http://www.ncbi.nlm.nih.gov/BLAST/fasta.html
Quote:

FASTA format description
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:

>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).

The nucleic acid codes supported are:

A --> adenosine           M --> A C (amino)
C --> cytidine            S --> G C (strong)
G --> guanine             W --> A T (weak)
T --> thymidine           B --> G T C
U --> uridine             D --> G A T
R --> G A (purine)        H --> A C T
Y --> T C (pyrimidine)    V --> G C A
K --> G T (keto)          N --> A G C T (any)
                          -  gap of indeterminate length

For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted amino acid codes are:

A  alanine                    P  proline
B  aspartate or asparagine    Q  glutamine
C  cystine                    R  arginine
D  aspartate                  S  serine
E  glutamate                  T  threonine
F  phenylalanine              U  selenocysteine
G  glycine                    V  valine
H  histidine                  W  tryptophan
I  isoleucine                 Y  tyrosine
K  lysine                     Z  glutamate or glutamine
L  leucine                    X  any
M  methionine                 *  translation stop
N  asparagine                 -  gap of indeterminate length

So, "jfp", it's an educated guess on my part that this is a Bioinformatics question - please, original poster, can you confirm or deny my guess?

By the way - in a language like Java where you hold non-changing strings as String objects, and strings you're building up and manipulating as StringBuffers, you (exceptionally) should use arrays of chars for handling sequence data such as FASTA.

Thinks ... now can I cross-post this to the Java board

Posted by yfpeng (yfpeng), 16 September 2002

Graham is exactly right - I just started to do some bioinformatics work, I did not use Perl much before but it is certainly a very good language to process sequence data made of ACTG, etc.  

I did not have a chance to test jfp's code yet, but it should work, thanks. I found this site from Perl.org. I will ask you questions in the future for sure.

regards,
-Fred

Posted by yfpeng (yfpeng), 16 September 2002

Jfp's code works perfectly. Graham's guess is 100% correct. I am reading a FASTA format file, so I need to remove the first line in the file, because the first line in a FASTA file is a description of the sequence below, rather than the sequence itself. I cannot think of a way to do this in jfp's code. If I read the file into an array (not a scalar as jfp did), I can easily get rid of the first line by setting the first element of the array to be empty.

thanks.
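
For reference, the array-of-lines route described in this post could be sketched as below. The @lines contents are invented sample data standing in for the result of reading a filehandle in list context, and shift is used here in place of emptying the first element.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for:  my @lines = <FH>;
my @lines = (">example description line\n", "ACGT\n", "TTAG\n");

shift @lines;          # discard the description line altogether
chomp @lines;          # remove the trailing newline from every element
my @chars = map { split //, $_ } @lines;   # flatten each line into characters

print scalar(@chars), " characters: @chars\n";
```

chomp only removes "\n" here, so a file with PC line endings would still leave a "\r" on each line; the s/[\n\r]//g substitution from earlier in the thread covers that case.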

Posted by admin (Graham Ellis), 16 September 2002

Read in the first line using something like
$header = <FH>;
before you
undef $/;

It's often mis-stated that once $/ has been undef'd, the whole file is read when you do a <FH>. That's not quite true, as it reads all the rest of the file from where the current file pointer is - exactly what you need if the file contains a single FASTA sequence!
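
A minimal sketch of that order of operations follows. Two things in it are assumptions made for the example rather than part of the thread: the FASTA text is held in a string and opened as an in-memory file so the snippet is self-contained, and local $/ inside a block is used instead of a bare undef $/ so that the separator is restored afterwards.

```perl
use strict;
use warnings;

# Hypothetical FASTA content; with a real file you would open its name
# instead of the reference \$fasta.
my $fasta = ">example description line\nACGT\nTTAG\n";
open(my $fh, '<', \$fasta) or die "Cannot open in-memory file: $!";

my $header = <$fh>;        # $/ is still "\n", so this reads one line only
chomp $header;

my $sequence;
{
    local $/;              # $/ is undef inside this block only
    $sequence = <$fh>;     # everything from the current position to the end
}
close $fh;

$sequence =~ s/[\n\r]//g;  # strip the line endings as suggested earlier
my @chars = split //, $sequence;

print "$header\n";
print scalar(@chars), " characters\n";
```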

This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.