How to create a two-dimensional array?

Posted by yfpeng (yfpeng), 17 September 2002

As I promised I will ask you more questions.

I guess we can use two-dimensional arrays in Perl - I saw some examples but their elements are all hard-coded. I want to create an empty two-dimensional array first, then push elements into this array. The elements are paired, e.g. if I push a T into the first array (in the two-dimensional array), then it will push an A into the same position of the second array, so the 2-D array looks like:

TTGC... --the first array

AACG... --the second array

It is like a matrix. How do I do this in Perl? Thanks very much.

-Fred

Posted by admin (Graham Ellis), 17 September 2002

Perl uses "lists" rather than arrays - they can do everything an array can do in other languages (except give you problems when you run off the end), and lots more things besides ...

Amazingly, you don't have to tell Perl what's a list of lists (oh - sorry - that's what a "2 dimensional array" should be called). It works it out for itself from your code, and creates all necessary structure parts on the way. They call THAT auto-vivification.
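
A minimal sketch of what that means for the original question, assuming the bases arrive one at a time: start with an empty list and push onto each row, and Perl creates the inner lists the first time they are used. The names @matrix and %pair are made up for this illustration and are not part of the program further down.

```perl
use strict;
use warnings;

my @matrix;                                           # starts empty; no rows declared
my %pair = (A => 'T', T => 'A', C => 'G', G => 'C');  # hypothetical lookup table

foreach my $base (split //, "TTGC") {
    push @{ $matrix[0] }, $base;           # row 0 springs into existence here
    push @{ $matrix[1] }, $pair{$base};    # row 1 gets the partner, same column
}

print join("", @{ $matrix[0] }), "\n";     # TTGC
print join("", @{ $matrix[1] }), "\n";     # AACG
print "row 1, column 2 is $matrix[1][2]\n";
```

Neither row is declared before the first push; dereferencing $matrix[0] as a list is enough for Perl to create it.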

Your 2D array looks like a DNA sequence, with the first 'row' being a sequence, and the second row being the reverse complement (which no doubt you'll want to reverse later).

Here's a sample program that reads in a file containing a single FASTA sequence, splits it into a list, makes a second copy in reverse order, and switches As and Ts, Cs and Gs. It then puts it all into a list of lists, as per your question. Finally, it lists it out to "prove" that it's worked, using 2D array type notation.

Code:

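# Read a single FASTA record: the first line is the header, the rest is sequence
open (FH,"myfasta.txt") or die "Cannot open myfasta.txt: $!";
$header = <FH>;
undef $/;                            # slurp mode: read the rest of the file in one go
$sequence = <FH>;
close FH;
my @temp = split(/\s*/,$sequence);   # one base per element, whitespace dropped
my @rev = reverse(@temp);
foreach (@rev) {
    tr/ATGC/TACG/;                   # complement each base in place
}
$list[0] = \@temp;                   # row 0: the sequence
$list[1] = \@rev;                    # row 1: its reverse complement
for ($j=0; $j<=$#{$list[0]}; $j++) {
    for ($k=0; $k<=$#list; $k++) {
        print $list[$k][$j]," .. ";
    }
    print "\n";
}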

"More than simple lists and hashes" is a huge topic and I've
just started to scratch the surface here ... think I may have to
refer you to a more substantial set of material than I can
possibly write on the board if you want to take it further.

By the way ... output from my program was

Code:

A .. G ..
C .. G ..
A .. T ..
C .. C ..
G .. T ..
C .. C ..
T .. T ..
A .. A ..
T .. T ..
A .. A ..
T .. T ..
A .. A ..
G .. G ..
A .. C ..
G .. G ..
A .. T ..
C .. G ..
C .. T ..

with the input file being

> Demo
ACACGCTA
TATAGAGACC

Posted by yfpeng (yfpeng), 17 September 2002

Thanks for your neat and excellent code...by the way, do you charge for this? FYI, I am a graduate student in bioinformatics.

But you still have two predefined arrays (sequence and reverse sequence) for your list. I want the list to be empty first, then I will go through the sequence, check if there is an A, if so I push A into the first array and T into the second array of the same column...check for the rest of As and do the same. This procedure will go for C and G too. Then I will count how many A T pairs, how many C G pairs in the sequence.

Your code has given me some ideas, but I am still thinking about how to achieve this? Thanks.

Posted by admin (Graham Ellis), 18 September 2002

You ask "do we charge" ... no. I've posted some background to what we're about onto another of the boards here (it's of more general interest than just the Perl crew):

Other Topics
Assistance
The Reasoning being this board

I'll come back to the Perl question in an hour or two - I'm just in the middle of sorting out an email server that's got into some trouble ....

Posted by yfpeng (yfpeng), 18 September 2002

I was just kidding (for the excellent help you guys provide here).

Actually my last message was kind of silly, since we know A pairs T, C pairs G, why bother storing them in a list? What I wanted to do is store A, C, T, G in one row, and their occurrence in the other row, because I need to count the frequency of each one.

The list should look like:

row#1 (nucleotide) ATTCGAGTCT
row#2 (occurrence) 1111111111

Then count the occurrence of A (1+1=2), etc., and sort row #2 so that I know the one with the most frequency. Interesting bioinformatics question, isn't it? Thanks.

Posted by admin (Graham Ellis), 18 September 2002

Quote:

I was just kidding (for the excellent help you guys provide here).

I know ... but many a true word is spoken in jest. Folks come on the courses I give in the expectation that they'll not be able to ask me questions when the course is over - that's the way it works on other courses - and just wonder what my motivation is in providing email / board help.

Thanks for giving me (even in jest!) the opportunity to answer that! Now ... on to the Bioinformatics question. Separate message, I think?

Posted by admin (Graham Ellis), 18 September 2002

I might not be understanding the current question properly here - but I think it boils down to "how many each of C, A, G and T are there in a list of single characters?"

Let's see

Code:

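# @dna contains the incoming sequence, one character per element
foreach (@dna) {
    $counter{$_}++;      # one counter per distinct character
}
foreach ("A","C","G","T") {
    print ;              # no arguments, so this prints $_
    print " $counter{$_}\n";
}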

Just dashed that out - so there may be a couple of typos. I've used a hash with the keys being A C G and T to be my table of counters, then I've just stepped through the list in @dna.

Note the heavy use of $_ (did you know that print with no parameters prints the contents of $_?). When Perl 6 comes along, $_ will be even more powerful and important - expect to hear a lot about "Topicalization".

Posted by yfpeng (yfpeng), 18 September 2002

I did not phrase the problem clearly. Yes, we can use a counter for each character, but then I need to sort A, C, T, G by their occurrence. That is why I was thinking about using a list - the first row is each A, C, T, G (each one can occur multiple times in the sequence), the second row is occurrence (when I find one A in the sequence, I put 1 in the second row of the same column), then sort the first row in descending order according to the second row (occurrence or frequency). I do not know if we can do this in Perl - but I guess we can, since I have found that Perl can do a lot of "strange" things from a beginner's point of view.

Hopefully the problem is a little more clear.

Posted by yfpeng (yfpeng), 18 September 2002

It probably makes more sense to use a hash in this case, I am trying...
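
One way the hash approach might look, as a sketch only: count each base, then sort the hash keys by their counts, largest first. The sample string is the one quoted earlier in the thread; the names %count and @by_freq are assumptions made for this illustration.

```perl
use strict;
use warnings;

my @dna = split //, "ATTCGAGTCT";   # sample sequence from earlier in the thread

my %count;
$count{$_}++ foreach @dna;          # tally each base

# Sort bases by descending count; ties are broken alphabetically
my @by_freq = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

foreach my $base (@by_freq) {
    print "$base $count{$base}\n";
}
```

For this sample, T comes first with a count of 4, followed by A, C and G with 2 each. The hash keeps one counter per base, so no second row of 1s is needed.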

Posted by admin (Graham Ellis), 19 September 2002

I'm afraid I'm still not really clear on what the list(s) or hashes should look like in the end - I'm just a programmer and know nothing of bioinformatics. Earlier in the thread you wrote

Quote:

The list should look like:

row#1 (nucleotide) ATTCGAGTCT
row#2 (occurrence) 1111111111

Then count the occurrence of A (1+1=2), etc., and sort row #2 so that I know the one with the most frequency. Interesting bioinformatics question, isn't it? Thanks.

If you could tell me what the result would look like in this example, I might be better able to follow than I can at the moment.

This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.