Using Raku in Bioinformatics

Reading DNA and counting bases in Raku

Read the fasta file and count bases in DNA.

Finding length of DNA

Raku handles all strings as UTF-8 by default. The chars function in a string returns the length of the string.

# Given a string, this function prints the length of the string

sub print_sequence_length(Str $seq) {
    return chars($seq)
}

put print_sequence_length("ATTTCCGCG");

# Other approach using low level nqp

sub print_sequence_length_with_nqp(Str $seq) {
    use nqp;
    return nqp::chars($seq)
}

put print_sequence_length_with_nqp("ATTTCCGCG");
# Output

9
9

Counting occurrence of each character in a string

say "nepalnepali".comb.Bag.hash
# Output

{a => 2, e => 2, i => 1, l => 2, n => 2, p => 2}

Real life example

We will read a DNA sequence given in a fasta format.

DNA is a molecule composed of two polynucleotide chains that coil around each other to form a double helix carrying genetic instructions for the development, functioning, growth and reproduction of all known organisms and many viruses. Alongside proteins, lipids and complex carbohydrates (polysaccharides), nucleic acids are one of the four major types of macromolecules that are essential for all known forms of life.

The two DNA strands are known as polynucleotides. Each strand is composed of many simpler monomeric units called nucleotides. Each nucleotide contains deoxyribose sugar, phosphate and nitrogen-containing nucleobase (or often simply base). Though the deoxyribose and phosphate are same in all nucleotides, the nucleobase can be any of these four: adenine [A], guanine [G], cytosine [C] and thymine [T]. The DNA sequence thus is represented with four letters A, G, C and T.

FASTA format is a text-based format for representing nucleotide sequences. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (“>”) symbol. It is recommended that all lines of text be shorter than 80 characters in length.

An example sequence in FASTA format is:

>HSBGPG Human gene for bone gla protein (BGP)
GGCAGATTCCCCCTAGACCCGCCCGCACCATGGTCAGGCATGCCCCTCCTCATCGCTGGGCACAGCCCAGAGGGT
ATAAACAGTGCTGGAGGCTGGCGGGGCAGGCCAGCTGAGTCCTGAGCAGCAGCCCAGCGCAGCCACCGAGACACC
ATGAGAGCCCTCACACTCCTCGCCCTATTGGCCCTGGCCGCACTTTGCATCGCTGGCCAGGCAGGTGAGTGCCCC
CACCTCCCCTCAGGCCGCATTGCAGTGGGGGCTGAGAGGAGGAAGCACCATGGCCCACCTCTTCTCACCCCTTTG
GCTGGCAGTCCCTTTGCAGTCTAACCACCTTGTTGCAGGCTCAATCCATTTGCCCCAGCTCTGCCCTTGCAGAGG
GAGAGGAGGGAAGAGCAAGCTGCCCGAGACGCAGGGGAAGGAGGATGAGGGCCCTGGGGATGAGCTGGGGTGAAC
CAGGCTCCCTTTCCTTTGCAGGTGCGAAGCCCAGCGGTGCAGAGTCCAGCAAAGGTGCAGGTATGAGGATGGACC
TGATGGGTTCCTGGACCCTCCCCTCTCACCCTGGTCCCTCAGTCTCATTCCCCCACTCCTGCCACCTCCTGTCTG
GCCATCAGGAAGGCCAGCCTGCTCCCCACCTGATCCTCCCAAACCCAGAGCCACCTGATGCCTGCCCCTCTGCTC
CACAGCCTTTGTGTCCAAGCAGGAGGGCAGCGAGGTAGTGAAGAGACCCAGGCGCTACCTGTATCAATGGCTGGG
GTGAGAGAAAAGGCAGAGCTGGGCCAAGGCCCTGCCTCTCCGGGATGGTCTGTGGGGGAGCTGCAGCAGGGAGTG
GCCTCTCTGGGTTGTGGTGGGGGTACAGGCAGCCTGCCCTGGTGGGCACCCTGGAGCCCCATGTGTAGGGAGAGG
AGGGATGGGCATTTTGCACGGGGGCTGATGCCACCACGTCGGGTGTCTCAGAGCCCCAGTCCCCTACCCGGATCC
CCTGGAGCCCAGGAGGGAGGTGTGTGAGCTCAATCCGGACTGTGACGAGTTGGCTGACCACATCGGCTTTCAGGA
GGCCTATCGGCGCTTCTACGGCCCGGTCTAGGGTGTCGCTCTGCTGGCCTGGCCGGCAACCCCAGTTCTGCTCCT
CTCCAGGCACCCTTCTTTCCTCTTCCCCTTGCCCTTGCCCTGACCTCCCAGCCCTATGGATGTGGGGTCCCCATC
ATCCCAGCTGCTCCCAAATAAACTCCAGAAG

Now we will write script in Raku to read a DNA fasta file (with extension .fna) and count the number of occurrences of A, G, C and T. This post is partially derived from here.

# Function to open and parse file
# Returns a hash

sub open_and_parse_fasta($filepath) {
    my @lines = $filepath.IO.lines;
    my %FASTA_hash; 
    my $FASTA_label = "";
    for @lines {
        my $line = $_.trim-trailing; 
        if $line.starts-with(">") {
            $FASTA_label = $line.substr(1,);
            %FASTA_hash{$FASTA_label} = ""; 
        } else {
            %FASTA_hash{$FASTA_label} = %FASTA_hash{$FASTA_label} ~ $line;
        }
    }
    return %FASTA_hash   
}



# Faster alternative of open_and_parse_fasta function
# Help sought from AlexDaniel on IRC #raku channel, code here: https://gist.github.com/AlexDaniel/de2ea5b684e3d630b803028940c80bf0


sub open_and_parse_fasta($filepath) {
    my %FASTA_hash;
    my $FASTA_label = "";
    for $filepath.IO.lines.map(*.trim-trailing) {
        if .starts-with(">") {
            $FASTA_label = .substr(1,);
        } else {
            %FASTA_hash{$FASTA_label}.push: $_;
        }
    }
    %FASTA_hash{$_} .= join for %FASTA_hash.keys;
    return %FASTA_hash
}

# Function to count base (nucleobases) occurrence and print results
# Whenever you want to count things in a collection, the rule of thumb is to use the Bag structure

sub count_bases($file, $description) {
  # run our open_and_parse_fasta function and return a FASTA_hash
  my %FASTA = open_and_parse_fasta($file);
  # select the description of the sequence we want to look at from the FASTA_hash
  my $sequence = %FASTA{$description};
  # counts and prints the bases in the genome
  $sequence.comb.Bag.hash
}

We will read this fasta file. This is the entire genome of the bacteria Candidatus Carsonella ruddii which has one of the smallest genomes found at a 159,662 base pairs of DNA and 182 protein-encoding genes. The description in this file is NC_008512.1 Candidatus Carsonella ruddii PV DNA, complete genome.

# Call the function 
say count_bases("GCF_000010365.1_ASM1036v1_genomic.fna", "NC_008512.1 Candidatus Carsonella ruddii PV DNA, complete genome");

# Output

{A => 66734, C => 13501, G => 12946, T => 66481}

Lets read another fasta file with two description lines.

  • HSBGPG Human gene for bone gla protein (BGP)
  • HSGLTH1 Human theta 1-globin gene

Running the count_bases function on this file with two different descriptions will give the count of bases for each description.

say count_bases("human_fasta.fna","HSBGPG Human gene for bone gla protein (BGP)");

say count_bases("human_fasta.fna","HSGLTH1 Human theta 1-globin gene");
# Output

{A => 217, C => 409, G => 373, T => 232}
{A => 146, C => 362, G => 357, T => 155}

Using functional programming

Base counting can also be achieved by using functional programming in Raku with map function. Lets count the bases again in functional way with same fasta file (the first one) which we have already solved above.

# Function to open and parse file

sub open_and_parse_fasta($filepath) {
    my @lines = $filepath.IO.lines;
    my %FASTA_hash; 
    my $FASTA_label = "";
    for @lines {
        my $line = $_.trim-trailing; 
        if $line.starts-with(">") {
            $FASTA_label = $line.substr(1,);
            %FASTA_hash{$FASTA_label} = ""; 
        } else {
            %FASTA_hash{$FASTA_label} = %FASTA_hash{$FASTA_label} ~ $line;
        }
    }
    return %FASTA_hash   
}


# Given a string this function returns a hash with count of each character in string
# This function has another function inside
# Uses functional programming with map function

sub count_bases_second_approach($file, $description) {
    # Read fasta file and retrieve a hash with description as key and sequence as value
    my %FASTA = open_and_parse_fasta($file);
    # Select the sequence we want to look at from the hash by providing key, the key is description of the sequence
    my $sequence = %FASTA{$description};
    my %base_hash;
    sub count_each_char($char) {
        if %base_hash{$char}:exists {
            %base_hash{$char} = %base_hash{$char} + 1
        } else {
            %base_hash{$char} = 1
        }
    }
    $sequence.comb.map(&count_each_char);
    return %base_hash
}


# Call the function

say count_bases_second_approach("GCF_000010365.1_ASM1036v1_genomic.fna", "NC_008512.1 Candidatus Carsonella ruddii PV DNA, complete genome");

# Output

{A => 66734, C => 13501, G => 12946, T => 66481}

These functions can also be implemented as methods by creating a class and putting these functions inside the class. That is left as an academic exercise to the reader.

comments powered by Disqus

Related