Exploring DNA Sequencing and Analysis#


The preservation, retrieval, and interpretation of genetic instructions are fundamental for the existence and functionality of living cells. These instructions play a crucial role in the formation and upkeep of living organisms. It is well-established that genetic data is housed within the deoxyribonucleic acid (DNA) present in all living organisms.

DNA is a discrete, linear code ubiquitously present in most of the cells within an organism, consisting of a string of characters drawn from a set of four options.

These characters are:

A - Adenine
C - Cytosine
G - Guanine
T - Thymine

They stand for the first letters with the four nucleotides used to construct DNA.

Each unique three character sequence of nucleotides, sometimes called a nucleotide triplet, corresponds to one amino acid. The sequence of amino acids is unique for each type of protein and all proteins are built from the same set of just 20 amino acids for all living things

Protein molecules dominate the behaviour of the cell serving as structural supports, chemical catalysts, molecular motors.

Central Dogma of Molecular Biology#

DNA serves as a master template for the production of all cellular proteins and other molecules, and it is responsible for controlling the cell’s growth and reproduction.

The genetic code embedded in the DNA sequence is read in groups of three nucleotides, or codons, which code for specific amino acids. These amino acids are then linked in a particular order to form a protein molecule.

Errors or mutations in the DNA sequence can significantly affect the organism, leading to genetic diseases, developmental disorders, or even cancer. However, some modifications can also provide a selective advantage and contribute to evolution by providing new traits that help an organism adapt to changing environmental conditions.

Task Description#


Task 1: Downloading DNA Strand as a Text File

  • Description: Download the DNA strand from a publicly available web-based repository of DNA sequences.

  • Subtasks:

      Identify the appropriate repository for downloading the DNA strand.
      Locate the DNA sequence of interest.
      Download the DNA sequence in the form of a text file.
    

Task 2: DNA Translation

  • Description: Write code to translate the DNA sequence to a sequence of amino acids using the genetic code.

  • Subtasks:

      Understand the genetic code and the process of DNA translation.
      Write a code that converts the DNA sequence to its corresponding amino acid sequence.
      Ensure the code works correctly for various DNA sequences.
    

Task 3: Checking the Solution

  • Description: Download the amino acid sequence and check whether it matches with the expected sequence.

  • Subtasks:

      Identify a publicly available web-based repository for downloading amino acid sequences.
      Locate the amino acid sequence corresponding to the downloaded DNA sequence.
      Download the amino acid sequence.
      Compare the translated amino acid sequence with the downloaded one to confirm the correctness of the code.
    

Two files are downloaded from the National Centre for Biotechnology Information (NCBI), including a DNA strand and its corresponding protein sequence of amino acids translated from the DNA.

import os
relative_path = './'

dna_file_path = os.path.join(relative_path,'dna_seq1.txt')
protein_file_path = os.path.join(relative_path,'protein_seq1.txt')

Subsequently, the program proceeds to read the files.

dna_file = open(dna_file_path,"r")

dna_seq = dna_file.read()
dna_seq
'GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCA\nGATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCT\nCCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCT\nTAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCT\nCAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTG\nAGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAA\nACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAA\nGGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGAT\nTTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCA\nGTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGA\nCCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTT\nTATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATT\nGCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGG\nTCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCTTT\nGCTAATACCATTAAATACTTTATTCCATAAATATGTTTTTAAAAGCTTGTATGAACAAGGTATGGTGCTC\nACTGCTATACTTATAAAAGAGTAAGGTTATAATCACTTGTTGATATGAAAAGATTTCTGGTTGGAATCTG\nATTGAAACAGTGAGTTATTCACCACCCTCCATTCTCT'

The file under consideration contains the \n character, as a result of copying and pasting text from a web browser to a text editor. This character effects the way the string gets printed.

print(dna_seq)
GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCA
GATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCT
CCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCT
TAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCT
CAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTG
AGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAA
ACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAA
GGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGAT
TTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCA
GTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGA
CCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTT
TATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATT
GCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGG
TCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCTTT
GCTAATACCATTAAATACTTTATTCCATAAATATGTTTTTAAAAGCTTGTATGAACAAGGTATGGTGCTC
ACTGCTATACTTATAAAAGAGTAAGGTTATAATCACTTGTTGATATGAAAAGATTTCTGGTTGGAATCTG
ATTGAAACAGTGAGTTATTCACCACCCTCCATTCTCT

To replace the newline character, the .replace() method can be used with two arguments. The first argument is the substring to be replaced and the second argument is the substring to replace the first argument.

The method replaces every occurrence of the first argument with the second argument. Since strings are immutable, the replace method returns a new string which must be assigned to a new variable.

dna_seq = dna_seq.replace("\n","")
dna_seq = dna_seq.replace("\r","")
print(dna_seq)
GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCAGATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCTCCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCTTAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCTCAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTGAGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAAACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAAGGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGATTTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCAGTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGACCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTTTATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATTGCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGGTCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCTTTGCTAATACCATTAAATACTTTATTCCATAAATATGTTTTTAAAAGCTTGTATGAACAAGGTATGGTGCTCACTGCTATACTTATAAAAGAGTAAGGTTATAATCACTTGTTGATATGAAAAGATTTCTGGTTGGAATCTGATTGAAACAGTGAGTTATTCACCACCCTCCATTCTCT

The process of translating the code into its protein sequence involves a table lookup operation.

In this case study, a dictionary object is utilized to aid in the translation process. The keys of the dictionary are strings, each consisting of three letters selected from a four-letter alphabet, while the value of each key is a string consisting of a single character, representing the corresponding amino acid.

table = {
    
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'
    
}

To verify whether the length of the DNA sequence is divisible by 3, the code must utilize the modulus operator with the value 3.

In the event that the length is not divisible by 3, the code should discard the last one or two nucleotides, since they will not correspond to a complete codon.

(len(dna_seq) % 3) == 0
False

The code snippet below iterates through the DNA sequence dna_seq starting from index 20 and icrements the index by 3 in each step, until reaching index 937 (inclusive).

For each set of three consecutive nucleotides in the DNA sequence, it looks up the corresponding amino acid using the table dictionary. If the amino acid is not a stop codon (represented by _), the amino acid is appended to the variable amino_string.

By the end of the loop, the amino_string will contain the amino acid sequences translated from the DNA sequences, excluding any stop codons.

Note The code assumes the existence of the table dictionary, which maps the codons to their corresponding amino acids. Without the definition of the dictionary, the code will raise a NameError.

amino_string = ""

for ind in range(20,938,3):
    
    codon = dna_seq[ind:ind+3]
    amino = table[codon]
    if amino != "_":
        amino_string = amino_string + amino
amino_string
'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC'

The following code reads in the protein sequence from the file that contains the translated amino acid sequence, removes any newline chracters \n from the sequence and prints the cleaned protein sequence.

The last line of code checks whether the variable amino_string is equal to the cleaned protein sequence protein_seq.

protein_file = open(protein_file_path,'r')

protein_seq = protein_file.read()
protein_seq = protein_seq.replace("\n","")

print(protein_seq)
MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC
amino_string == protein_seq
True

Functions#


The function read_sequence reads a text file and removes newline and carriage return characters from the string. It preprocesses the input file to prepare it for subsequent analysis.

def read_sequence(filename):
    """
    Reads a file and preprocess its content to remove newline and
    carriage return characters.

    Input
        filename (str) - the name of the file
    
    Returns
        out_string (str) - modified output string with newline and
        carriage return characters removed
    """
    
    with open(filename,'r') as file:
        string_seq = file.read()
        
        string_seq = string_seq.replace("\n","")
        string_seq = string_seq.replace("\r","")

        # close the file
        file.close()
        
        return string_seq

The lookup_table function takes a string of codons as input and returns the corresponding amino acid. It performs a table lookup to map each possible codon to the respective amino acid.

def lookup_table(input_codon):
    """
    Translate a given DNA codon into the corresponding amino acid.

    Input
        input_codon (str) - dna codon
    
    Returns
        amino (str) - translated amino-acid
    """
    
    table = {
    
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'
    
    }
        
    amino = table[input_codon]
        
    return amino
    

The translate function takes a string of DNA nucleotides as input and returns the corresponding amino acid sequence. The process involves dividing the DNA sequence into non-overlapping groups of three nucleotides (codons), which are then translated into amino acids using the lookup_table function.

In this context, the NCBI website offers the start codon index and stop codon index for a specific DNA sequence under the name CDS. Notably, the indexing for these sequences begins at 1.

Therefore, the start codon index needs to be reduced by 1 to align with the Python indexing that begins at 0.

dna_cds.png

def translate(filename):
    """
    Translate a specific DNA sequence from the given file into its
    corresponding protein sequence.

    Input
        filename (str) - name of file to be translated
    
    Returns
        amino_sequence (str) - translated protein sequence for the given file
    """
    
    dna_seq = read_sequence(filename)
    
    start = int(input("Enter starting position of codon: "))
    end = int(input("Enter stopping position of codon: "))
    print()
    
    translate_codon = dna_seq[start-1:end]
    
    amino_sequence = ""
    
    for ind in range(0,len(translate_codon),3):
        
        codon = translate_codon[ind:ind+3]
        amino = lookup_table(codon)
        
        if amino != "_":
            amino_sequence += amino
        
    return amino_sequence

To validate the functionality of the translate function, the returned value from the function is compared with the actual value of the protein sequence. Before comparison, the protein sequence undergoes modification as a result of being passed through the read_sequence function to ensure uniform preprocessing.

translate(dna_file_path) == read_sequence(protein_file_path)

True
print(translate(dna_file_path))
MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC
print(read_sequence(protein_file_path))
MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC

Conclusion#


This case study presented a comprehensive approach to translating a DNA sequence into it’s corresponding protein sequence using Python.

Three key functions were developed:

  • The read_sequence function effectively preprocessed the input file, removing newline and carriage return characters to ensure accuracte and consistent data handling.

  • The lookup_table function utilized a genetic code lookup table (dictionary) to map DNA codons to their respective amino acids, enabling effective translation.

  • The translate function successfully converted the specified DNA codon range into a protein sequence by utilizing the two previous functions.

Through this case study, I gained valuable insights into the translation of genetic information, aligning the NCBI start codon index with Python’s indexing, and ensuring the accurate protein sequence generation.

The implementation of these functions offers a practical solution for DNA-to-Protein Translation, provding a valuable tool for exploring genetic information and its impact on living organisms.