Bioinformatics Exercises for 2nd year biomedical students 2017/2018

Part 2: Detection of CpG islands in a genomic sequence

"CpG" refers to a C nucleotide immediately followed by a G. The 'p' in 'CpG' refers to the phosphate group linking the two bases. Regions of genomic sequences rich in the CpG pattern or "CpG islands" are resistant to methylation and tend to be associated with genes which are frequently switched on. It's been estimated that about half of all mammalian genes, and, possibly all mammalian house-keeping genes, have a CpG-rich region around their 5' end. Non-mammalian vertebrates have some CpG islands that are associated with genes, but the association gets equivocal in the farther taxonomic groups. The detection of CpG island upstream of predicted exons or genes is evidence in support of a highly expressed gene.

The Cs of most CpG dinucleotides in the human genome are methylated. Methyl-C tends to mutate to T, and so CpG dinucleotides tend to decay to TpG / CpA. This is believed to account for the fact that in bulk human DNA CpG dinucleotides occur about five times less frequently than expected (Bird, 1980, Jones et al 1992). CpG islands are unmethylated regions of the genome that are associated with the 5' ends of most house-keeping genes and many regulated genes (Bird, 1986, Larsen et al 1992). The absence of methylation slows CpG decay, and so CpG islands can be detected in DNA sequence as regions in which CpG pairs occur at close to the expected frequency. The fact that CpG islands can be detected in this way indicates that the corresponding germline DNA has been substantially hypomethylated for an extended period of time, and in fact about 80% of CpG islands are common to man and mouse (Antequera and Bird 1993).

About 56% of human genes and 47% of mouse genes are associated with CpG islands (Antequera and Bird, 1993). Often CpG islands overlap the promoter and extend about 1000 base pairs downstream into the transcription unit. Identification of potential CpG islands during sequence analysis helps to define the extreme 5' ends of genes, something that is notoriously difficult with cDNA based approaches.

The aim of this exercise is to become familiar with the detection of CpG islands in a genomic sequence. You will use the EMBOSS tools available online.
Command line version of the necessary tools are also installed on the teaching server.

Step 1: Prediction of CpG islands

We will be looking for CpG islands in a 1Mb region upstream of the gene pyruvate kinase (HGNC symbol for this gene is PKLR; the unique UCSC identifier is uc010pga.1),
an essential protein in the glycolysis pathway.
TASKS:
  1. Download a 1Mb region upstream of the gene PKLR (chr1:155,265,228-155,270,792) (Importantly: check the directionality of the gene). (Hint: Use the UCSC genome browser.)
  2. Predict CpG islands using EMBOSS tool newcpgreport or cpgplot, using default parameters.
    Try out both tools; create and show png image from cpgplot.
    Hint: cpgplot will only provide a PNG image when you specify the graphtype as png.

Step 2: Compare your predictions with known CpG islands

Your are now going to check if your predicted results are correct. Most genome browser provide a track with known CpG islands in the genome. Use this track to check if your predicted results are biologically meaningful.

The GFF file provided by cpgplot can be uploaded in the genome browser. Caveat: use the following code snippet to map the coordinates of your CpG islands to locations in the whole genome:
cat output.gff | egrep -v '^#' | awk '{ print "chr1", $2, $3, $4+155270792, $5+155270792, $6, $7, $8, $9}' | tr ' ' '\t' > track.gff
TASKS:
  1. Upload the CpG island predictions as custom track in the UCSC Genome Browser.
  2. Compare your predictions with the CpG island track in the Genome Browser.

Step 3: Python script

To practice your newly acquired python programming skills, you are going to reimplement the algorithm used by the cpgplot program to decide if a given subregion of a sequence is a CpG island. By default, cpgplot defines a CpG island as a region where the calculated (%G + %C) content is over 50% (= the GC-content of the sequence) and the calculated Observed/Expected ratio is over 0.6. In addition, a CpG island must be longer than 200 basepairs. In summary:
  1. Observed/Expected ratio > 0.6
  2. %C + %G > 50%
  3. CpG island length > 200
The Observed number of CpG patterns in a window is simply the number of times a 'C' is found followed immediately by a 'G'.
The Expected number of CpG patterns is calculated for each window as the number of CpG dinucleotides you would expect to see in that window based on the frequency of C's and G's in that window. Thus, the Expected frequency of CpG's in a window is calculated as the number of 'C's in the window multiplied by the number of 'G's in the window, divided by the window length:
Expected = (number of C's * number of G's) / window length
Cpgplot uses a window of 100bp and runs along the entire sequence using this window to calculate the first two parameters for each basepair in the input sequence.
This method for the detection of CpG islands was first described in: GARDINER-GARDEN, M. AND FROMMER, M. (1987). CpG islands in vertebrate genomes. Journal of Molecular Biology 196, 261–282.
TASK:
Write a script that decides if a user supplied sequence is a CpG island or not, using the three criteria used by cpgplot. The 1Mb sequence downloaded in step one should not be used as input, instead the user of your program must be requested to provide a new DNA sequence. To simplify the exercise, your script does not need to use a sliding window, but can just calculate the three parameters for the entire input sequence.

Bram Van de Sande - LCB & KUL - 2017/2018   Bram.Vandesande (at) med.kuleuven (dot) be