Overview | Chapter 5: CpG

Exercises: Gene prediction

Exercise 4.1. Open Reading Frame (ORF) determination

Step 1: Download the genomic sequence of a gene with one exon (together with a total flanking sequence, right and left, of 1 kb)

For example: the ATOH1 gene

Hints and tricks:

Download ORFs UCSC

Step 2: Use EMBOSS getorf to determine ORFs in the sequence

What are the default settings for minsize and maxsize?

getorf -sequence atoh1.fa -minsize 100 -outseq atoh1_orf.fa
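To make concrete what an ORF search reports, the following is a minimal Python sketch. It is not getorf's implementation: it assumes an ORF runs from an ATG to the first in-frame stop codon on the forward strand only, whereas what getorf reports depends on its -find setting. The function name find_orfs and its min_len parameter are hypothetical names chosen for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=100):
    """Return (start, end, frame) tuples for ATG..stop ORFs on the forward strand.

    Assumption: coordinates are 0-based and end-exclusive, the stop codon is
    included in the ORF, and only the three forward frames are scanned.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if end - start >= min_len:
                    orfs.append((start, end, frame + 1))
                start = None  # look for the next ATG in this frame
    return orfs
```

Calling find_orfs on the sequence string read from atoh1.fa (without the FASTA header line) gives a list that can be compared by eye with the getorf output; differences are expected because the ORF definitions differ.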

Step 3: Use EMBOSS plotorf to plot the location of the identified ORFs in the sequence (graphtype = png)

plotorf -sequence atoh1.fa -graph png
plotorf

Exercise 4.2. Coding statistics

Step 1: Use EMBOSS syco to calculate the codon usage of the sequence from Exercise 4.1 (codon usage file for human: Ehum.cut)

ls -l /usr/share/EMBOSS/data/CODONS/*.cut
syco -sequence atoh1.fa -cfile Ehum.cut -graph png
Coding statistics plot: syco
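As a rough illustration of what "codon usage" means, the sketch below simply tallies codons in one reading frame. This is an assumption-laden simplification: syco compares the sequence against a reference table such as Ehum.cut and plots a statistic along the sequence, which this code does not attempt. The function names codon_counts and codon_frequencies are hypothetical.

```python
from collections import Counter


def codon_counts(seq, frame=0):
    """Count codons in a single reading frame (frame = 0, 1 or 2).

    Codons containing characters other than A, C, G, T (e.g. N) are skipped.
    """
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return Counter(c for c in codons if set(c) <= set("ACGT"))


def codon_frequencies(seq, frame=0):
    """Return the relative frequency of each codon observed in the frame."""
    counts = codon_counts(seq, frame)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}
```

Running codon_frequencies on an ORF found in Exercise 4.1 and on a stretch of flanking sequence gives two tables that can be compared with the human frequencies listed in Ehum.cut.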

Step 2: Use tcode for a combination of codon usage with periodicity scores

tcode -sequence mySeq.fa -outfile mySeq.tcode -window 200 -plot Y -graph png
Coding statistics plot: tcode

Exercise 4.3. GeneID

Step 1: Retrieve DNA sequence of the first Mb of human chromosome 20 (genome assembly hg18) via the command-line

Hints and tricks:

Try to get the FASTA sequence for this region also using the UCSC Genome Browser, the UCSC Table Browser or the Ensembl Genome Browser.

Step 2: Predict genes with GeneID and output in GFF file format

geneid -vP human3iso.param chr20_FirstMb.fa -G > chr20_FirstMb_geneid.gff

Step 3: Upload output GFF file as a custom track in the UCSC Genome Browser and compare with the UCSC Genes track

Hints and tricks:

Custom track color and name in the UCSC Genome Browser
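The name and colour of a custom track are set by a "track" line placed before the annotation lines. A minimal sketch of prepending such a line to GFF text is shown here; the function name add_track_line and the default name, description and colour values are assumptions chosen for illustration.

```python
def add_track_line(gff_text, name="GeneID",
                   description="GeneID predictions", color=(255, 0, 0)):
    """Return gff_text preceded by a UCSC custom-track header line.

    color is an (R, G, B) tuple with values from 0 to 255.
    """
    rgb = ",".join(str(c) for c in color)
    header = f'track name="{name}" description="{description}" color={rgb}'
    return header + "\n" + gff_text
```

The returned text can be pasted into the custom track upload form, or written to a file and uploaded from there.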

Exercise 4.4. GeneID

Use a 1 Mb sequence around the human URO-D gene to perform gene prediction (GeneID, Genscan) in this region. From the gene prediction output, create a custom track file in BED or GFF format with the correct genomic coordinates (Hint: use Excel, see Hints and tricks). Visualize your file as a track in the UCSC Genome Browser and compare it with the UCSC Genes and RefSeq Genes tracks.

Hints and tricks:

Exercise 4.5. Coordinate conversion

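Coordinate conversion can also be performed on the command line. First define a shell variable holding the offset, i.e. the number of bases by which the predicted coordinates must be shifted, for example:

offset=44750000

Then pass the variable to awk with -v (a shell variable is not visible inside a single-quoted awk program) and add it to the start and end columns:

cat geneid_predictions.gff | awk -v offset="$offset" '{ print "chr1",$2,$3,$4+offset,$5+offset,$6,$7,$8,$9}' | tr ' ' '\t'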

A Python script can also be used for coordinate conversion.

Python script for coordinate conversion
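A minimal sketch of such a script is given here; it is not the original course script. It assumes a GFF file whose columns 4 and 5 hold start and end positions relative to the downloaded sequence, and that the offset is the number of bases preceding that sequence on the chromosome. The function name shift_gff and the default chromosome name are hypothetical.

```python
def shift_gff(lines, offset, chrom="chr1"):
    """Yield GFF lines with start/end shifted by offset and the seqname replaced.

    Comment lines, blank lines and lines with fewer than five columns are
    passed through unchanged.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            yield line
            continue
        # Prefer tab-separated columns; fall back to any whitespace.
        fields = line.split("\t") if "\t" in line else line.split()
        if len(fields) < 5:
            yield line
            continue
        fields[0] = chrom
        fields[3] = str(int(fields[3]) + offset)  # start
        fields[4] = str(int(fields[4]) + offset)  # end
        yield "\t".join(fields)


# Example use (file names are placeholders):
# with open("geneid_predictions.gff") as src, open("shifted.gff", "w") as dst:
#     for out_line in shift_gff(src, offset=44750000):
#         dst.write(out_line + "\n")
```

Unlike the awk one-liner, this version keeps tabs as the column separator throughout, so a ninth column that itself contains spaces is left intact when the input is tab-separated.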

Exercise 4.6. Gene prediction accuracy

Step 1: Use the gene prediction track you have uploaded in Exercise 4.4.

Step 2: Compare your gene predictions to the UCSC Genes track by using intersections at the UCSC Table Browser (at the gene level, not at nucleotide or exon level)

How many GeneID predicted genes overlap with the UCSC Genes (i.e. How many true positives)?

How many and what kind of UCSC Genes are missed by GeneID (i.e. How many false negatives)?

How many genes are predicted by GeneID that are not UCSC Genes (i.e. How many false positives)?
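The three questions above amount to counting overlaps between two sets of gene intervals. The sketch below shows one way to do that counting outside the Table Browser, assuming each gene is reduced to a (start, end) pair on the same chromosome and that strand is ignored. The function names and the coordinates in the usage note are illustrative assumptions, not real annotation data.

```python
def overlaps(a, b):
    """True if two (start, end) intervals with inclusive ends share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


def gene_level_counts(predicted, reference):
    """Count gene-level true positives, false positives and false negatives.

    TP: predicted genes overlapping at least one reference gene.
    FP: predicted genes overlapping no reference gene.
    FN: reference genes overlapped by no predicted gene.
    """
    tp = sum(any(overlaps(p, r) for r in reference) for p in predicted)
    fp = len(predicted) - tp
    fn = sum(not any(overlaps(r, p) for p in predicted) for r in reference)
    return {"TP": tp, "FP": fp, "FN": fn}
```

For example, with the made-up intervals predicted = [(100, 200), (500, 600), (900, 950)] and reference = [(150, 250), (700, 800)], the function returns {"TP": 1, "FP": 2, "FN": 1}. Note that TP is counted from the prediction side, so several predictions hitting one reference gene each count once.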


