Step 1: Download the genomic sequence of a gene with one exon (together with a total flanking sequence, right and left, of 1 kb)
For example: ATOH gene
Hints and tricks:
Step 2: Use EMBOSS getorf
to determine ORFs in the sequence
What are the default settings for minsize and maxsize?
getorf -sequence atoh1.fa -minsize 100 -outseq atoh1_orf.fa
Hints and tricks:
Count the number of ORFs
cat atoh1.fa | grep '>' | wc -l
Step 3: Use EMBOSS plotorf
to plot the location of the identified ORFs in the sequence (graphtype = png)
plotorf -sequence atoh1.fa -graph png
Step 1: Use EMBOSS syco
to calculate the codon usage on the sequence from exercise 4.1. (codon usage file for human: Ehum.cut)
ls -l /usr/share/EMBOSS/data/CODONS/*.cut
syco -sequence atoh1.fa -cfuke Ehum.cut -graph png
Step 2: Use tcode
for a combination of codon usage with periodicity scores
tcode -sequence mySeq.fa -outfile mySeq.tcode -window 200 -plot Y -graph png
Step 1: Retrieve DNA sequence of the first Mb of human chromosome 20 (genome assembly hg18) via the command-line
Hints and tricks:
wget
in the command-line with the above link to download files from a web sitegunzip
on the command-lineextractseq
to extract the first 1 Mb of the sequenceTry to get the FASTA sequence for this region also using the UCSC Genome Browser, the UCSC Table Browser or the Ensembl Genome Browser.
Step 2: Predict genes with GeneID and output in GFF file format
geneid -vP human3iso.param chr20_FirstMb.fa -G > chr20_FirstMb_geneid.gff
Hints and tricks:
locate human3iso.param
in the command-line (location: /opt/geneid/param/human3iso.param)Step 3: Upload output GFF file as a custom track in the UCSC Genome Browser and compare with the UCSC Genes track
Hints and tricks:
To specify you track name and color, change track settings in the UCSC Genome Browser (Manage tracks) or add the following to the first line of your GFF file (in a text editor):
track name='GeneID' color=255,0,0 description='User Supplied Track'
Use a 1 Mb sequence around the human URO-D gene to perform gene prediction (GeneID, Genscan) in this region. From the gene prediction output create a custom track file in BED or GFF format with the correct genomic coordinates (Hint: use Excel, see Hints and tricks). Visualize your file as a track in the UCSC Genome Browser and compare with UCSC Genes and RefSeq genes track.
Hints and tricks:
For coordinate conversion the GFF output needs to be saved as a text file. Copy the GFF output into a text editor and save as text file. Open this text file with Excel.
A text import window will pop up, continue with default settings and press Finish.
Remove lines with #: Select all data -> Data Sort (descending on column G (=start position)). Lines with # will be at the bottom of the file -> delete
Add the genomic location (minus 1) to the start position in the file. Use the fill handle in Excel (Place your mouse in the lower right corner and drag down to copy)
Add the genomic location (minus 1) to the end position in the file.
Change the name of the chromosome (first column) into the chromosome number.
Copy whole data table and paste in text editor (Notepad++) -> Save as GFF file
Coordinate conversion can also be performed through the command-line. Use the awk
statement:
Define a variable with a position in the command-line for example:
offset=44750000
cat geneid_predictions.gff | awk '{ print "chr1",$2,$3,$4+offset,$5+offset,$6,$7,$8,$9}'| tr' ' '\t'
A Python script can also be used for coordinate conversion.
Step 1: Use the gene prediction track you have uploaded in Exercise 4.4.
Step 2: Compare your gene predictions to the UCSC Genes track by using intersections at the UCSC Table Browser (at the gene level, not at nucleotide or exon level)
How many GeneID predicted genes overlap with the UCSC Genes (i.e. How many true positives)?
How many and what kind of UCSC Genes are missed by GeneID (i.e. How many false negatives)?
How many genes are predicted by GeneID that are not UCSC Genes (i.e. How many false positives)?