Overview | Chapter 2: Pattern matching

Exercises: Pattern matching

Exercise 8.1. miRNA target genes discovery

miRNAs are noncoding RNAs that regulate gene expression at the post-transcriptional level. They bind messenger RNAs through a specific sequence of 8-10 nucleotides, which lowers the level of the encoded protein.

Lim et al. 2005 (http://www.nature.com/nature/journal/v433/n7027/full/nature03315.html) reported that transfecting the brain-specific miR-124 into HeLa cells changed their RNA expression profile: 174 genes were silenced or downregulated at the RNA level (gene list: Supplementary Table 2).

Looking for overrepresented sequences in these 174 genes, they discovered a recurrent motif, GTGCCTT, in the 3'UTR of most of the downregulated genes. This motif is believed to be the miR-124 recognition site.
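The link between the motif and the miRNA can be checked with a few lines of Python. The sketch below assumes that the mature miR-124 starts with the eight nucleotides UAAGGCAC (an assumption worth verifying in miRBase) and that the recognition site pairs with miRNA positions 2 to 8. The function name and its parameters are illustrative, not part of any library.

```python
# Assumption: 5' end (first eight nucleotides) of mature miR-124; verify in miRBase.
MIR124_5P_END = "UAAGGCAC"

# Watson-Crick pairing for RNA bases.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_match_dna(mirna_5p_end, start=2, end=8):
    """Return the DNA site complementary to miRNA positions start..end (1-based, inclusive)."""
    seed = mirna_5p_end[start - 1:end]
    # Reverse complement in RNA letters, then write it as DNA (U -> T).
    rev_comp = "".join(COMPLEMENT[base] for base in reversed(seed))
    return rev_comp.replace("U", "T")


print(seed_match_dna(MIR124_5P_END))
```

Under these assumptions the printed site can be compared directly with the GTGCCTT motif used in the rest of the exercise.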

We are interested in:

How many downregulated genes contain the miR-124 binding site?

Is this number higher than expected by chance?

Have these genes been already reported as miR-124 family target genes?

Is the miR-124 binding site conserved in evolution?

Step 1: Retrieving candidate miR-124 regulatory regions

miR-124 is known to bind the 3'UTR of messenger RNAs. Therefore, to assess which genes it binds, we need to retrieve the 3'UTR sequences of the candidate target genes.


Step 2: Looking for miR-124 recognition site

Once we have the candidate miR-124 target sequences, we need to assess how many genes contain the miRNA recognition sequence GTGCCTT.

The EMBOSS fuzznuc tool searches for patterns in nucleotide sequences. You can run the analysis through the web interface or from the command line:

fuzznuc -sequence mir-124-down-biomart-3UTR.fasta -pattern GTGCCTT -rformat gff -pmismatch 0 -outfile mir-124-down-biomart-3UTR-mir124.gff

Is MAPK14 a miR-124 target?

How many pattern matching hits are retrieved?

How many unique genes have at least one hit?
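One way to answer the counting questions is to tally the GFF output, where each non-comment line is one hit and the first tab-separated column is the sequence ID. The following is a minimal sketch: the example lines are made up for illustration (not real fuzznuc output), and the `gene|transcript` header layout is an assumption about how the FASTA file was exported.

```python
from collections import Counter


def count_gff_hits(gff_lines, gene_of=lambda seq_id: seq_id):
    """Count hits per gene from GFF lines; gene_of maps a sequence ID to a gene name."""
    hits = Counter()
    for line in gff_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF comments/directives
        seq_id = line.split("\t")[0]  # column 1 holds the sequence ID
        hits[gene_of(seq_id)] += 1
    return hits


# Illustrative lines only; IDs, coordinates and feature type are hypothetical.
example_gff = [
    "##gff-version 3",
    "geneA|tx1\tfuzznuc\tnucleotide_motif\t120\t126\t7\t+\t.\tID=hit1",
    "geneA|tx1\tfuzznuc\tnucleotide_motif\t300\t306\t7\t+\t.\tID=hit2",
    "geneA|tx2\tfuzznuc\tnucleotide_motif\t120\t126\t7\t+\t.\tID=hit3",
    "geneB|tx1\tfuzznuc\tnucleotide_motif\t45\t51\t7\t+\t.\tID=hit4",
]

per_sequence = count_gff_hits(example_gff)
# Assumption: the gene name is the part of the ID before the first "|".
per_gene = count_gff_hits(example_gff, gene_of=lambda s: s.split("|")[0])

print("total hits:", sum(per_sequence.values()))
print("sequences with a hit:", len(per_sequence))
print("genes with a hit:", len(per_gene))
```

For the real analysis, the lines would come from opening mir-124-down-biomart-3UTR-mir124.gff and passing the file object to `count_gff_hits`. Collapsing transcripts to genes matters because several 3'UTRs of the same gene can each contribute hits.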


Step 3: Assessing pattern enrichment

The presence of the sequence GTGCCTT in a gene does not by itself make it a miR-124 target gene. Is this motif overrepresented in our set of genes?

Could it happen by chance?

To assess the overrepresentation of the miR-124 recognition site in our set of genes we can use several controls:

Let's shuffle our input sequences:

shuffleseq -sequence mir-124-down-biomart-3UTR.fasta -shuffle 5 -outseq everydayimshuffling.fasta 

How many hits do we have in the shuffled set of sequences?

How could you determine whether your result is statistically significant?
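One possible approach is an empirical p-value: repeat the shuffling many times and ask how often the shuffled set contains at least as many motif-bearing sequences as the real set. The sketch below is a pure-Python illustration, not a replacement for shuffleseq and fuzznuc. It assumes the 3'UTR sequences are already loaded as a list of strings, and all function names are hypothetical. Like a plain shuffle, it preserves nucleotide composition but not dinucleotide frequencies.

```python
import random

MOTIF = "GTGCCTT"


def count_seqs_with_motif(seqs, motif=MOTIF):
    """Number of sequences containing the motif at least once (forward strand only)."""
    return sum(1 for seq in seqs if motif in seq.upper())


def shuffle_seq(seq, rng):
    """Return a random permutation of the sequence, keeping its base composition."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def empirical_p(seqs, n_shuffles=1000, seed=0, motif=MOTIF):
    """Return (observed count, empirical p-value) from n_shuffles shuffled sets."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = count_seqs_with_motif(seqs, motif)
    as_extreme = 0
    for _ in range(n_shuffles):
        shuffled = [shuffle_seq(seq, rng) for seq in seqs]
        if count_seqs_with_motif(shuffled, motif) >= observed:
            as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (as_extreme + 1) / (n_shuffles + 1)


# Toy input, made up for illustration.
toy_seqs = ["AAAAGTGCCTTAAAACCCC", "CCGTGCCTTGGAATTAACC", "ACGTACGTACGTACGTACG"]
observed, p_value = empirical_p(toy_seqs, n_shuffles=200)
print(observed, p_value)
```

The smallest p-value this procedure can report is 1 / (n_shuffles + 1), so the number of shuffles sets the resolution of the test.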

Bonus Step 4: Knowing TargetScan

We want to compare our results with TargetScan. TargetScan uses more advanced algorithms, which include conservation, to predict miRNA targets (Lewis et al. 2005). Look at the sites in the UTR of a gene that you predicted as a miR-124 target using fuzznuc (e.g. MAPK14).

What does the term "seed region" mean?

Do you find the same recognition site as the one in the exercise?

Is the recognition site conserved in bushbaby?

Solution Exercise 8.1. miRNA target genes discovery

