miRNAs are noncoding RNAs that regulate gene expression at the post-transcriptional level. They bind to messenger RNA through a short, specific sequence of about 6-8 nucleotides, which reduces the levels of the encoded protein.
Lim et al. 2005 (http://www.nature.com/nature/journal/v433/n7027/full/nature03315.html) reported that transfecting the brain-specific miR-124 into HeLa cells changed their RNA expression profile: 174 genes were silenced or downregulated at the RNA level (gene list: Supplementary Table 2).
Looking at sequence overrepresentation in the 174 genes, they discovered a recurrent motif, GTGCCTT, in the 3'UTR of most of the downregulated genes. This motif is believed to be the miR-124 recognition site.
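The link between this motif and the miRNA can be checked directly, since a recognition site is expected to be the reverse complement of part of the miRNA. A minimal sketch, assuming the mature miR-124 sequence is UAAGGCACGCGGUGAAUGCC (this sequence is not given in the exercise and should be verified against miRBase) and assuming nucleotides 2-8 are the relevant stretch:

```shell
# Assumed mature miR-124 sequence (not from the exercise text; check miRBase).
mir124="UAAGGCACGCGGUGAAUGCC"

# Take nucleotides 2-8 of the miRNA and convert RNA (U) to DNA (T).
seed=$(printf '%s\n' "$mir124" | cut -c2-8 | tr 'U' 'T')

# Reverse-complement that stretch to get the site expected in the 3'UTR.
site=$(printf '%s\n' "$seed" | rev | tr 'ACGT' 'TGCA')

echo "miRNA nt 2-8: $seed  expected 3'UTR site: $site"
```

Under these assumptions the expected site matches the motif reported above.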
We are interested in:
How many downregulated genes contain the miR-124 binding site?
Is this number higher than expected by chance?
Have these genes been already reported as miR-124 family target genes?
Is the miR-124 binding site conserved in evolution?
Step 1: Retrieving candidate miR-124 regulatory regions
miR-124 is known to bind to the 3'UTR of the messenger RNA. Therefore, to assess which genes it binds, we need to retrieve the 3'UTR sequences of the candidate target genes.
Hints and tricks:
Some genes are returned with "Sequence unavailable" instead of a sequence. To discard those lines, run the following command:
grep -v Sequence infile >outfile
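Note that grep -v removes only the placeholder line, so the FASTA header of each unavailable gene stays in the file with no sequence under it. A sketch of one way to drop the header as well, assuming each ">" header is followed on the next line by the literal text "Sequence unavailable". The names infile and outfile are the placeholders used above, and the sample records are made up for illustration:

```shell
# Toy input: made-up records for illustration only.
cat > infile <<'EOF'
>geneA
ACGTGTGCCTTAA
GGCC
>geneB
Sequence unavailable
>geneC
TTTTGGGG
EOF

# Buffer each record and print it only if none of its lines is the placeholder.
awk '
  /^>/ { if (rec != "" && keep) printf "%s", rec; rec = ""; keep = 1 }
  /^Sequence unavailable/ { keep = 0 }
  { rec = rec $0 "\n" }
  END { if (rec != "" && keep) printf "%s", rec }
' infile > outfile

# Number of records kept.
grep -c '^>' outfile
```

Records with wrapped (multi-line) sequences are kept intact because the whole record is buffered before it is printed.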
Step 2: Looking for miR-124 recognition site
Once we have the candidate miR-124 target sequences, we need to assess how many genes contain the miRNA recognition sequence GTGCCTT.
The EMBOSS fuzznuc tool searches for patterns in nucleotide sequences. You can perform the analysis via the web interface or from the command line:
fuzznuc -sequence mir-124-down-biomart-3UTR.fasta -pattern GTGCCTT -rformat gff -pmismatch 0 -outfile mir-124-down-biomart-3UTR-mir124.gff
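As a cross-check that does not need EMBOSS, the same exact-match search can be sketched with awk. This assumes the motif is searched on the given strand only and that a FASTA record may span several lines. The file name toy-3UTR.fasta and its sequences are made up for illustration:

```shell
# Toy FASTA file (invented sequences); geneA has its first site split across two lines.
cat > toy-3UTR.fasta <<'EOF'
>geneA
ACGTGTGC
CTTAAGTGCCTT
>geneB
TTTTGGGGCCCC
>geneC
GTGCCTTGTGCCTT
EOF

# Join wrapped sequence lines per record, then count motif matches per record.
# gsub() replaces each match with itself ("&") and returns the number of matches.
awk -v motif="GTGCCTT" '
  function report() { if (id != "") { n = gsub(motif, "&", seq); print id "\t" n } }
  /^>/ { report(); id = substr($1, 2); seq = ""; next }
  { seq = seq toupper($0) }
  END { report() }
' toy-3UTR.fasta > toy-counts.tsv

cat toy-counts.tsv

# Genes with at least one site.
awk '$2 > 0' toy-counts.tsv | wc -l
```

Joining the lines first matters: a line-by-line grep would miss a site that is split across a line break, as in the first toy record.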
Is MAPK14 a miR-124 target?
How many pattern matching hits are retrieved?
How many unique genes have at least one hit?
Hints and tricks:
Counting the unique genes with at least one hit from the command line:
grep -v '^#' mir-124-down-biomart-3UTR-mir124.gff | cut -f1 | sort -u | wc -l
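Total hits and unique genes can be reported side by side, which makes the difference between the two counts explicit. A sketch assuming the GFF output is tab-separated, with the sequence identifier in column 1 and comment lines starting with "#". The file toy.gff and its records are invented for illustration and do not reproduce real fuzznuc output:

```shell
# Toy GFF-like file (made-up records), tab-separated.
printf '##gff-version 3\n' > toy.gff
printf 'geneA\tfuzznuc\tmotif\t5\t11\t7\t+\t.\tnote=hit1\n' >> toy.gff
printf 'geneA\tfuzznuc\tmotif\t14\t20\t7\t+\t.\tnote=hit2\n' >> toy.gff
printf 'geneC\tfuzznuc\tmotif\t1\t7\t7\t+\t.\tnote=hit3\n' >> toy.gff

# Every non-comment line is one hit.
total=$(grep -vc '^#' toy.gff)

# Each distinct identifier in column 1 is one gene with at least one hit.
unique=$(grep -v '^#' toy.gff | cut -f1 | sort -u | wc -l | tr -d ' ')

echo "hits: $total  genes with at least one hit: $unique"

# Hits per gene, most hits first.
grep -v '^#' toy.gff | cut -f1 | sort | uniq -c | sort -k1,1nr
```

Whether a particular gene such as MAPK14 can be looked up by name in column 1 depends on which header attributes were exported with the sequences.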
Step 3: Assessing pattern enrichment
The presence of the sequence GTGCCTT in a gene does not by itself imply that the gene is a miR-124 target. Is this motif overrepresented in our set of genes?
Could it happen by chance?
To assess the overrepresentation of the miR-124 motif in our set of genes we can use several controls:
Let's shuffle our input sequences:
shuffleseq -sequence mir-124-down-biomart-3UTR.fasta -shuffle 5 -outseq everydayimshuffling.fasta
How many hits do we have in the shuffled set of sequences?
How could you determine whether your result is statistically significant?
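One common approach is an empirical p-value: repeat the shuffle-and-search many times, record how many genes have a hit in each shuffled set, and ask how often a shuffled set does at least as well as the real one. The sketch below assumes the per-shuffle counts have already been collected. The observed and shuffled counts are invented numbers, and empirical_p is a hypothetical helper name:

```shell
# empirical_p OBSERVED COUNT1 COUNT2 ...
# p = (1 + number of shuffled counts >= observed) / (1 + number of shuffles)
empirical_p() {
  observed=$1; shift
  printf '%s\n' "$@" | awk -v obs="$observed" '
    { n++; if ($1 + 0 >= obs + 0) ge++ }
    END { printf "%.4f\n", (1 + ge) / (1 + n) }'
}

# Invented example: 120 genes with a hit in the real set, 9 shuffled sets.
p=$(empirical_p 120 35 41 38 29 44 36 40 33 37)
echo "empirical p-value: $p"
```

With N shuffled sets the smallest attainable p-value is 1/(N+1), so a handful of shuffles cannot support a strong claim and hundreds or thousands are usually needed. If -shuffle 5 writes five shuffled copies of every input sequence into one file (check the shuffleseq documentation), hits in that file should be counted per copy rather than compared directly with the original set. An alternative control is to compare against the 3'UTRs of genes that were not downregulated, for example with Fisher's exact test.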
Bonus Step 4: Getting to know TargetScan
We want to compare our results to TargetScan. TargetScan uses more advanced algorithms, which take conservation into account, to predict miRNA targets (Lewis et al. 2005). Look at the sites in the UTR of a gene that you predicted as a miR-124 target using fuzznuc (e.g. MAPK14).
What is meant by the seed region?
Do you find the same recognition site as the one in the exercise?
Is the recognition site conserved in bushbaby?