Release of the Medicinal Plant Consortium Transcriptome Resources

The Medicinal Plant Consortium (MPC) is an NIH-supported project (GM092521) consisting of 13 collaborating research units from 7 institutions. It is focused on providing transcriptomic and metabolomic resources for 14 key medicinal plants to the worldwide research community for the advancement of drug production and development.

The Principal Investigators of the MPC are Cornelius Barry (1), C. Robin Buell (8), Joe Chappell (2), Dean DellaPenna (1), Natalia Dudareva (3), Mahmoud A. ElSohly (4), A. Daniel Jones (1), Ikhlas A. Khan (4), Tom McKnight (5), Basil J. Nikolau (6), Sarah E. O’Connor (7), Troy Smillie (4), and Eve Syrkin Wurtele (6).

(1) Michigan State University, East Lansing, MI; (2) University of Kentucky, Lexington, KY; (3) Purdue University, West Lafayette, IN; (4) University of Mississippi, Oxford, MS; (5) Texas A&M University, College Station, TX; (6) Iowa State University, Ames, IA; (7) Massachusetts Institute of Technology, Cambridge, MA; (8) University of Georgia, Athens, GA

This announcement from the MPC pertains to the release of the final assembled transcriptomes on Oct. 7, 2011. Previously released transcriptome datasets, the first of which was released on March 21, 2011, have been superseded by the final assemblies. All the final sequence read datasets have been deposited at the NCBI Sequence Read Archive (SRA), where they can be found under the plant genus/species names. These are the raw sequence files, and specialized bioinformatics software is needed to assemble the sequences into contigs.

Assembled sequence files for each plant species are also available through the project's Search tools or can simply be downloaded from the project's file transfer site. Three files are available per plant species. Two are FASTA files providing the compiled DNA sequences for the assembled loci (locus = contig) of the final (noted as v_10072011) and the initial (v_03212011) transcriptomes. The third file is the ESTScan prediction of the best predicted protein (open reading frame) for each contig in the final assembly. Alternative open reading frames may be possible for a contig.
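For readers unfamiliar with the FASTA format used for the assembled sequence files, the following is a minimal Python sketch of how such a file could be read into memory. It is only an illustration: the function name `read_fasta` and the toy records `contig_a` and `contig_b` are hypothetical and are not taken from the MPC files, and the sketch assumes the first whitespace-delimited token of each header line is the contig name.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {name: sequence} dictionary.

    Assumption: the contig name is the first whitespace-delimited
    token after the leading '>' of each header line.
    """
    records: Dict[str, str] = {}
    name: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:].split()[0]
            chunks = []
        elif name is not None:
            # Sequence lines may be wrapped; collect and join them.
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Toy input standing in for a downloaded file (hypothetical records).
example_lines = """>contig_a
ACGTAC
GTTA
>contig_b
GGCC
""".splitlines()

contigs = read_fasta(example_lines)
for contig_name, sequence in contigs.items():
    print(contig_name, len(sequence))
```

To use the sketch with a downloaded file, an open file handle can be passed in place of `example_lines`, since a file object iterates over its lines.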

The currently available BLAST server provides search capability for the final transcriptome assemblies of the following plant species, with the release date noted:

  • Atropa belladonna sequence assembly (MSU v20111007)
  • Camptotheca acuminata sequence assembly (MSU v20111007)
  • Cannabis sativa sequence assembly (MSU v20111007)
  • Catharanthus roseus sequence assembly (MSU v20111007)
  • Digitalis purpurea sequence assembly (MSU v20111007)
  • Dioscorea villosa sequence assembly (MSU v20111007)
  • Echinacea purpurea sequence assembly (MSU v20111007)
  • Ginkgo biloba sequence assembly (MSU v20111007)
  • Hoodia gordonii sequence assembly (MSU v20111007)
  • Hypericum perforatum sequence assembly (MSU v20111007)
  • Panax quinquefolius sequence assembly (MSU v20111007)
  • Rauvolfia serpentina sequence assembly (MSU v20111007)
  • Rosmarinus officinalis sequence assembly (MSU v20111007)
  • Valeriana officinalis sequence assembly (MSU v20111007)

The final transcriptome resources were derived from multiple sequencing efforts. Paired-end Illumina reads from RNAs pooled from multiple tissue samples (typically 5 tissues) of a single reference plant were combined with single-end reads of cDNA prepared from RNA isolated from up to 20 different tissues or tissue treatments of each plant species. The majority of the cDNA libraries were normalized to reduce the abundance of highly expressed transcripts. The cDNA libraries were then sequenced on an Illumina Genome Analyzer II; the resulting sequences were cleaned (vector, adapter, low-quality, and short sequences removed) and assembled using Velvet (a short-read assembly program) and Oases (a program that uses the Velvet assemblies to identify transcript isoforms).
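To give a sense of what removing low-quality and short sequences can involve, here is a minimal Python sketch of one common style of read cleaning: trimming low-quality bases from the 3' end of a read and discarding reads that end up too short. This is not the MPC pipeline. The function name `clean_read` and the thresholds `min_qual=20` and `min_len=30` are assumptions chosen purely for illustration, and vector and adapter removal are not shown.

```python
from typing import List, Optional, Tuple


def clean_read(
    seq: str,
    quals: List[int],
    min_qual: int = 20,  # assumed quality cutoff, not from the MPC notice
    min_len: int = 30,   # assumed minimum read length, not from the MPC notice
) -> Optional[Tuple[str, List[int]]]:
    """Trim low-quality bases from the 3' end; drop reads that become too short.

    Returns the trimmed (sequence, qualities) pair, or None if the read
    should be discarded.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have equal length")
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the cutoff.
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    if end < min_len:
        return None  # read is too short after trimming
    return seq[:end], quals[:end]


# Toy reads (hypothetical data) with a small min_len so the effect is visible.
kept = clean_read("ACGTACGTAC", [30] * 7 + [10, 5, 2], min_len=5)
dropped = clean_read("ACGT", [30, 30, 5, 5], min_len=5)
print(kept)
print(dropped)
```

Dedicated read-trimming tools handle many more cases (adapter matching, sliding-window quality, paired reads); the sketch only shows the basic trim-then-filter idea.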

A tblastn search of the Digitalis purpurea transcriptome with the GenBank protein sequence of accession AAB08578 identifies several matches. For example, “dpa_dpa_locus_50008_iso_1_len_1529_ver_4” is a contig, or possible mRNA species, from Digitalis purpurea (dpa): it is isoform 1 of locus 50008 and comprises 1,529 bases in the final transcriptome assembly.
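The naming pattern in this example can be pulled apart programmatically. The Python sketch below assumes that every contig identifier follows the same layout as the single example given above, which the announcement does not state explicitly. The names `ContigId` and `parse_contig_id` are hypothetical, and the meaning of the trailing `ver_4` field is not described in this notice, so it is captured without interpretation.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred from the one example identifier in the text:
#   <prefix>_locus_<n>_iso_<n>_len_<n>_ver_<n>
CONTIG_ID = re.compile(
    r"^(?P<prefix>[A-Za-z_]+?)_locus_(?P<locus>\d+)"
    r"_iso_(?P<isoform>\d+)_len_(?P<length>\d+)_ver_(?P<version>\d+)$"
)


class ContigId(NamedTuple):
    prefix: str    # species code portion, e.g. "dpa_dpa"
    locus: int     # locus (contig) number
    isoform: int   # isoform number assigned by Oases
    length: int    # contig length in bases
    version: int   # trailing "ver" field; meaning not given in the notice


def parse_contig_id(identifier: str) -> Optional[ContigId]:
    """Split a contig identifier into its fields, or return None if it does not match."""
    match = CONTIG_ID.match(identifier)
    if match is None:
        return None
    return ContigId(
        prefix=match.group("prefix"),
        locus=int(match.group("locus")),
        isoform=int(match.group("isoform")),
        length=int(match.group("length")),
        version=int(match.group("version")),
    )


parsed = parse_contig_id("dpa_dpa_locus_50008_iso_1_len_1529_ver_4")
print(parsed)
```

Grouping parsed identifiers by `locus` is one simple way to collect all isoforms reported for the same locus among the BLAST matches.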

Note that the isoforms represent distinct or unique mRNAs as determined by Oases. These may reflect genuine isoforms but may also represent alleles, paralogs, homologs/homeologs, or errors in the DNA sequence or assembly process. The final assemblies may include errors such as gaps, insertions, deletions, frameshifts, and incorrect DNA base calls. While all reasonable efforts to capture mature mRNA transcripts have been applied, mRNAs that are not fully processed (i.e., still containing introns) and small amounts of genomic/organellar DNA may appear in these transcriptomes as well. Therefore, one should independently confirm the sequence information (e.g., by independent reverse transcription-PCR) of any locus identified in these assembled transcriptomes.

No future refinements in the assembled transcriptome resources are anticipated at this time.

Metabolome datasets for these same 14 medicinal plants are scheduled to be released beginning on Oct. 21, 2011 and to be completed by December 15, 2011.

To ensure receipt of future MPC data release notices and announcements, please register using the Mailing List sign-up form on the project website.

Further inquiries may be directed to Joe Chappell.