The following PPGP data can be downloaded from this site:
1. Raw Sequence Reads (adaptors and low-quality bases trimmed by Newbler software for 454 data or as indicated).
These datasets and builds are named according to the following system. All libraries are assumed to be transcriptome sequence unless otherwise indicated.
Libraries and sequence datasets are given a 7-10 character name composed of the following five parts, A to E.
- A. Genus (2 letters)
- Triphysaria = Tr
- Striga = St
- Orobanche = Or
- Lindenbergia = Li
- B. species (2 letters)
- versicolor = Ve
- hermonthica = He
- aegyptiaca = Ae
- philippensis = Ph
- C. Tissue stage (1-2 digits), following the PPGP tissue staging system:
- 0 - Seed germination
- 1 - Germinated seed; Radicle emerged; pre-haustorial growth
- 2 - Seedling after exposure to haustorial induction factors (HIFs)
- EARLY POST-ATTACHMENT
- 3 - Haustoria attached to host root; penetration stages, pre-vascular connection (~48 hrs)
- 4.1 - Early established parasite; parasite vegetative growth after vascular connection (~72 hrs)
- 4.2 - Spider stage
- LATE POST-ATTACHMENT
- 5.1 - Pre-emergence from soil - shoots
- 5.2 - Pre-emergence from soil - roots
- 6.1 - Vegetative structures; leaves/stems
- 6.2 - Reproductive structures; floral buds (up to anthesis)
- 6.3 - Late post-attachment - roots
- D. Sequencing Method (1 Capital letter)
- F = Roche 454 FLX [ave. 200-250 bp reads]
- T = Roche 454 Titanium [ave. 350-450 bp reads]
- R = Sanger
- G = Illumina GA2x, 81x81 bp paired-end or similar
- D = ABI SOLiD3 ca. 50 bp short reads
- E. Addenda (1 or more lower case letters or digits)
- n = normalized
- f = full length
- g = genomic sequence data
- a = amplified
- u = parasite grown unattached to host
- 2, 3, etc. = 2nd, 3rd, etc. library with this designation
Example - "OrAe51F" means this is a transcriptome library made from RNA extracted from Orobanche aegyptiaca shoots sampled prior to emergence from the soil, and sequenced using Roche 454 FLX technology.
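The naming system above can be decoded mechanically. The following is a minimal Python sketch, not an official PPGP tool: the function name `parse_dataset_name` and the returned dictionary keys are hypothetical, and the lookup tables contain only the codes listed on this page. It assumes a two-digit stage code such as "51" stands for stage 5.1, as in the example above, and it treats the stage digits as optional because some dataset names on this page (e.g. TrVeR) carry none.

```python
import re

# Codes copied from the naming system described on this page.
GENUS = {"Tr": "Triphysaria", "St": "Striga", "Or": "Orobanche", "Li": "Lindenbergia"}
SPECIES = {"Ve": "versicolor", "He": "hermonthica", "Ae": "aegyptiaca", "Ph": "philippensis"}
METHOD = {
    "F": "Roche 454 FLX",
    "T": "Roche 454 Titanium",
    "R": "Sanger",
    "G": "Illumina GA2x",
    "D": "ABI SOLiD3",
}
ADDENDA = {
    "n": "normalized",
    "f": "full length",
    "g": "genomic sequence data",
    "a": "amplified",
    "u": "parasite grown unattached to host",
}

# Parts A-E: genus, species, optional stage digits, method letter, addenda.
NAME_RE = re.compile(r"^([A-Z][a-z])([A-Z][a-z])(\d{1,2})?([FTRGD])([a-z0-9]*)$")


def parse_dataset_name(name):
    """Split a dataset name such as 'OrAe51F' into its labeled parts."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognizable dataset name: {name!r}")
    genus, species, stage, method, addenda = match.groups()
    # Assumption: two stage digits 'XY' encode stage X.Y (e.g. '51' -> '5.1').
    if stage is not None and len(stage) == 2:
        stage = f"{stage[0]}.{stage[1]}"
    digits = "".join(c for c in addenda if c.isdigit())
    return {
        # Unknown codes are passed through unchanged rather than guessed.
        "genus": GENUS.get(genus, genus),
        "species": SPECIES.get(species, species),
        "stage": stage,
        "method": METHOD[method],
        "addenda": [ADDENDA.get(c, c) for c in addenda if c.isalpha()],
        "library_number": int(digits) if digits else None,
    }
```

Calling `parse_dataset_name("OrAe51F")` returns the genus Orobanche, species aegyptiaca, stage "5.1", and method Roche 454 FLX, matching the worked example above.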
The raw reads will be available as fasta files and an associated quality score file (.fna and .qual). For example, the reads for the above Orobanche dataset [OrAe51F] will be available as OrAe51F.fna and OrAe51F.qual.
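For readers who want to work with the paired files directly, the sketch below shows one way to match each read in a .fna file with its scores in the corresponding .qual file. It assumes the usual 454-style layout, in which both files use FASTA-style ">" headers with the read identifier as the first word, and the .qual file holds whitespace-separated integer quality scores. The function names are hypothetical and the code has not been checked against the actual PPGP downloads.

```python
def read_fasta_records(lines, numeric=False):
    """Yield (read_id, payload) pairs from FASTA-style lines.

    With numeric=False the payload is the sequence string; with
    numeric=True it is a list of integer quality scores.
    """
    read_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if read_id is not None:
                yield read_id, _join(chunks, numeric)
            # Keep only the first word of the header as the identifier.
            read_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if read_id is not None:
        yield read_id, _join(chunks, numeric)


def _join(chunks, numeric):
    """Combine the wrapped lines of one record into a single payload."""
    if numeric:
        return [int(q) for chunk in chunks for q in chunk.split()]
    return "".join(chunks)


def pair_reads_with_quality(fna_lines, qual_lines):
    """Yield (read_id, sequence, scores), checking that lengths agree."""
    quals = dict(read_fasta_records(qual_lines, numeric=True))
    for read_id, seq in read_fasta_records(fna_lines):
        scores = quals.get(read_id)
        if scores is None or len(scores) != len(seq):
            raise ValueError(f"missing or mismatched quality scores for {read_id}")
        yield read_id, seq, scores
```

Both arguments can be open file handles, for example handles for OrAe51F.fna and OrAe51F.qual. The quality file is loaded into memory first, which is simple but may be costly for very large datasets.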
Unigenes will be available from each assembly. The reads will be assembled separately for each library, and a 'combined build' will be assembled from multiple libraries.
The assembly process does the following: 1) identifies reads that are similar enough over a long enough stretch of sequence that they can be assumed to be portions of transcripts from the same gene, 2) combines the sequences together in a form of multiple sequence alignment, 3) produces a consensus sequence from the multiple reads that is an estimate of the DNA sequence for that portion of the transcript [if assembling ESTs] or [if assembling genomic DNA] the genome. The assembly results in collections of reads that are connected by overlapping sequence data, termed "contigs," while reads that are not part of any contig are termed singletons. The contigs plus singletons emerging from the assembly process are collectively referred to as unigenes.
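The terminology can be illustrated with a small Python sketch. It is not the assembler itself: it assumes the assembly has already been run and that its output is available as a hypothetical mapping from contig identifier to a consensus sequence plus the list of member reads. The function names and data layout are illustrative only.

```python
def summarize_unigenes(all_reads, contigs):
    """Combine contigs and singletons into one unigene table.

    all_reads: dict mapping read_id -> sequence (every read in the library)
    contigs:   dict mapping contig_id -> (consensus_sequence, [member read_ids])
    Returns a dict mapping unigene_id -> (sequence, number_of_supporting_reads).
    """
    # Reads that were placed in any contig.
    assembled = {rid for _, members in contigs.values() for rid in members}
    unigenes = {
        contig_id: (consensus, len(members))
        for contig_id, (consensus, members) in contigs.items()
    }
    # Singletons are the reads left over; each one is a unigene of one read.
    for read_id, seq in all_reads.items():
        if read_id not in assembled:
            unigenes[read_id] = (seq, 1)
    return unigenes


def low_support(unigenes, max_reads=1):
    """Return the sorted ids of unigenes backed by at most max_reads reads."""
    return sorted(uid for uid, (_, n) in unigenes.items() if n <= max_reads)
```

With the default `max_reads=1`, `low_support` returns exactly the singletons. Raising the threshold is one simple way to flag unigenes built from only a few reads.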
3. Preliminary annotations
Annotations for each build are the results of a blastx search against the inferred protein sets of 10 sequenced plant genomes: five eudicots (Vitis vinifera (Vitvi1), Populus trichocarpa (Poptr1), Medicago truncatula (Medtr1; 60% complete), Carica papaya (Carpa1), and Arabidopsis thaliana (Arath7)), two grasses (Oryza sativa (Orysa5) and Sorghum bicolor (Sorbi1)), and three more distantly related outgroups (Selaginella moellendorffii (Selmo1; J. Banks, pers. comm.), Physcomitrella patens (Phypa1), and Chlamydomonas reinhardtii (Chlre3)). The results of these BLAST searches were used to categorize each unigene into SuperTribes, Tribes, and Ortho-groups (and subsequent GO categories, etc.) according to the PlantTribes database (Wall et al. 2008; http://fgp.bio.psu.edu/tribedb/index.pl). These sequence-based annotations should be considered preliminary at best because: 1) they are based on incomplete, but rapidly growing, cDNA sequence information for the parasites, 2) they are based on sequence similarity to genes in Arabidopsis or other organisms that may have an annotation, not functional evidence or genomic position in the parasite itself, and 3) we do not yet have any asterid genomes (Mimulus, tobacco, tomato) in the genome database.
When you search the unigene database either by BLAST or by keywords, you are finding the set of unigenes that have significantly similar alignments to a user-supplied target sequence or to pre-calculated alignments to the given genomic database. Singletons or unigenes composed of just a few reads are probably from genes that were weakly expressed in that tissue or stage. These consensus sequences will probably be short and subject to sequence error, whereas more highly expressed genes are likely to assemble into more complete contigs with more accurate sequence. As the datasets grow through the course of the project, coverage will increase and both the assemblies and the base calls will increase in overall accuracy. These issues are worth keeping in mind when considering the accuracy of pseudo-annotations.
All 454 assemblies were performed using MIRA 2.9.45 with the following parameters: -project=[project_name] -job=denovo,EST,draft,454 -notraceinfo -OUT:ora=yes:ota=yes -AS:ugpf=no -CL:cpat=yes -AL:egp=yes:ms=20:mrs=75 -SK:mnr=yes
Builds from individual libraries will have a name consisting of the sequence dataset name followed by the letter B and a digit indicating the build number. Although most assemblies will be done once, we may post additional builds for a given dataset if improved assembly software or strategies become available.
Example - "OrAe51FB1" is the first build of the sequence data produced from the Orobanche library OrAe51F.
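A build name can be split back into its dataset name and build number with a short Python sketch like the one below. The function name `split_build_name` is hypothetical, and the pattern covers only individual-library builds as described above; the naming of combined builds is not specified here and is not handled.

```python
import re

# Dataset name, then a capital 'B', then the build number at the very end.
BUILD_RE = re.compile(r"^(.+)B(\d+)$")


def split_build_name(build_name):
    """Return (dataset_name, build_number) for a name such as 'OrAe51FB1'."""
    match = BUILD_RE.match(build_name)
    if match is None:
        raise ValueError(f"not a recognizable build name: {build_name!r}")
    return match.group(1), int(match.group(2))
```

For the example above, `split_build_name("OrAe51FB1")` gives the dataset name "OrAe51F" and build number 1.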
Our goal is to produce a build for each library as soon as the data are available, and then to initiate a combined build for all of the available data from a species. Builds will be posted as they are completed. Combined builds will lag behind the individual library builds because of the size and complexity of very large assemblies. Therefore, it may be necessary to check the results from individual libraries as well as the most recent combined build in order to fully explore all of the available data.
We will typically provide full search capabilities for the most recent build for each library and the most recent combined build. Users interested in obtaining older "legacy" builds will be able to obtain them from the Downloads page.
5. Sanger datasets
The StHe51R dataset (Satoko Yoshida et al. 2010 - BMC Plant Biol and Science) was obtained from RIKEN PSC - Plant Immunity Research Group, Ken Shirasu Lab, and the assembly was performed using MIRA 3.2.1.
The TrVeR and TrPuRn datasets were obtained from the John Yoder Lab, and the assemblies were performed using MIRA 3.2.1.
NOTE TO ALL USERS - SOME HOST SEQUENCES ARE INCLUDED IN PARASITE DATA
In order to maximize our capture of parasite haustorial sequences we included some host root tissues along with the endophytic parasite tissues during harvest. As a result, several libraries contain host sequences, which may be falsely interpreted as horizontal gene transfer events or otherwise cause confusion to users. The relevant hosts for each library are listed in the Data Summary tables for each species (note that combined builds will also include host sequences from individual libraries). Please be aware of this when analyzing these data.