[leish-l] Tri-Tryp sequencing update

Chris Peacock csp at sanger.ac.uk
Wed Mar 3 13:05:56 BRT 2004


Dear Colleagues,

 

On behalf of the Tri-Tryp Sequencing Consortium (Karolinska
Institute, Seattle Biomedical Research Institute, The Institute for
Genomic Research and The Wellcome Trust Sanger Institute), we are
pleased to bring you an update on the progress of the sequencing and
annotation of the genomes of Trypanosoma brucei, Trypanosoma cruzi
and Leishmania major. The gene content determination of all three
kinetoplastid genomes is now essentially complete, and closure is
ongoing for the last few gaps. All four centers remain on schedule
with plans to jointly submit four manuscripts for publication in the
summer, describing the sequence analysis for each of the genomes
along with a comparative analysis of genome content and architecture.

On February 20th, 2004, the consortium froze a snapshot of the
sequence and annotation data.  The purpose of this major data release
(version 3.0 for L. major and version 2.0 for both T. brucei and T.
cruzi) is to make available a stable and easily accessible dataset
for the scientific community at large, and to provide a first set of
comprehensive analyses by the sequencing centers and community
contributors.

 


1. Status of Sequencing and Annotation


L. major:

Release version 3 of the Leishmania major genome sequence consists of
36 sequences, one for each chromosome. While most of these are
contiguous, a few are made up of correctly ordered and orientated
contigs separated by one or more sequence gaps, each represented by a
series of 100 Ns. Manual prediction of genes has now been complemented
with a complete first pass manual annotation for the whole genome. As
well as being in GeneDB, the data are also available on the Sanger
Institute Leishmania ftp site
(ftp://ftp.sanger.ac.uk/pub/databases/L.major_sequences).
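
For anyone wishing to locate these sequence gaps in a downloaded
chromosome sequence, the following is a minimal Python sketch. It is
not a consortium tool; the function name find_n_runs and its
min_length parameter are hypothetical, and it assumes the sequence has
already been read into a single string with any FASTA header removed.

```python
import re


def find_n_runs(sequence, min_length=100):
    """Return (start, end) pairs for each run of Ns of at least min_length.

    Coordinates are 0-based and half-open, so sequence[start:end] is the
    run itself. The default of 100 follows the gap convention described
    in the release notes above; adjust it if a file uses another length.
    """
    runs = []
    # Upper-case first so that soft-masked 'n' characters are also found.
    for match in re.finditer(r"N+", sequence.upper()):
        if match.end() - match.start() >= min_length:
            runs.append((match.start(), match.end()))
    return runs
```

Calling find_n_runs(chromosome_sequence) then gives the coordinates of
each gap, which can be compared against the annotation when a gene
model lies close to a contig boundary.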

 

T. brucei:

Release 2 of the T. brucei genome consists of the sequence and manual
annotation of the 11 megabase chromosomes, totalling 25.5Mb.
Chromosomes I - VIII are available as single contiguous sequences,
though remaining sequencing gaps are represented as 100 Ns.
Chromosomes IX and X remain as four contigs each: one large 2.8Mb
contig and three smaller unordered contigs in the case of chr IX, and
four ordered contigs in the case of chr X. Chr XI has been released as
3 scaffolds (where the individual contigs within each scaffold are
ordered and gaps are spanned by pUC and/or BAC clones) and 53
individual contigs.  Please consult the more detailed release notes
available via GeneDB for further information.

 

T. cruzi:

This is the second release (v 2.0) of the Trypanosoma cruzi genome
sequence, gene prediction and auto-annotation generated by the
TSK-TSC. Since the previous release (v 1.0), the contig sequences and
associated annotations have *not changed*; however, 40 contigs were
excluded from the dataset because they correspond to Mycoplasma
sequences. Mycoplasma is a common contaminant in eukaryotic cell
cultures, and the T. cruzi DNA used to construct the libraries was
contaminated with Mycoplasma DNA. Fortunately, many Mycoplasma
species have already been sequenced, thus facilitating the
identification of the Mycoplasma contigs. This task was also made
easier because of the nucleotide composition of the Mycoplasma genome
(70-75% AT) and the high coverage of the Mycoplasma contigs. It is
important to emphasize that all the Mycoplasma sequences
assembled separately, and we have not identified any chimeric contigs
containing both T. cruzi and Mycoplasma sequences. The T. cruzi release
v2.0 consists of 3,999 contigs totaling 60.3 Mb, after accounting for
the 40 Mycoplasma contigs that were removed (3,954 contigs in scaffolds
greater than 5 kb + 45 contigs greater than 5 kb not incorporated
into scaffolds). [Note that the previous T. cruzi release v1.0 consisted
of 4,039 contigs (3,969 contigs in scaffolds greater than 5 kb + 70
contigs greater than 5 kb not incorporated into scaffolds).] It is
important to note that the redundancy in the data (the T. cruzi
haploid genome size is estimated to be 40 Mb) reflects the
polymorphic nature of the genome and current efforts are now focused
on generating a dataset that separates the haplotypes, through a
combination of post-assembly data sorting and low-coverage "parental"
strain sequencing. The preliminary annotation of 25,041 gene models
reflected in this release data has been automatically generated
through a system that executes multiple steps of homology search and
then determines the best call based on multiple sources of evidence.
The next annotation release will incorporate manual curation of gene
families together with more refined gene function prediction based on
orthology groupings with manually curated gene products.
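
As an illustration of the nucleotide composition criterion mentioned
above for the Mycoplasma contigs, here is a minimal Python sketch of
an AT-content screen. It is not the procedure actually used by the
sequencing centers, which also relied on similarity to sequenced
Mycoplasma genomes and on contig coverage. The function names and the
0.70 threshold (taken from the lower end of the 70-75% AT range quoted
above) are assumptions made for this example.

```python
def at_fraction(sequence):
    """Return the fraction of A/T among the unambiguous bases of a sequence.

    Ns and other ambiguity codes are ignored. An input with no A, C, G
    or T at all returns 0.0 rather than raising an error.
    """
    seq = sequence.upper()
    at_count = seq.count("A") + seq.count("T")
    gc_count = seq.count("G") + seq.count("C")
    total = at_count + gc_count
    return at_count / total if total else 0.0


def flag_at_rich(contigs, threshold=0.70):
    """Return the names of contigs whose AT fraction is >= threshold.

    `contigs` is assumed to be a dict mapping contig name to sequence.
    A flagged contig is only a candidate contaminant and should be
    confirmed by a similarity search before it is excluded.
    """
    return [name for name, seq in contigs.items()
            if at_fraction(seq) >= threshold]
```

A composition screen on its own will also flag any genuinely AT-rich
region of the host genome, which is why it is best treated as a first
filter rather than a final decision.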

 


2. Data access and bulk downloading


We invite you to make GeneDB (http://www.genedb.org/) your main
source for annotation data for the three genomes and your gateway to
other annotation databases (TIGR). The sequencing centers and funding
agencies are committed to the long-term maintenance of this
centralized Kinetoplastid database and GeneDB will continue to update
genome annotation for all three genomes, integrating datasets from
other public sources and providing tools for database querying and
cross-species comparisons. As mentioned in the past, we welcome your
comments, corrections and updates to gene predictions and
annotations. You can provide feedback using the form available via
either the TIGR or GeneDB databases. Comments are automatically sent
to annotators at all the sequencing centers.  You may also contact
the database curators directly: Christiane Hertz-Fowler
(chf at sanger.ac.uk) for T. brucei, Elisabet Caler (ecaler at tigr.org)
for T. cruzi or Chris Peacock (csp at sanger.ac.uk) for L. major.

The sequence and annotation data are also available for downloading
in bulk for all three genomes (as FASTA or ARTEMIS/EMBL files) at the
TIGR ftp site in accordance with the respective sequencing centre's
data release policy for unpublished data. The directory structure and
file naming have been standardized across all three projects, and are
explained in the Excel file embedded below.

 

Most of you have already obtained the ftp address for the licensed
T. brucei and T. cruzi ftp sites at TIGR. If you have not, you may
fill in the form at
http://www.tigr.org/tigr-scripts/license/new.pl?genre=euk and a link
will be sent to you by e-mail within minutes. For the L. major data,
you may go directly to
ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/l_major/custom/.  The
dataset for this release and subsequent releases has been placed in
the '/custom' directories.

 

                


3. Our plans for the next few weeks


The Tri-Tryp sequencing consortium annotators and bioinformatics
analysts will convene for a week in early April to refine all gene
models and annotations based on comprehensive Tri-Tryp comparisons
(COG and synteny analyses). Even before that happens, however, we
will compute orthology groups and make them available through GeneDB.
This will allow you to examine with ease members of an orthology
group in trypanosomatids.


 


If you have any further questions please contact Bjorn Andersson
(Bjorn.Andersson at cgb.ki.se), Matt Berriman (mb at sanger.ac.uk), Najib
El-Sayed (nelsayed at tigr.org), Al Ivens (alicat at sanger.ac.uk) or Peter
Myler (mylerpj at sbri.org).

 

 

The Tri-Tryp Sequencing Consortium


-- 
Dr Christopher Peacock                  tel +44 (0)1223 494851
Senior Computer Biologist               email csp at sanger.ac.uk
Pathogen Sequencing Unit (PSU)
The Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK


