This section contains documentation and references related to PSORTb
v.3.0. When using PSORTb, please cite:
PSORTb v3.0: N.Y. Yu, J.R. Wagner, M.R. Laird, G.
Melli, S. Rey, R. Lo, P. Dao, S.C.
Sahinalp, M. Ester, L.J. Foster, F.S.L. Brinkman (2010) PSORTb 3.0: Improved protein
subcellular localization prediction with refined
localization subcategories and predictive capabilities
for all prokaryotes, Bioinformatics 26(13):1608-1615
A
plain text version of the documentation is available here.
1. History
Computational prediction of the subcellular localization of proteins
is a valuable tool for genome analysis and annotation, since
a protein's subcellular localization can provide clues regarding
its function in an organism. For bacterial pathogens, the prediction
of proteins on the cell surface is of particular interest due
to the potential of such proteins to be primary drug or vaccine
targets. A protein's subcellular localization is influenced
by several features present within the protein's primary structure,
such as the presence of a signal peptide or membrane-spanning
alpha-helices.
Several algorithms have been developed
to analyze single features such as these. The PSORT
family of programs, however, analyzes several features at once, using
information obtained from each analysis to generate an overall
prediction of localization site. Developed by Kenta Nakai in
1991, PSORT is an algorithm which assigns a probable localization
site to a protein given an amino acid sequence alone. Originally
developed for prediction of protein localization in Gram-negative
bacteria, PSORT was expanded into a suite of programs (PSORT,
PSORT II, iPSORT) capable of handling proteins from all classes
of organisms.
The Brinkman Laboratory headed development of PSORTb, an updated
version of the PSORT algorithm with significantly higher accuracy.
PSORTb includes new analytical modules designed to
capitalize on new discoveries and observations in protein sorting,
and benefits from a training dataset of over 11600 proteins of
known localization. Its focus is on precision over recall to
facilitate accurate predictions, at the expense of not making
as many predictions as other methods may make. PSORT-B v.1.1
was released in July 2003, followed by PSORTb v.2.0 in 2004.
Both have now been succeeded by PSORTb v.3.0.
2. PSORTb v.3.0 vs. previous versions of PSORT and PSORT-B
The original version of PSORT, still frequently
used for prediction of prokaryotic localization sites, used
a number of analyses arranged in an if/then rule-based format
to determine which of four localization sites a protein might
be resident at - cytoplasm, periplasm, inner or outer membrane
(see the documentation available at the PSORT WWW server for a full
explanation).
The PSORTb algorithm, however:
- uses updated versions of several of these analyses, as
well as several novel analytical methods
- utilizes a probabilistic system for determination of
a final prediction, rather than a rule-based system
- is capable of predicting all localization sites (PSORT
I does not predict extracellular proteins)
- does not force a prediction, returning a prediction of "Unknown"
if no prediction is made
- displays a 28% increase in precision (% of correct predictions) relative
to PSORT I
Furthermore, PSORTb v.2.0 offers several improvements over v.1.1:
- prediction of Gram-positive proteins added
- increased coverage (more predictions are made)
- automated flagging of proteins with potential multiple localization
sites
PSORTb v.3.0 has the
following improvements over v.2.0:
- Prediction capability for the domain of Archaea implemented
- Prediction capability for bacteria whose Gram-stains do not
reflect classical physical structures. For example, organisms that
stain Gram-negative but have no outer membrane, as well as organisms
that stain Gram-positive but have an outer membrane.
- Sub-category localization predictions added (predicts flagellar,
fimbrial, type III secretion apparatus, host-associated, and spore
localizations)
- Increased recall and coverage (more predictions are made for each
bacterial genome)
- Simplified software installation process, if local installation
is preferred over using the web. There are fewer packages to install,
since HMMTOP and its associated license are no longer required.
- Web server now allows batch sequence processing with the option
of returning results by email
- Motifs are updated and ones that are no longer 100% specific are
removed, improving software precision
Note also the change in name between PSORT-B
v.1.1 and PSORTb v.2.0 / v.3.0 - the hyphen was eliminated in order
to avoid conflicts during pattern matching or other searches.
3. PSORTb v.3.0: Analytical Modules
PSORTb v.3.0 consists of multiple analytical
modules, each of which analyzes one biological feature known
to influence or be characteristic of subcellular localization.
The modules may act as a binary predictor, classifying a protein
as either belonging or not belonging to a particular localization
site, or they may be multi-category, able to assign a protein
to one of several localization sites. When analyzing a Gram-negative
organism (organism with two cell membranes), possible localization
sites are: cytoplasm, cytoplasmic
membrane, periplasm, outer membrane and extracellular space.
Gram-positive and archaeal localization sites include: cytoplasm, cytoplasmic
membrane, cell wall and extracellular space. The new version also offers
prediction options for specialized organisms, such as those that
stain Gram-negative but have no outer membrane or cell wall
(e.g. Mycoplasma spp.), as well as organisms that stain
Gram-positive but have an outer membrane (e.g. Deinococcus
radiodurans). All modules are
capable of returning a negative prediction as well, such that
a protein will not be forced into one of the localization sites.
3.1 SCL-BLAST & SCL-BLASTe,
or SubCellular Localization BLAST, is a BLAST-P search against
the current local database of
proteins of known subcellular localization. An E-value
cutoff of 10e-9 is used to ensure that returned HSPs represent
true homologs, and an additional length restriction is placed
on any subject matches - the length of the query:subject HSP
must be within 80-120% of the length of the subject protein,
thus reducing potential errors associated with the domain
nature of proteins. SCL-BLAST selects the top-scoring HSP
from the list of results, and returns that protein's localization
site as its prediction, along with the name of the top-scoring
HSP and the associated E-value. SCL-BLAST is capable of assigning
a protein to any one of the possible localization sites. SCL-BLASTe
is a specialized implementation of this analysis, in which
a user's query protein is checked to see if it is an exact
match to a protein in the SCL-BLAST database. If an exact
match is found (100% similarity and within 1aa length), the
protein is immediately predicted as residing at that localization
site, and is not passed to subsequent modules.
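The two SCL-BLAST filters described above can be sketched in a few lines of Python. This is only an illustration, not PSORTb's own code: the `Hsp` record layout, the function names, the use of bit score to rank hits, and the choice to apply the filters before picking the top hit are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

E_VALUE_CUTOFF = 10e-9  # cutoff exactly as written in the text above


@dataclass
class Hsp:
    """One BLAST hit (hypothetical record layout for this sketch)."""
    subject_id: str
    localization: str
    e_value: float
    bit_score: float
    hsp_length: int      # length of the aligned query:subject region
    subject_length: int  # full length of the database protein


def passes_filters(hsp: Hsp) -> bool:
    """Apply the E-value cutoff and the 80-120% length restriction."""
    if hsp.e_value > E_VALUE_CUTOFF:
        return False
    ratio = hsp.hsp_length / hsp.subject_length
    return 0.8 <= ratio <= 1.2


def scl_blast_prediction(hsps: List[Hsp]) -> Optional[Hsp]:
    """Return the top-scoring hit that passes both filters, or None."""
    kept = [h for h in hsps if passes_filters(h)]
    return max(kept, key=lambda h: h.bit_score) if kept else None
```

A hit that covers only a single domain of a long database protein fails the length check even when its E-value is very low, which is the error the restriction is meant to reduce.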
3.2 Support Vector Machines (SVMs)
are machine learning-based classifiers trained to
classify a protein as belonging or not belonging to the set
of proteins at a specific localization site. PSORTb v.3.0
contains 13 SVMs, one for each of the localization sites (5
Gram-negative, 4 Gram-positive and 4 archaeal). Trained using frequent
sequences mined from proteins resident at a specific localization
site, each SVM will examine a query protein and determine
whether it does or does not belong at the localization site
in question. If the SVM determines that the protein belongs
to that particular site, that result is returned. Otherwise,
an unknown prediction is returned.
3.3 Motif & Profile Analysis
relies on the observation that a protein's function is closely
linked to its localization, and that several PROSITE motifs
characteristic of specific functions can be used to infer
specific localizations. Several potentially important motifs
were used to scan our current dataset, and motifs with a false-positive
rate of 0% were built into PSORTb, as were expanded versions
of the motifs, termed "profiles". Note that the 0% false-positive
rate is based on the dataset of proteins of known localization
that we currently have access to. We wish to emphasize that
this does not necessarily mean that such motifs and profiles
will always be 100% accurate. If you identify an incorrect
prediction, please contact us.
A submitted protein is scanned for the occurrence of any of
these motifs or profiles, and, if found, the localization
site associated with the motif/profile is returned as the
program's prediction. Motifs associated with each of the possible localization
sites are included in PSORTb.
3.4 Outer Membrane Motif Analysis
uses motifs generated from data mining techniques
applied to a set of 425 beta-barrel proteins to classify a
query protein as outer membrane or non-outer membrane (She
et al, 2003). The A Priori algorithm was used to mine for
short motifs found more often in outer membrane proteins than
in proteins at the other four localization sites. Over 250
such motifs were identified, and a query protein is scanned
for the co-occurrence of two or more of these motifs. A prediction
of outer membrane is returned if successful.
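The co-occurrence rule can be illustrated with a short Python sketch. The actual mined motifs are not listed in this documentation, so the three patterns used in the usage note below are invented placeholders, and writing the motifs as regular expressions is an assumption of this example.

```python
import re
from typing import Iterable


def count_matching_motifs(sequence: str, motifs: Iterable[str]) -> int:
    """Count how many distinct motifs occur at least once in the sequence."""
    return sum(1 for motif in motifs if re.search(motif, sequence))


def outer_membrane_call(sequence: str, motifs: Iterable[str],
                        min_hits: int = 2) -> str:
    """Return "OuterMembrane" when min_hits or more motifs co-occur."""
    if count_matching_motifs(sequence, motifs) >= min_hits:
        return "OuterMembrane"
    return "Unknown"
```

With the placeholder motifs `["GYG", "N.W", "F.F"]`, the toy sequence `AAGYGAANQWAA` contains two of them and would be called outer membrane, whereas `AAGYGAA` contains only one and would be returned as unknown.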
3.5 ModHMM was derived from PRODIV-HMM (Viklund and Elofsson, 2004), a hidden Markov model-based method that identifies transmembrane alpha helices,
which in turn identifies proteins spanning the cytoplasmic
membrane. Our analyses have shown that when three or more
TMHs are predicted in a protein, there is a >95% chance of
that protein being an inner membrane protein. PSORTb uses
a modified version of PRODIV-HMM to identify
transmembrane helices and returns a prediction of cytoplasmic
membrane if 3 or more are found.
3.6 A Signal Peptide
directs a protein for export past the cytoplasmic membrane,
and thus can be further used to differentiate cytoplasmic
and non-cytoplasmic proteins. A hidden Markov model was trained
on the dataset used to train the SignalP program, and is used
to predict potential signal peptide cleavage sites. If a cleavage
site with a high probability value is not found, the first
70 amino acids of the protein are passed to a support vector
machine module trained on the same data. If the SVM is unable
to recognize a signal peptide, the protein is predicted not
to have one and is classified as cytoplasmic. However, a protein
may possess a non-traditional signal peptide, so the results
of this analysis carry less weight than do other modules when
generating a final prediction.
4. PSORTb v.3.0: Final Prediction
In order to generate a final prediction, the results of each module
are combined and assessed. A probabilistic method and 5-fold
cross validation were used to assess the likelihood of a protein
being at a specific localization given the prediction of a certain
module. These likelihoods are used to generate a probability
value for each of the five localization sites for a user's query
protein.
PSORTb v.3.0 returns a list of the five localization sites and the
associated probability value for each. We consider 7.5
to be a good cutoff above which a single localization
can be assigned, and our precision and recall values for the
program are calculated using this cutoff.
In certain cases, two localization sites may both exhibit high scores,
which may indicate a protein with domains present in neighbouring
localization sites. In cases where a localization site has a
score between a lower bound (4.5 for Gram-negative, 5.0 for Gram-positive)
and 7.49, the result returned to the user will say "Unknown
- This protein may have multiple localization sites". In
cases like these, we recommend you examine the long format output
of the program's prediction to draw your own conclusion.
For organisms with specialized structures, such as those that stain Gram-negative but have no cell wall, the predictor may predict a cell wall localization, but the final result will say "Unknown - predicted localization does not exist". This may mean that the protein is a surface protein or that it is a false prediction. In
cases like this, we likewise recommend you examine the long format output
of the program's prediction to draw your own conclusion.
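The cutoffs described in this section can be summarized as a small decision function. The sketch below is a reading of the rules as stated here, not PSORTb's implementation: the function name and dictionary layout are hypothetical, only the highest-scoring site is inspected, and no lower bound is given in the text for Archaea, so only the two Gram classes are covered.

```python
from typing import Dict, Optional, Tuple

ASSIGN_CUTOFF = 7.5  # score at or above which a single site is assigned
MULTI_SITE_FLOOR = {"negative": 4.5, "positive": 5.0}  # lower bounds from the text
MULTI_SITE_MESSAGE = "Unknown - This protein may have multiple localization sites"


def final_prediction(scores: Dict[str, float],
                     gram: str) -> Tuple[str, Optional[float]]:
    """Turn per-site scores into a final call using the stated cutoffs."""
    site, top = max(scores.items(), key=lambda item: item[1])
    if top >= ASSIGN_CUTOFF:
        return site, top
    if top >= MULTI_SITE_FLOOR[gram]:
        return MULTI_SITE_MESSAGE, None
    return "Unknown", None
```

For example, Gram-negative scores of 5.87 for the outer membrane and 4.13 for the extracellular space fall below 7.5 but above 4.5, so the function returns the multiple-localization message rather than a single site.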
This section of the documentation will be updated
as changes are made to the web interface. Please check back
often for up-to-date instructions on program use.
5.1 Accessing PSORTb
5.1.1 WWW Access: PSORTb is available online at http://www.psort.org. The sequence submission form
for the current version of the program is located at http://www.psort.org/psortb. The older version 2 of the program
is accessible at http://www.psort.org/psortb2/index.html.
5.1.2 Standalone PSORTb: PSORTb is also available
as a standalone program to run in a Linux environment. The
file, as well as instructions for installation, is available
at the PSORTb Downloads page.
5.2 Submitting a Sequence for Analysis on the WWW
5.2.1 Sequence Submission: The sequence submission
form can be found at http://www.psort.org/psortb/. One or more sequences
can be pasted into the text box, or the "upload from
file" option can be used to analyze a file of one or
more sequences stored on your computer. When using the text
box option, please note that a maximum of 600,000 characters
can be pasted into the box.
5.2.2 Selecting Gram Stain: PSORTb
v.3.0 performs different analyses depending on the class of
organism. You are required to choose the appropriate Gram-stain and organism domain (Bacteria or Archaea)
for your sequences. Not sure which option to select? Our Genomes page lists the classifications we used
when we analyzed sequenced genomes. If your organism is not
found there, try the NCBI Taxonomy Browser which provides a rough taxonomy for
many bacterial species which may be helpful (for example,
there is an association between proteobacteria and Gram-negative
stain properties) or see the authoritative Bergey's Manual for
Gram-stain properties for your microbe of interest.
5.2.3 Selecting Gram Stain - Advanced:
There are some organisms whose Gram stains do not accurately reflect their cellular structure. PSORTb v.3.0 provides two additional analysis options for these organisms, "Positive with outer membrane" and "Negative without outer membrane", which become available by selecting "Advanced"
in the Gram stain option. The former option allows users to analyze organisms that stain
Gram-positive but also have an outer membrane, such as Deinococcus radiodurans, Mycobacterium spp.,
and the Veillonellaceae family of the Firmicutes phylum. The latter option allows users to analyze organisms that stain
Gram-negative but have no outer membrane, such as organisms of
the Tenericutes phylum, e.g. Mycoplasma spp.
5.2.4 Receiving results: Results from PSORTb can be received in two ways:
via email or by on-screen display. We recommend the email results method,
because a submission made this way is not subject to the limit of 100 proteins
per submission that applies to the web display mode. Because of processing
constraints and the large demand on the service, we limit users to 50
submissions per 24 hours. However, there is no practical limit to the number
of proteins per submission when using the email results method, so if you have
a large number of proteins to analyze, please batch them up and use this
method to process your results.
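Before submitting, it can be useful to check a FASTA text against the two limits mentioned above (600,000 characters for the text box and 100 proteins for on-screen display). The following Python sketch does that. The function name and the wording of the notes are hypothetical, and the check simply counts title lines beginning with `>`.

```python
from typing import List

TEXT_BOX_MAX_CHARS = 600_000     # text box limit stated in 5.2.1
WEB_DISPLAY_MAX_PROTEINS = 100   # on-screen display limit stated in 5.2.4


def check_web_submission(fasta_text: str) -> List[str]:
    """Return advisory notes for a planned web submission (empty if none)."""
    notes = []
    n_proteins = sum(1 for line in fasta_text.splitlines()
                     if line.startswith(">"))
    if len(fasta_text) > TEXT_BOX_MAX_CHARS:
        notes.append("too long for the text box; use the file upload option")
    if n_proteins > WEB_DISPLAY_MAX_PROTEINS:
        notes.append("too many proteins for on-screen display; "
                     "request results by email")
    return notes
```

An empty list means neither limit is exceeded. The limit of 50 submissions per 24 hours is not checked here, since it depends on the user's submission history.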
5.2.5 Acceptable Organisms: PSORTb v.3.0 accepts
protein sequences from Gram-negative and Gram-positive bacteria as well as Archaea.
All protein sequences from eukaryotic organisms
must be analyzed using a different tool. See the Resources page for possible options.
5.2.6 Acceptable Formats: PSORTb
requires that a PROTEIN sequence be submitted in FASTA format.
A sequence within a FASTA sequence file consists of three parts:
- A title line, which must begin with a `>' symbol,
and may be followed by any type of text
- A newline character at the end of the title line
- The sequence itself, which continues until the end of
file or the next `>' is reached
An example of FASTA format is shown below:
>gi|31562958|sp|Q8CWD2|BTUF_ECOL6
MAKSLFRALVALSFLAPLWLNAAPRVITLSPANTELAFAAGITPVGVSSYSDYPLQAQKIEQVSTWQGMN
LERIVALKPDLVIAWRGGNAERQVDQLASLGIKVMWVDATSIEQIANALRQLAPWSPQPDKAEQAAQSLL
DQYAQLKAQYADKPKKRVFLQFGINPPFTSGKESIQNQVLEVCGGENIFKDSRVPWPQVSREQVLARSPQ
AIVITGGPDQIPKIKQYWGEQLKIPVIPLTSDWFERASPRIILAAQQLCNALSQVD
For more information, see the description at NCBI or contact us.
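The three-part structure described above (title line, newline, sequence) is simple enough to read with a short Python generator. This is a minimal sketch for checking input files before submission, not part of PSORTb; the function name `read_fasta` is chosen for the example.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (title, sequence) pairs from lines of FASTA-formatted text."""
    title: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)  # sequence continues until the next '>'
    if title is not None:
        yield title, "".join(chunks)
```

Calling `list(read_fasta(open("mysequences.txt")))` would return one `(title, sequence)` tuple per record, with the wrapped sequence lines joined into a single string.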
5.2.7 Whole Genome Analysis: In order to reduce the
load on the PSORTb servers, pre-calculated results for whole
bacterial genomes are available on the PSORTb site, on the
Genomes page.
5.3 Submitting a Sequence for Analysis
to Standalone PSORTb
5.3.1 Sequence File: One
or more sequences in FASTA format can be submitted to standalone
PSORTb, provided they are all contained within one file (e.g.
mysequences.txt) and are all from the same Gram class of organism.
If you have both Gram-negative and Gram-positive sequences
you wish to analyze, they must be divided into two files and
run separately.
5.3.2 Command line syntax:
Standalone PSORTb contains several options and arguments,
which are described below. The most basic command, however,
which will be sufficient for most instances, is:
$ psort [-p|-n|-a] mysequences.txt > mysequences.out
- psort calls the PSORTb program
- -p (Gram-positive) or -n (Gram-negative) or -a (Archaea) tells the
program which predictive model to use
- mysequences.txt is the name of your FASTA file containing the
sequences to be analyzed
- > mysequences.out sends the output to a new file that will be
created called mysequences.out. If no > is used, the output will be
written to the terminal display.
Usage:
psort [-p|-n|-a] [OPTIONS] [SEQFILE]
Runs psort on the sequence file SEQFILE. If SEQFILE isn't
provided then sequences will be read from STDIN.
--help, -h        Displays usage information
--positive, -p    Gram positive bacteria
--negative, -n    Gram negative bacteria
--archaea, -a     Archaea
--cutoff, -c      Sets a cutoff value for reported results
--divergent, -d   Sets a cutoff value for the multiple
                  localization flag
--matrix, -m      Specifies the path to the pftools installation.
                  If not set, defaults to the value of the
                  PSORT_PFTOOLS environment variable.
--format, -f      Specifies sequence format (default is FASTA)
--output, -o      Specifies the format for the output (default
                  is 'normal'). Value can be one of: terse, long
                  or normal
--root, -r        Specify PSORT_ROOT for running local copies.
                  If not set, defaults to the value of the
                  PSORT_ROOT environment variable.
--server, -s      Specifies the PSORT server to use
--verbose,        Be verbose while running
--version,        Print the version of PSORTb used
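When standalone PSORTb is called from a script, the command line can be assembled programmatically. The Python sketch below builds an argument list from the options shown in the usage text. The helper names are hypothetical, and it is an assumption that `--output` and `--cutoff` take their value as the following argument, since the usage text does not show the value syntax.

```python
import subprocess
from typing import List, Optional

ORGANISM_FLAGS = {"positive": "-p", "negative": "-n", "archaea": "-a"}
OUTPUT_FORMATS = {"normal", "terse", "long"}


def build_psort_command(seqfile: str, organism: str,
                        output_format: str = "normal",
                        cutoff: Optional[float] = None) -> List[str]:
    """Assemble a psort argument list from the documented options."""
    if organism not in ORGANISM_FLAGS:
        raise ValueError(f"unknown organism class: {organism}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {output_format}")
    command = ["psort", ORGANISM_FLAGS[organism]]
    if output_format != "normal":  # 'normal' is the documented default
        command += ["--output", output_format]
    if cutoff is not None:
        command += ["--cutoff", str(cutoff)]
    command.append(seqfile)
    return command


def run_psort(command: List[str], out_path: str) -> None:
    """Run psort and write its standard output to out_path."""
    with open(out_path, "w") as handle:
        subprocess.run(command, stdout=handle, check=True)
```

`build_psort_command("mysequences.txt", "negative")` yields the same arguments as the basic command shown in 5.3.2, and `run_psort` plays the role of the `>` redirection.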
5.3.3 Help: Typing psort
-h at the command prompt will bring up a list of available
options and usage instructions.
5.4 Understanding the Output
5.4.1 Output Formats: PSORTb
allows the user to select one of three output formats from
the sequence submission screen: Normal, Tab-delimited (terse
format) and Tab-delimited (long format). Normal output is
recommended for analysis of one or a few sequences, whereas
tab-delimited output in either format is recommended for the
analysis of a large number of sequences. The output formats
are described below. If you would like to try the examples
given below for yourself, input sequences are below:
Gram-positive input sequence:
>SAK_BPP42
MLKRSLLFLTVLLLLFSFSSITNEVSASSSFDKGKYKKGDDASYFEPTGPYLMVNVTGVDGKRNELLSPR
YVEFPIKPGTTLTKEKIEYYVEWALDATAYKEFRVVELDPSAKIEVTYYDKNKKKEETKSFPITEKGFVV
PDLSEHIKNPGFNLITKVVIEKK
Gram-negative input sequence:
>NP_949347.1
MQGHHFGGDMSNSEAIDNTTAKLRLAQSSSLLALALLIGSAPAQAADTDWGWLAIGAPAATAQGWTGKGV
VIGVVDTGIDFSHPALSGRAFDYNYGSFVAGSNHPHATHVAGIIGATDINRGMEGVAPDVRFSSMKIFTG
AGGSYLGDAAVADAYDGAIGSGVRIFNNSWGSSDSIANFTSREELLAHEPLLVGAFTRAVNADAVLVWST
GNDGRSQPSWQAAAPYYIQELKANWIAVTSVGENGTIASYANACGVAKAWCLAAPGGDFNPGIYSTIPGK
DYGYMSGTSMAAPYVTGATAIARQMFPKASGAQLAQIVLQTSRDIGAPGIDDVYGWGLLAVDNIVDTINP
RGAALFASAAWGRFTTLSAIGNTVLDRISDLRNGRGDVVTAPLAFAGQNGAFSQSGSNPRNAYAADLAAA
PQPSPLGFGSVWARGLAGRATLSGSASSPQTTADISGGLLGFDLVNNQNLLVGIAGGGSNTNLTASGISD
KAGAQAWHVLGYAAAMYGPAFVNVAGGWNSFDQSYQRRVIPGTAGTVFASTISAAQSSSTDVAYFFQGRG
GWTFQTEVGRIEPYVHGATRNQSFGGFSETNASIFSLSVPSASLSEAEYGAGVRWACAPIKTVDQRVAVA
PTIDLAYVRFTNDGPIQVETNLLGTSVVGQTAALGADAIRVAAGLSLTSLAGISGSFGYTGTVRDAATAH
TVSGGLSIKF
Archaeal input sequence:
>YP_001689002.1
MFEFITDEDERGQVGIGTLIVFIAMVLVAAIAAGVLINTAGYLQSKGSATGEEASAQVSNRINIVSAYGN
VNNEKVDYVNLTVRQAAGADNINLTKSTIQWIGPDRATTLTYSSNSPSSLGENFTTESIKGSSADVLVDQ
SDRIKVIMYASGVSSNLGAGDEVQLTVTTQYGSKTTYWAQVPESLKDKNA
5.4.2 Normal Output: The Normal output option displays
the results of each of PSORTb's analytical modules, the localization
scores for each of the localization sites, as well as a final prediction
and associated score (if one site scores above the 7.5 cutoff).
Below are examples of Gram-positive, Gram-negative and archaeal
output, using the input sequences given in 5.4.1. Descriptions
of the output fields can be found beneath each output example.
Gram-positive sample output:
SeqID: SAK_BPP42
Analysis Report:
  CMSVM+        Unknown           [No details]
  CWSVM+        Unknown           [No details]
  CytoSVM+      Unknown           [No details]
  ECSVM+        Extracellular     [No details]
  ModHMM+       Unknown           [1 internal helix found]
  Motif+        Unknown           [No motifs found]
  Profile+      Unknown           [No matches to profiles found]
  SCL-BLAST+    Extracellular     [matched 134189: Extracellular protein]
  SCL-BLASTe+   Unknown           [No matches against database]
  Signal+       Non-cytoplasmic   [Signal peptide detected]
Localization Scores:
  Cytoplasmic           0.0
  CytoplasmicMembrane   0.0
  Cellwall              0.2
  Extracellular         9.98
Final Prediction:
  Extracellular         9.98
SeqID returns whatever was found on the title line of the FASTA format
input file.
The Analysis Report contains the results of each of PSORTb's analytical
modules. The module name is listed in the left-most column,
the centre column contains the localization site predicted
by that module (or "Unknown" if the module did not
generate a prediction), and the right-most column contains
comments related to the module's findings. The modules in
the Gram-positive version are as follows:
- CMSVM+: The Gram-positive version of the support vector
machine trained to identify cytoplasmic membrane proteins.
Returns cytoplasmic membrane or unknown.
- CWSVM+: The support vector machine trained to identify
cell wall proteins (Gram-positive and Archaea). Returns cell wall
or unknown.
- CytoSVM+: The Gram-positive version of the support vector
machine trained to identify cytoplasmic proteins. Returns
cytoplasmic or unknown.
- ECSVM+: The Gram-positive version of the support vector
machine trained to identify extracellular proteins. Returns
extracellular or unknown.
- ModHMM+: Predicts transmembrane helices within the sequence.
The presence of 3 or more transmembrane helices causes the
module to return a prediction of cytoplasmic membrane, otherwise
unknown is returned. The Details column returns the number
of predicted helices.
- Motif+: Searches the sequence for Gram-positive
motifs indicative of a specific localization site. If
a match occurs, the localization site associated with that
motif is reported, otherwise unknown is returned. The details
column returns a link to the motif in PROSITE.
- Profile+: Searches the sequence for Gram-positive
profiles indicative of a specific localization site.
If a match occurs, the localization site associated with
that profile is reported, otherwise unknown is returned.
The details column returns a link to the profile in PROSITE.
- SCL-BLAST+: Performs a BLASTP search against the Gram-positive
subset of the current PSORTdb
dataset. If a match is found, its associated localization
site is returned and a link to that protein's record at
NCBI is provided in the Details column.
- SCL-BLASTe+: Like SCL-BLAST, but only returns a match
if the query and subject have 100% similarity and are within
1aa in length of each other. If a match is found, its associated
localization site is returned and a link to that protein's
record at NCBI is provided in the Details column.
- Signal+: Searches the sequence for the presence of a Gram-positive
cleavable N-terminal signal peptide. If a signal peptide
is detected, the module returns a prediction of non-cytoplasmic,
otherwise a result of unknown is returned.
In the Localization
Scores area, the confidence values for each of
the localization sites are given. If one of the sites has
a score of 7.5 or greater, this site and its score are returned
in the Final Prediction section. If two sites have
high scores, a flag of "This protein may have multiple
localization sites" is also returned in the Final Prediction
field.
Gram-negative sample output (to illustrate multiple localization):
SeqID: NP_949347.1
Analysis Report:
  CMSVM-        Unknown                        [No details]
  CytoSVM-      Unknown                        [No details]
  ECSVM-        Extracellular                  [No details]
  ModHMM-       Unknown                        [No internal helices found]
  Motif-        Unknown                        [No motifs found]
  OMPMotif-     Unknown                        [No motifs found]
  OMSVM-        OuterMembrane                  [No details]
  PPSVM-        Unknown                        [No details]
  Profile-      Unknown                        [No matches to profiles found]
  SCL-BLAST-    OuterMembrane, Extracellular   [matched 3646417: Outer membrane (Autotransporter)]
  SCL-BLASTe-   Unknown                        [No matches against database]
  Signal-       Non-cytoplasmic                [Signal peptide detected]
Localization Scores:
  Cytoplasmic           0.00
  CytoplasmicMembrane   0.00
  Periplasm             0.00
  OuterMembrane         5.87
  Extracellular         4.13
Final Prediction:
  Unknown (This protein may have multiple localization sites)
The modules that differ from those described for the Gram-positive
version of PSORTb are listed below:
- CMSVM-: The Gram-negative version of the support vector
machine trained to identify cytoplasmic membrane proteins.
Returns cytoplasmic membrane or unknown.
- CytoSVM-: The Gram-negative version of the support vector
machine trained to identify cytoplasmic proteins. Returns
cytoplasmic or unknown.
- ECSVM-: The Gram-negative version of the support vector
machine trained to identify extracellular proteins. Returns
extracellular or unknown.
- ModHMM-: Predicts transmembrane helices within the sequence.
The presence of 3 or more transmembrane helices causes the
module to return a prediction of cytoplasmic membrane, otherwise
unknown is returned. The Details column returns the number
of predicted helices.
- Motif-: Searches the sequence for Gram-negative
motifs indicative of a specific localization site. If
a match occurs, the localization site associated with that
motif is reported, otherwise unknown is returned. The details
column returns a link to the motif in PROSITE.
- OMPMotif-: Searches the sequence for Gram-negative
outer membrane protein motifs. If a match occurs, outer
membrane is reported, otherwise unknown is returned. The
details column returns the numerical identifiers of the
motifs found.
- OMSVM-: The support vector machine trained to identify
outer membrane proteins. Returns outer membrane or unknown
(Gram-negative only).
- PPSVM-: The support vector machine trained to identify
periplasmic proteins. Returns periplasm or unknown (Gram-negative
only).
- Profile-: Searches the sequence for Gram-negative
profiles indicative of a specific localization site.
If a match occurs, the localization site associated with
that profile is reported, otherwise unknown is returned.
The details column returns a link to the profile in PROSITE.
- SCL-BLAST-: Performs a BLASTP search against the Gram-negative
subset of the current PSORTdb dataset.
If a match is found, its associated localization site is
returned and a link to that protein's record at NCBI is
provided in the Details column.
- SCL-BLASTe-: See above
- Signal-: Searches the sequence for the presence of a Gram-negative
cleavable N-terminal signal peptide. If a signal peptide
is detected, the module returns a prediction of non-cytoplasmic,
otherwise a result of unknown is returned.
For the Gram-stain - Advanced options, the output for "Gram-positive with
outer membrane" is similar to the normal Gram-negative output (with
predictions for periplasmic and outer membrane localizations). The output
for the "Gram-negative without outer membrane" option is similar to
the normal Gram-positive output, except that the cell wall localization
is not predicted in the final output, since Mycoplasma spp.
and most Tenericutes are phylogenetically more similar to
Gram-positive organisms but lack a peptidoglycan cell wall. If the
modules predict "cell wall" as a protein's localization,
the final localization will be flagged as "Unknown - predicted
localization does not exist". From what we have observed,
proteins with this prediction sometimes have a surface (cytoplasmic
membrane) localization. Users should use their own discretion when
interpreting PSORTb prediction results in this case.
Archaeal sample output (to illustrate sub-category localization detection):
SeqID: YP_001689002.1
Analysis Report:
  CMSVM_a       Unknown           [No details]
  CWSVM_a       Unknown           [No details]
  CytoSVM_a     Unknown           [No details]
  ECSVM_a       Extracellular     [No details]
  ModHMM_a      Unknown           [1 internal helix found]
  Motif_a       Unknown           [No motifs found]
  Profile_a     Unknown           [No matches to profiles found]
  SCL-BLAST_a   Extracellular     [matched 47117675: Flagellin B1 precursor]
  Signal_a      Non-Cytoplasmic   [Signal peptide detected]
Localization Scores:
  Cytoplasmic           0.00
  CytoplasmicMembrane   0.00
  Cellwall              0.02
  Extracellular         9.98
Final Prediction:
  Extracellular         9.98
Secondary localization(s):
  Flagellar
- CMSVM_a: The archaeal version of the support vector
machine trained to identify cytoplasmic membrane proteins.
Returns cytoplasmic membrane or unknown.
- CWSVM_a: The archaeal version of the support vector machine trained to identify
cell wall proteins. Returns cell wall
or unknown.
- CytoSVM_a: The archaeal version of the support vector
machine trained to identify cytoplasmic proteins. Returns
cytoplasmic or unknown.
- ECSVM_a: The archaeal version of the support vector
machine trained to identify extracellular proteins. Returns
extracellular or unknown.
- ModHMM_a: Predicts transmembrane helices within the sequence.
The presence of 3 or more transmembrane helices causes the
module to return a prediction of cytoplasmic membrane, otherwise
unknown is returned. The Details column returns the number
of predicted helices.
- Motif_a: Searches the sequence for Gram-positive
motifs indicative of a specific localization site, with the
ones not applicable to Archaea removed. If
a match occurs, the localization site associated with that
motif is reported, otherwise unknown is returned. The details
column returns a link to the motif in PROSITE.
- Profile_a: Searches the sequence for Gram-positive
profiles indicative of a specific localization site, with
the ones not applicable to Archaea removed.
If a match occurs, the localization site associated with
that profile is reported, otherwise unknown is returned.
The details column returns a link to the profile in PROSITE.
- SCL-BLAST_a: Performs a BLASTP search against the Gram-positive and archaeal
subset of the current PSORTdb
dataset. If a match is found, its associated localization
site is returned and a link to that protein's record at
NCBI is provided in the Details column.
- SCL-BLASTe_a: Like SCL-BLAST, but only returns a match
if the query and subject have 100% similarity and are within
1aa in length of each other. If a match is found, its associated
localization site is returned and a link to that protein's
record at NCBI is provided in the Details column.
- Signal_a: Searches the sequence for the presence of a Gram-positive
cleavable N-terminal signal peptide. If a signal peptide
is detected, the module returns a prediction of non-cytoplasmic,
otherwise a result of unknown is returned.
5.4.3 Tab-delimited (Terse Format) Output: Tab-delimited
terse format output returns a list of inputted sequences,
each one on a new line, with 3 columns: SeqId contains the
information from the FASTA file definition line, Localization
contains the final prediction of localization site (or "Unknown"
if no site scored above 7.5), and Score contains the confidence
value associated with this localization site. Tab characters
occur between the columns, and, in the case of a multiple
sequence submission, each sequence record is separated by
newline characters. This format can be easily read into a
spreadsheet, using a program such as MS Excel.
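Besides a spreadsheet, the terse format can be read with a few lines of Python. The sketch below assumes the three tab-separated columns described above. It is an assumption that the output may start with a header row whose first field is "SeqID" (such a row is skipped if present), and rows whose score field is empty or non-numeric are kept with a score of `None`.

```python
from typing import List, Optional, Tuple

TerseRow = Tuple[str, str, Optional[float]]


def parse_terse(text: str) -> List[TerseRow]:
    """Parse terse output into (seq_id, localization, score) tuples."""
    rows: List[TerseRow] = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) < 2 or not fields[0]:
            continue  # blank or malformed line
        if fields[0].lower() == "seqid":
            continue  # assumed header row
        score: Optional[float] = None
        if len(fields) > 2:
            try:
                score = float(fields[2])
            except ValueError:
                score = None  # e.g. no score reported for "Unknown"
        rows.append((fields[0], fields[1], score))
    return rows
```

The parsed rows can then be filtered in the usual way, for example `[r for r in rows if r[1] != "Unknown"]` to keep only sequences with an assigned localization.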
5.4.4 Tab-delimited (Long Format) Output: Tab-delimited
long format output returns a list of inputted sequences, each
one on a new line, and with all of the information from the
PSORTb results placed into columns. The SeqId, module results
and comments from the analysis report, localizations and scores,
and the final prediction and score are each placed into their
own column.
5.5 Options for Retrieving Results
5.5.1 View results via the web: PSORTb prediction results are displayed
as a webpage, in the output format chosen by the user. This is the
most convenient way to view results if you are only analyzing a
few proteins.
5.5.2 Send results by email: PSORTb prediction results are sent to
a user-provided email address in the output format chosen by the user.
This method is suitable for the analysis of a larger number of proteins,
or when the output is to be transferred to another document and/or
used for further analyses.
PSORTb is designed to emphasize precision (or
specificity) over recall (or sensitivity), and as a result,
some classes of proteins are not predicted well. The following
issues must be considered when performing an analysis using
the current version of PSORTb:
6.1 Proteins resident at multiple localization sites:
Many proteins can exist at multiple localization sites. Examples
of such proteins include integral membrane proteins with large
periplasmic domains, or autotransporters, which contain an
outer membrane pore domain and a cleaved extracellular domain.
The current version of PSORTb handles this situation by flagging
proteins which show a distribution of localization scores
favouring two sites, rather than one. It is important to examine
the distribution of localization scores carefully in order
to determine if your submitted protein may have multiple localization
sites and if so, which two sites are involved.
6.2 Lipoproteins: The current version of PSORTb does
not detect lipoprotein motifs.
6.3 Precision vs. Recall: PSORTb is designed
to emphasize precision (or specificity) over recall (or sensitivity).
Programs which make predictions at all costs often
provide incorrect or incomplete results, which can be propagated
through annotated databases, datasets and reports in the literature.
We believe that a confident prediction is more valuable than
any prediction, and we have designed the program to this end.
Note, however, that a user may choose to use their own reduced
cutoff score in generating final predictions.