Difference between revisions of "Curation Guidelines"
Francislon (talk | contribs) m (Text replacement - "http://tncentral.ncc.unesp.br/cgi-bin/tn_report.pl?id=" to "https://tncentral.ncc.unesp.br/report/te/") |
|||
(2 intermediate revisions by one other user not shown) | |||
Line 19: | Line 19: | ||
[[File:Curation-fig2.png|center|thumb|600x600px|'''Figure 2.''' TnCentral Curation Guidelines.]] | [[File:Curation-fig2.png|center|thumb|600x600px|'''Figure 2.''' TnCentral Curation Guidelines.]] | ||
− | *Sometimes, a mobile element may contain other mobile elements inserted within it (“'''internal'''” mobile elements). Features within these internal elements should be associated with the internal element. For example, [[:File:Curation-fig3.png|Figure 3]] shows that transposon [ | + | *Sometimes, a mobile element may contain other mobile elements inserted within it (“'''internal'''” mobile elements). Features within these internal elements should be associated with the internal element. For example, [[:File:Curation-fig3.png|Figure 3]] shows that transposon [https://tncentral.ncc.unesp.br/report/te/Tn1546.2-AB247327 Tn''1546.2''] contains insertion sequence [https://tncentral.ncc.unesp.br/ISfinder/scripts/ficheIS.php?name=IS1216E IS''1216E'']. The ''tnp'' gene of [https://tncentral.ncc.unesp.br/ISfinder/scripts/ficheIS.php?name=IS1216E IS''1216E''] should be named ''tnp'' ([https://tncentral.ncc.unesp.br/ISfinder/scripts/ficheIS.php?name=IS1216E IS''1216E'']), not ''tnp'' ([https://tncentral.ncc.unesp.br/report/te/Tn1546.2-AB247327 Tn''1546.2'']). |
[[File:Curation-fig3.png|center|thumb|820x820px|'''Figure 3.''' TnCentral Curation Guidelines.]] | [[File:Curation-fig3.png|center|thumb|820x820px|'''Figure 3.''' TnCentral Curation Guidelines.]] | ||
Line 54: | Line 54: | ||
*Select type from pull-down menu (e.g., transposon, [[General Information/What Is an IS?|insertion sequence]], integron). | *Select type from pull-down menu (e.g., transposon, [[General Information/What Is an IS?|insertion sequence]], integron). | ||
− | *Enter the name of the mobile element (e.g., [ | + | *Enter the name of the mobile element (e.g., [https://tncentral.ncc.unesp.br/report/te/Tn21-AF071413 Tn''21'']) in the text box. |
*If one mobile element is interrupted by '''insertion''' of another mobile element, it will sometimes be named '''element1::element2'''. (e.g., [https://tncentral.ncc.unesp.br/ISfinder/scripts/ficheIS.php?name=IS1326 IS''1326'']::[https://tncentral.ncc.unesp.br/ISfinder/scripts/ficheIS.php?name=IS1353 IS''1353'']; this is read as “[https://tncentral.ncc.unesp.br/ISfinder/scripts/ficheIS.php?name=IS1326 IS''1326''] interrupted by [https://tncentral.ncc.unesp.br/ISfinder/scripts/ficheIS.php?name=IS1353 IS''1353'']”). However, this convention is not always followed. | *If one mobile element is interrupted by '''insertion''' of another mobile element, it will sometimes be named '''element1::element2'''. (e.g., [https://tncentral.ncc.unesp.br/ISfinder/scripts/ficheIS.php?name=IS1326 IS''1326'']::[https://tncentral.ncc.unesp.br/ISfinder/scripts/ficheIS.php?name=IS1353 IS''1353'']; this is read as “[https://tncentral.ncc.unesp.br/ISfinder/scripts/ficheIS.php?name=IS1326 IS''1326''] interrupted by [https://tncentral.ncc.unesp.br/ISfinder/scripts/ficheIS.php?name=IS1353 IS''1353'']”). However, this convention is not always followed. | ||
− | *Integrons should be named '''In_'''''<parent TE name>'', e.g., [ | + | *Integrons should be named '''In_'''''<parent TE name>'', e.g., [https://tncentral.ncc.unesp.br/report/te/In_Tn7-NC_002525 In_Tn''7''] for an integron found in [https://tncentral.ncc.unesp.br/report/te/Tn7-NC_002525 Tn''7''] unless they are given a specific name in a paper about the Tn (e.g., [https://tncentral.ncc.unesp.br/report/te/In2-AF071413 In''2''] in [https://tncentral.ncc.unesp.br/report/te/Tn21-AF071413 Tn''21''], [https://tncentral.ncc.unesp.br/report/te/In4-U12338.3 In''4''] in [https://tncentral.ncc.unesp.br/report/te/Tn1696-U12338.3 Tn''1696'']). |
Line 65: | Line 65: | ||
The '''/note''' field can contain the following sub-fields. Sub-fields should be separated with semi-colons. | The '''/note''' field can contain the following sub-fields. Sub-fields should be separated with semi-colons. | ||
− | *'''Accession''' = Accession Number of the enhanced GenBank file for this mobile element. It is generally of the form Mobile Element Name-Original GenBank Accession, where the Original GenBank Accession is the GenBank Accession of the large DNA sequence (e.g., plasmid or chromosome) from which the mobile element was isolated (e.g., [ | + | *'''Accession''' = Accession Number of the enhanced GenBank file for this mobile element. It is generally of the form Mobile Element Name-Original GenBank Accession, where the Original GenBank Accession is the GenBank Accession of the large DNA sequence (e.g., plasmid or chromosome) from which the mobile element was isolated (e.g., [https://tncentral.ncc.unesp.br/report/te/Tn21-AF071413 Tn''21''- AF071413]). |
− | *'''Family''' = Mobile element family (e.g., Family = [ | + | *'''Family''' = Mobile element family (e.g., Family = [https://tncentral.ncc.unesp.br/report/te/Tn3-V00613 Tn''3'']). For [[General Information/What Is an IS?|Insertion Sequences]] (IS), this information can often be found in ISfinder. |
− | *'''Group''' = Mobile element group (e.g., Group = [ | + | *'''Group''' = Mobile element group (e.g., Group = [https://tncentral.ncc.unesp.br/report/te/Tn21-AF071413 Tn''21'']). For [[General Information/What Is an IS?|Insertion Sequences]], this can often be found in ISfinder. |
*'''Synonyms''' = Other names for the mobile element. | *'''Synonyms''' = Other names for the mobile element. | ||
Line 77: | Line 77: | ||
*'''Transposition''' = Enter “'''yes'''”, “'''no'''”, “'''ND'''” (not determined) depending on whether the transposition ability of the transposon has been observed. The presence of 5bp direct repeats just outside of the transposon sequence can be used as evidence of transposition. If the original GenBank file covers a larger region than just the transposon itself (e.g., a '''full plasmid sequence'''), then the presence of direct repeats can be checked. | *'''Transposition''' = Enter “'''yes'''”, “'''no'''”, “'''ND'''” (not determined) depending on whether the transposition ability of the transposon has been observed. The presence of 5bp direct repeats just outside of the transposon sequence can be used as evidence of transposition. If the original GenBank file covers a larger region than just the transposon itself (e.g., a '''full plasmid sequence'''), then the presence of direct repeats can be checked. | ||
− | *'''Other Information''' = Miscellaneous information. For [[General Information/What Is an IS?|Insertion Sequences]], indicate the [https://isfinder.biotoul.fr/ ISfinder] accession number. For minor sequence variants, indicate the percent identity to the reference sequence and include the accession number of the reference sequence (e.g., Other Information = 99% identical to reference sequence for [https://tncentral.ncc.unesp.br/ISfinder/scripts/ficheIS.php?name=IS26 IS''26''] ( | + | *'''Other Information''' = Miscellaneous information. For [[General Information/What Is an IS?|Insertion Sequences]], indicate the [https://isfinder.biotoul.fr/ ISfinder] accession number. For minor sequence variants, indicate the percent identity to the reference sequence and include the accession number of the reference sequence (e.g., Other Information = 99% identical to reference sequence for [https://tncentral.ncc.unesp.br/ISfinder/scripts/ficheIS.php?name=IS26 IS''26''] (X00011). If no % identity is noted, it is assumed to be 100% identical to reference (although 100% identity can be noted if desired.) Other miscellaneous information can also be included here. |
*'''Hosts''': the Hosts sub-field has several sub-sub-fields; these are separated from each other by the “|” symbol (see Examples to Copy and Paste). Organism and Molecular Source information should be taken from the Description Panel of the [https://www.snapgene.com/ SnapGene] file. Epidemiological information (e.g., Region, Country, Other LocInfo, Date Identified) can be taken '''from articles in the references'''). No Host information should be recorded for [[General Information/What Is an IS?|Insertion Sequences]]. ISfinder is considered the definitive resource for the host range of [[General Information/What Is an IS?|Insertion Sequences]]. | *'''Hosts''': the Hosts sub-field has several sub-sub-fields; these are separated from each other by the “|” symbol (see Examples to Copy and Paste). Organism and Molecular Source information should be taken from the Description Panel of the [https://www.snapgene.com/ SnapGene] file. Epidemiological information (e.g., Region, Country, Other LocInfo, Date Identified) can be taken '''from articles in the references'''). No Host information should be recorded for [[General Information/What Is an IS?|Insertion Sequences]]. ISfinder is considered the definitive resource for the host range of [[General Information/What Is an IS?|Insertion Sequences]]. | ||
Line 112: | Line 112: | ||
=====Feature Label===== | =====Feature Label===== | ||
− | Gene name (see below) followed by the Associated Element (see below) in parenthesis. For example: ''[https://www.uniprot.org/uniprot/P00392 merA]'' ([ | + | Gene name (see below) followed by the Associated Element (see below) in parenthesis. For example: ''[https://www.uniprot.org/uniprot/P00392 merA]'' ([https://tncentral.ncc.unesp.br/report/te/Tn3-V00613 Tn''3'']). |
=====Feature Color===== | =====Feature Color===== | ||
Line 145: | Line 145: | ||
*'''Associated Element''' = the name of the mobile element that the protein-coding gene is associated with. In most cases, this will be the main mobile element being curated. However, some mobile elements contain other mobile elements inside them. In these cases, some protein-coding genes might be associated with these internal mobile elements. The Feature Label (see above) is composed of the gene name and the contents of the Associated Element field in parenthesis. | *'''Associated Element''' = the name of the mobile element that the protein-coding gene is associated with. In most cases, this will be the main mobile element being curated. However, some mobile elements contain other mobile elements inside them. In these cases, some protein-coding genes might be associated with these internal mobile elements. The Feature Label (see above) is composed of the gene name and the contents of the Associated Element field in parenthesis. | ||
− | *'''Library Name''' = the unique identifier for this sequence in the '''''Custom Feature library'''''. Usually, the '''Library Name''' will be based on the first example of the sequence found, so it will have a form that resembles a Feature Label (e.g., [https://www.uniprot.org/uniprot/P00392 ''merA''] ([ | + | *'''Library Name''' = the unique identifier for this sequence in the '''''Custom Feature library'''''. Usually, the '''Library Name''' will be based on the first example of the sequence found, so it will have a form that resembles a Feature Label (e.g., [https://www.uniprot.org/uniprot/P00392 ''merA''] ([https://tncentral.ncc.unesp.br/report/te/Tn21-AF071413 Tn''21''])), but that format is not mandatory. |
*'''Class''' = Major classification of gene. Possible options: Transposase, Accessory Gene, Integron Integrase, Passenger Gene, Unclassified, Hypothetical. | *'''Class''' = Major classification of gene. Possible options: Transposase, Accessory Gene, Integron Integrase, Passenger Gene, Unclassified, Hypothetical. | ||
Line 179: | Line 179: | ||
<br /> | <br /> | ||
− | *'''Label:''' name of gene followed by associated element in parentheses, e.g., [https://card.mcmaster.ca/ontology/39790 lsaE] ([ | + | *'''Label:''' name of gene followed by associated element in parentheses, e.g., [https://card.mcmaster.ca/ontology/39790 lsaE] ([https://tncentral.ncc.unesp.br/report/te/TnSsu5-KP998101.1 Tn''Ssu5'']). '''Do NOT''' include the ARO ID in this field. |
*'''/gene:''' name of gene followed by ARO accession, e.g., [https://card.mcmaster.ca/ontology/39790 lsaE] ([https://card.mcmaster.ca/ontology/39790 ARO:3003206]). Copy from Gene column of TnCentral ABR annotation file. | *'''/gene:''' name of gene followed by ARO accession, e.g., [https://card.mcmaster.ca/ontology/39790 lsaE] ([https://card.mcmaster.ca/ontology/39790 ARO:3003206]). Copy from Gene column of TnCentral ABR annotation file. | ||
Line 227: | Line 227: | ||
This category is for ORFs that have no available annotation (i.e., all good BLAST hits are to hypothetical or uncharacterized proteins, no conserved domains or family memberships): | This category is for ORFs that have no available annotation (i.e., all good BLAST hits are to hypothetical or uncharacterized proteins, no conserved domains or family memberships): | ||
− | *Color: | + | *'''Color''': [[File:Roxo.png|frameless|25x25px]] |
− | *If the ORF is < 50 amino acids and/or it overlaps with other better characterized ORFs, do not annotate it or display it in the map. Label, /gene, /product: Use the protein_id from the GenBank record (first choice), the locus_tag from the GenBank record (second choice) or the ID of the best scoring significant BLAST hit (third choice). | + | *If the ORF is < 50 amino acids and/or it overlaps with other better characterized ORFs, do not annotate it or display it in the map. |
+ | *'''Label,''' '''/gene''', '''/product''': Use the protein_id from the GenBank record (first choice), the locus_tag from the GenBank record (second choice) or the ID of the best scoring significant BLAST hit (third choice). | ||
− | *Class = Passenger Gene | + | *'''Class''' = Passenger Gene |
− | |||
− | |||
+ | *'''Subclass''' = Hypothetical | ||
====Annotation of CDS in Mobile Element Sequence Variants==== | ====Annotation of CDS in Mobile Element Sequence Variants==== | ||
− | Sometimes mobile elements are encountered that are minor sequence variants of elements that are already in the TnCentral or ISfinder database (>95% sequence identity, see section on Mobile Element Sequence Variants above). In these cases, the variant is not given a new name (i.e., the name is the same as the sequence variant). This will lead to the CDS and other features of the variant having the same names as the features of reference element even though they might not be identical in sequence to the reference features. To capture these differences, note the % identity to the reference sequence at the nucleotide level in the | + | Sometimes mobile elements are encountered that are minor sequence variants of elements that are already in the TnCentral or ISfinder database (>95% sequence identity, see section on ''Mobile Element Sequence Variants'' above). In these cases, the variant is not given a new name (i.e., the name is the same as the sequence variant). This will lead to the CDS and other features of the variant having the same names as the features of reference element even though they might not be identical in sequence to the reference features. To capture these differences, note the % identity to the reference sequence at the nucleotide level in the '''Other Information field'''. For example, if a minor sequence variant of [https://tncentral.ncc.unesp.br/ISfinder/scripts/ficheIS.php?name=IS26 IS''26''] is found in which the DNA sequence of the tnp gene is 99% identical to the reference sequence (in ISfinder), the annotation of the tnp ORF should include '''Other Information''' = 99% identical to reference sequence for tnp ([https://tncentral.ncc.unesp.br/ISfinder/scripts/ficheIS.php?name=IS26 IS''26'']) (ISfinder:X00011). If the protein product is severely altered by the sequence variation (e.g., a frameshift or point mutation leads to introduction of a premature stop codon), the feature can be given a different name (e.g., '''''tnp_p''''') and the nature of the change should be described in the '''Other Information''' field. If no % identity is noted, it is assumed to be 100% identical to the reference (although 100% identity can be noted if desired). When possible, the reference sequence of the feature should be stored in the [https://www.snapgene.com/ SnapGene] library. If the reference sequence is not readily available then a variant sequence can be stored temporarily. |
====Repeat Element Annotation==== | ====Repeat Element Annotation==== | ||
− | This section covers repeat elements such as IRL and IRR. | + | This section covers repeat elements such as '''IRL''' and '''IRR'''. |
<br /> | <br /> | ||
=====Feature Label===== | =====Feature Label===== | ||
− | These features will be named by the name of the repeat element (see below) followed by the mobile element name in parentheses (e.g., IRL (Tn21) or IRR (IS1353)) | + | These features will be named by the name of the repeat element (see below) followed by the mobile element name in parentheses (e.g., '''IRL''' ([https://tncentral.ncc.unesp.br/report/te/Tn21-AF071413 Tn''21'']) or '''IRR''' ([https://tncentral.ncc.unesp.br/ISfinder/scripts/ficheIS.php?name=IS1353 IS''1353''])) |
=====Feature Type (selected from [https://www.snapgene.com/ SnapGene] Pull-Down Menu)===== | =====Feature Type (selected from [https://www.snapgene.com/ SnapGene] Pull-Down Menu)===== | ||
− | repeat_region | + | ''repeat_region'' |
=====Feature Color===== | =====Feature Color===== | ||
− | repeat elements | + | [[File:Gray.png|frameless|32x32px]] repeat elements |
=====Other Annotation (/note field)===== | =====Other Annotation (/note field)===== | ||
− | The /note field can contain the following sub-fields. Sub-fields should be separated with semi-colons. | + | The '''/note''' field can contain the following sub-fields. Sub-fields should be separated with semi-colons. |
− | *Name = name of the repeat element (e.g., IRL or IRR) | + | *'''Name''' = name of the repeat element (e.g., '''IRL''' or '''IRR''') |
− | * | + | *'''Associated Element''' = mobile element with which the repeat is associated |
− | * | + | *'''Library Name''' = the unique identifier for this sequence in the '''''Custom Feature library'''''. Usually, the Library Name will be based on the first example of the sequence found, so it will have a form that resembles a Feature Label (e.g., '''IRL''' ([https://tncentral.ncc.unesp.br/report/te/Tn21-AF071413 Tn''21''])), but that format is not mandatory. |
− | * | + | *'''Other Information''' = miscellaneous information |
− | *Capture = Include Capture = yes for all features to be included in the database. | + | *'''Capture''' = Include Capture = yes for all features to be included in the database. |
<br /> | <br /> | ||
Line 270: | Line 270: | ||
=====Annotation of Repeat Elements in Mobile Element Sequence Variants===== | =====Annotation of Repeat Elements in Mobile Element Sequence Variants===== | ||
Annotation of repeat elements in mobile element sequence variants should be handled similarly to the annotation of CDS in sequence variants (see above). The percent identity to the reference sequence should be noted in the Other Information field. | Annotation of repeat elements in mobile element sequence variants should be handled similarly to the annotation of CDS in sequence variants (see above). The percent identity to the reference sequence should be noted in the Other Information field. | ||
− | |||
====Recombination Site Annotation==== | ====Recombination Site Annotation==== | ||
− | This section covers attC sites in Integrons and Tn3 family recombination sites (Res). Res sites also contain subsites with agreed names. Integron attC sites are disrupted by insertion of a gene cassette.Therefore, they should be annotated according to the procedures for split features (annotate the 5’-end and 3’-end as well as a split feature for display on the map). | + | This section covers ''attC'' sites in Integrons and [https://tncentral.ncc.unesp.br/report/te/Tn3-V00613 Tn''3'' family recombination sites] (Res). Res sites also contain subsites with agreed names. Integron ''attC'' sites are disrupted by insertion of a gene cassette.Therefore, they should be annotated according to the procedures for split features (annotate the 5’-end and 3’-end as well as a split feature for display on the map). |
=====Feature Label===== | =====Feature Label===== | ||
− | The label will be the name of the site (see below) followed by the mobile element name in parentheses (e.g., res_site_III (Tn21)) | + | The label will be the name of the site (see below) followed by the mobile element name in parentheses (e.g., res_site_III (''[https://tncentral.ncc.unesp.br/report/te/Tn21-AF071413 Tn21]'')) |
=====Feature Type (selected from SnapGene Pull-Down Menu)===== | =====Feature Type (selected from SnapGene Pull-Down Menu)===== | ||
− | misc_feature | + | ''misc_feature'' |
=====Feature Color===== | =====Feature Color===== | ||
− | recombination site | + | [[File:Green.png|frameless|32x32px]] recombination site |
=====Other Annotation (/note field)===== | =====Other Annotation (/note field)===== | ||
− | The /note field can contain the following sub-fields. Sub-fields should be separated with semi-colons. | + | The '''/note''' field can contain the following sub-fields. Sub-fields should be separated with semi-colons. |
− | |||
− | |||
− | * | + | *'''Name''' = name of the recombination site (e.g., res_site_I). Integron ''attC'' sites should be named “''attC''-'''<name of downstream gene>'''” (e.g., attC-''aadA'') |
− | * | + | *'''Associated Element''' = mobile element with which the recombination sites is associated |
− | * | + | *'''Library Name''' = the unique identifier for this sequence in the Custom Feature library. Usually, the LibraryName will be based on the first example of the sequence found, so it will have a form that resembles a Feature Label (e.g., res_site_I ([https://tncentral.ncc.unesp.br/report/te/Tn21-AF071413 Tn''21''])), but that format is not mandatory. |
− | <br /> | + | *'''Capture''' = Include Capture = yes for all features to be included in the database.<br /> |
=====Annotation of Recombination Sites in Mobile Element Sequence Variants===== | =====Annotation of Recombination Sites in Mobile Element Sequence Variants===== | ||
Annotation of recombination sites in mobile element sequence variants should be handled similarly to the annotation of CDS in sequence variants (see above). The percent identity to the reference sequence | Annotation of recombination sites in mobile element sequence variants should be handled similarly to the annotation of CDS in sequence variants (see above). The percent identity to the reference sequence | ||
− | should be noted in the | + | should be noted in the '''Other Information''' field. <br /> |
− | <br /> | ||
====Field Structure for /note for Different Feature Types==== | ====Field Structure for /note for Different Feature Types==== | ||
− | Below is the field structure for /note for different feature types. It is not required to fill in all fields in all cases. | + | Below is the field structure for '''/note''' for different feature types. It is not required to fill in all fields in all cases. |
=====Mobile Elements===== | =====Mobile Elements===== | ||
Line 312: | Line 308: | ||
AssociatedElement = ; Class = ; Subclass = ; SequenceFamily = ; Target = ; Chemistry = ; LibraryName = ; OtherInformation = ; Capture = yesRepeat Regions and Recombination Sites Name = ; AssociatedElement = ; LibraryName = ; OtherInformation = ; Capture = yes | AssociatedElement = ; Class = ; Subclass = ; SequenceFamily = ; Target = ; Chemistry = ; LibraryName = ; OtherInformation = ; Capture = yesRepeat Regions and Recombination Sites Name = ; AssociatedElement = ; LibraryName = ; OtherInformation = ; Capture = yes | ||
+ | =====Repeat Regions and Recombination Sites===== | ||
+ | Name = ; AssociatedElement = ; LibraryName = ; OtherInformation = ; Capture = yes | ||
+ | |||
+ | <br /> | ||
+ | ==== Download ==== | ||
+ | ''The curation guidelines document is also avaliable in [https://tncentral.ncc.unesp.br/curation_guidelines.pdf PDF file format] (click to download it).'' | ||
<br /> | <br /> |
Latest revision as of 05:07, 1 April 2024
Contents
- 1 TnCentral Curation Guidelines
- 2 General Guidelines
- 3 Mobile Element Annotation
- 4 Annotation of Internal Mobile Elements
- 5 Annotation of Protein Coding Genes
- 6 Further Information About Annotation of Some Specific Types of CDS
- 7 Annotation of CDS in Mobile Element Sequence Variants
- 8 Repeat Element Annotation
- 9 Recombination Site Annotation
- 10 Field Structure for /note for Different Feature Types
- 11 Download
TnCentral Curation Guidelines
For TnCentral, we are interested in curating the following features: Protein-coding genes (e.g., transposase genes, accessory genes, passenger genes), Mobile Elements (e.g., transposons, insertion sequences, integrons), Repeat Elements (e.g., IRL, IRR), and Recombination Sites (e.g, Res and attC). These features will all be documented in the enhanced Genbank files according to the guidelines below.
General Guidelines
- In some cases, the Genbank files have additional features that we are not interested in capturing for TnCentral. Therefore, for each feature which we do want to extract, we include a field in /Note Capture = yes.
- Multiple items within any annotation field should be separated by “||”
- Label (#1 in Figure 1) refers to the text entered in the Feature text box in the SnapGene feature editing interface. Name (#2 in Figure 1) is a sub-field within the /note field for some feature types. For a given feature, Name may or may not be the same as Label.
- Annotation of split features: If a feature is interrupted by insertion of another sequence, its 5’ and 3’ ends should be annotated separately. For example, if the merA gene is interrupted, two features—merA 5’-end and merA 3’-end—should be created. If the feature is an ORF, “N/A” should be entered in the /product field. The 5’-end and 3’-end features should be captured (i.e., fill in Capture = yes in the /Note field) and saved in the SnapGene library. However, they should not be shown on the map. (Uncheck them in the feature list in the SnapGene file.) For the purpose of displaying in the map, a third feature should be created with the coordinates of the 5’ and 3’ ends shown as two disjoint regions (see Split Feature option below). This split feature should not be captured (i.e., do not fill in Capture = yes in the /Note field). Since it is not captured, no annotation needs to be provided other than the Label.
- Feature Coordinates (#3 in Figure 1) are captured automatically by SnapGene when a new feature is defined by selecting a sequence region in the SnapGene sequence or map or when a feature from the SnapGene library is detected using Detect Common Features…. If a feature is interrupted by insertion of another sequence, its coordinates can be defined as two disjoint regions using the Split Feature option (#4 in Figure 1, see example in Figure 2).
- Sometimes, a mobile element may contain other mobile elements inserted within it (“internal” mobile elements). Features within these internal elements should be associated with the internal element. For example, Figure 3 shows that transposon Tn1546.2 contains insertion sequence IS1216E. The tnp gene of IS1216E should be named tnp (IS1216E), not tnp (Tn1546.2).
Download our Snapgene curated Library
Snapgene Curated Library download link (under request).
Mobile Element Annotation
In addition to capturing features of interest within a mobile element, the mobile element will be captured as a feature itself. This section covers annotation fields for mobile element features. Transposons should be oriented with the transposase gene oriented left-to-right. This then defines IRL and IRR.
Mobile Element Sequence Variants
There are often minor sequence variants of the same mobile element. If the sequence of the variant is >95% identical at the DNA level to the reference sequence for that element, it can be considered equivalent to an already annotated element and does not need a new name. For Insertion Sequences, the reference sequence is defined by ISfinder (https://www-is.biotoul.fr/index.php). For other elements, the reference sequence is defined by TnCentral. Variant sequences can be checked for their % identity by BLASTing against ISfinder (for Insertion Sequences) or against TnCentral (for other mobile element sequences). A separate SnapGene file should not be made for the sequence variant. Related transposons that have slightly different complements of passenger genes are sometimes named TnXXX.1, TnXXX.2, etc. but this notation is not used consistently.
Note: Ideally, the SnapGene library will contain the reference sequences for Insertion Sequences (i.e., the sequence in the SnapGene library will exactly match the sequence in ISfinder for that Insertion Sequence. However, this is difficult to implement systematically. Therefore, some Insertion Sequence may be represented temporarily in the SnapGene library with a sequence that doesn’t exactly match the reference. If a closer match the reference sequence in found during subsequent annotations, the library copy should be updated with the more closely matching sequence. However, the TnCentral Accession Number should NOT be changed. In these cases, a updated separate SnapGene file should be made for the Insertion Sequence as well.
Feature Label
The label for mobile element features will simply be the mobile element name (see below).
Feature Type (selected from SnapGene Pull-Down Menu)
mobile_element
Feature Graphic
Select the graphic with no arrowheads (Figure 4).
Feature Color
mobile element
Mobile Element Type and Name (/mobile_element_type field)
- Select type from pull-down menu (e.g., transposon, insertion sequence, integron).
- Enter the name of the mobile element (e.g., Tn21) in the text box.
- If one mobile element is interrupted by insertion of another mobile element, it will sometimes be named element1::element2. (e.g., IS1326::IS1353; this is read as “IS1326 interrupted by IS1353”). However, this convention is not always followed.
- Integrons should be named In_<parent TE name>, e.g., In_Tn7 for an integron found in Tn7 unless they are given a specific name in a paper about the Tn (e.g., In2 in Tn21, In4 in Tn1696).
Other Annotation (/note field)
The /note field can contain the following sub-fields. Sub-fields should be separated with semi-colons.
- Accession = Accession Number of the enhanced GenBank file for this mobile element. It is generally of the form Mobile Element Name-Original GenBank Accession, where the Original GenBank Accession is the GenBank Accession of the large DNA sequence (e.g., plasmid or chromosome) from which the mobile element was isolated (e.g., Tn21- AF071413).
- Family = Mobile element family (e.g., Family = Tn3). For Insertion Sequences (IS), this information can often be found in ISfinder.
- Group = Mobile element group (e.g., Group = Tn21). For Insertion Sequences, this can often be found in ISfinder.
- Synonyms = Other names for the mobile element.
- Partial = Enter “yes” if only a partial copy of the transposon is present.
- Transposition = Enter “yes”, “no”, “ND” (not determined) depending on whether the transposition ability of the transposon has been observed. The presence of 5bp direct repeats just outside of the transposon sequence can be used as evidence of transposition. If the original GenBank file covers a larger region than just the transposon itself (e.g., a full plasmid sequence), then the presence of direct repeats can be checked.
- Other Information = Miscellaneous information. For Insertion Sequences, indicate the ISfinder accession number. For minor sequence variants, indicate the percent identity to the reference sequence and include the accession number of the reference sequence (e.g., Other Information = 99% identical to reference sequence for IS26 (X00011). If no % identity is noted, it is assumed to be 100% identical to reference (although 100% identity can be noted if desired.) Other miscellaneous information can also be included here.
- Hosts: the Hosts sub-field has several sub-sub-fields; these are separated from each other by the “|” symbol (see Examples to Copy and Paste). Organism and Molecular Source information should be taken from the Description Panel of the SnapGene file. Epidemiological information (e.g., Region, Country, Other LocInfo, Date Identified) can be taken from articles in the references). No Host information should be recorded for Insertion Sequences. ISfinder is considered the definitive resource for the host range of Insertion Sequences.
§ Organism = Genus, species, and strain of the organism in which the element was identified;
§ Taxonomy = NCBI taxon ID of the organisms in which the element was identified;
§ BacGroup = Group (informal, not taxonomic) that the host bacterium belongs to (e.g., enterobacteria);
§ Molecular Source = Plasmid or chromosome on which the element was found;
§ Region = Geographic region where the element was identified;
§ Country = Country where the element was identified;
§ OtherLocInfo = Other information about the location where identified;
§ Date Identified = Date the mobile element was identified;
§ First = Enter “yes” if this is the first host in which the transposon was discovered;
- Capture = Include Capture = yes for all features to be included in the database
Annotation of Internal Mobile Elements
Sometimes, a mobile element (“main” mobile element) may contain other mobile elements inserted within it (“internal” mobile elements). If an internal mobile element is already in the SnapGene library and is fully annotated, no further changes are needed. If the internal mobile element has not been previously annotated (i.e., does not have a gold star in the SnapGene library), then it should be annotated according to the guidelines for annotating mobile elements above. The host information should be the same as for the main mobile element. After annotation, regions spanned by internal mobile elements should be copied to their own SnapGene files (Figure 5). The Description Panel information of the main mobile element should be copies to the new file. Regardless of their orientation in the original transposon, in the separate files, the elements should be oriented in the conventional way with the transposase gene oriented left to right, the IRL on the left and the IRR on the right. The Flip Sequence option in the View menu of SnapGene and be used to flip the orientation of the element.
Annotation of Protein Coding Genes
This section covers annotation of genes that code for proteins, including transposases genes, accessory genes (e.g., resolvases), and passenger genes (e.g., antibiotic resistance, heavy metal resistance, plant pathogen, and toxin/anti-toxin genes).
Feature Label
Gene name (see below) followed by the Associated Element (see below) in parenthesis. For example: merA (Tn3).
Feature Color
Feature Type (selected from SnapGene Pull-Down Menu)
CDS
Gene Name (/gene field)
- Use official gene symbol, when possible. By convention, gene name start with a lowercase letter (e.g., merA)
- If a gene is interrupted by insertion of another sequence, the name should be of the form geneName_disrupted (e.g., merA_disrupted).
Protein Name (/product field)
- Usually the protein name will be the same as the gene name, but starting with an uppercase letter (e.g., MerA).
- For disrupted genes, enter N/A in the /product field.
Function (/function field)
- Description of function (e.g., GO terms, UniProt keywords, Antibiotic Resistance Ontology (ARO) terms or free text). Use defined vocabularies/ontologies for annotating function whenever possible
- If terms from defined vocabularies are used, include the term identifier in parentheses after the term (e.g., antibiotic inactivation (ARO:0001004); response to mercury ion (GO:0046689).
Other Annotation (/note field)
The /note field can contain the following sub-fields. Sub-fields should be separated with semi-colons. Not all sub-fields will be relevant for all genes.
- Associated Element = the name of the mobile element that the protein-coding gene is associated with. In most cases, this will be the main mobile element being curated. However, some mobile elements contain other mobile elements inside them. In these cases, some protein-coding genes might be associated with these internal mobile elements. The Feature Label (see above) is composed of the gene name and the contents of the Associated Element field in parenthesis.
- Library Name = the unique identifier for this sequence in the Custom Feature library. Usually, the Library Name will be based on the first example of the sequence found, so it will have a form that resembles a Feature Label (e.g., merA (Tn21)), but that format is not mandatory.
- Class = Major classification of gene. Possible options: Transposase, Accessory Gene, Integron Integrase, Passenger Gene, Unclassified, Hypothetical.
- Subclass = Secondary classification of gene (see examples in section Protein Coding Gene Classification).
- Target = molecular target of the gene (e.g., type of antibiotics targeted by antibiotic resistance genes or metal targeted by heavy metal resistance genes; see examples in section Protein Coding Gene Classification).
- Chemistry = mechanism of action of transposase or resolvase genes (see examples in section Protein Coding Gene Classification).
- Sequence Family = group or family to which gene belongs.
- Other Information = miscellaneous information.
- Capture = Include Capture = yes for all features to be included in the database
Further Information About Annotation of Some Specific Types of CDS
Accessory Genes
The fields that are typically filled in for accessory genes are: Label, /gene, /product, and /note.
The sub- fields within /note that are typically filled in are: Associated Element, Class, Subclass, Sequence Family, Chemistry, Library Name, and Capture. Entries for the /gene, /product, Class, Subclass, Sequence Family and Chemistry fields for accessory genes commonly occurring in TnCentral are shown in the file tncentral_accessory_gene_annotation.txt. Copy/paste from this file into the appropriate fields in SnapGene whenever possible.
Integron Integrases
There are three types of commonly occurring integron integrases (Class 1, Class 2, Class 3). Annotations for the /gene, /product, Class, Subclass, SequenceFamily, Chemistry, and Other Information fields for the three types of integrases are shown in the file tncentral_integrase_annotation.txt. Copy/paste from this file into the appropriate fields in SnapGene whenever possible.
Passenger Genes: Subclass Toxin and Antitoxin Genes
The fields that are typically filled in for toxin and antitoxin genes are: Label, /gene, /product, and /note. The sub-fields within /note that are typically filled in are: Associated Element, Class, Subclass, Sequence Family, Target, Library Name, and Capture. Entries for the /gene, /product, Subclass, Sequence Family and Target fields for toxin and antitoxin genes commonly occurring in TnCentral are shown in the file tncentral_TA_gene_annotation.txt. Copy/paste from this file into the appropriate fields in SnapGene whenever possible.
Passenger Genes: Subclass Antibiotic Resistance (ABR) Genes
ABR genes should be annotated using the Antibiotic Resistance Ontology (ARO) as implemented in the Comprehensive Antibiotic Resistance Database (CARD). BLAST the protein coding sequence of the putative ABR gene using the CARD BLAST tool. If there is a significant hit (percent identity >50% over the full length of the protein, e-value < 10 -10 ), then the ARO information for that hit should be used to annotate the ABR gene. Click on the Name or ARO tag of the hit to go to the CARD page for the gene. Information on most of the ABR genes in ARO, formatted according to TnCentral guidelines can be found in the TnCentral ABR annotation file (e.g., tncentral_abr_gene_annotation.xlsx). Whenever possible, use this file to copy/paste into the appropriate fields in SnapGene.
- Label: name of gene followed by associated element in parentheses, e.g., lsaE (TnSsu5). Do NOT include the ARO ID in this field.
- /gene: name of gene followed by ARO accession, e.g., lsaE (ARO:3003206). Copy from Gene column of TnCentral ABR annotation file.
- /product: name of gene with first letter capitalized, e.g., LsaE. Do NOT include ARO ID in this field. Copy from Product column of TnCentral ABR annotation file.
- /function: Resistance Mechanism followed by the ARO accession for the Resistance Mechanism, e.g., antibiotic target protection (ARO:0001003). Copy from Function column of TnCentral ABR annotation file.
- /note
Associated Element: as usual
Class = Passenger Gene
Subclass = Antibiotic Resistance
Sequence Family: enter the AMR Gene Family term from the CARD page followed by the ARO accession for the term, e.g., ABC-F ATP-binding cassette ribosomal protectionprotein (ARO:3004469). Copy from Sequence Family column of TnCentral ABR annotation file.
Target: enter the Drug Class terms and their ARO accessions (e.g., pleuromutilin antibiotic (ARO:3000670)||lincosamide antibiotic (ARO:0000017)||streptogramin antibiotic (ARO:0000026)|| pristinamycin IA (ARO:3000583)). Copy from Target column of TnCentral ABR annotation file. If any additional antibiotics are identified in the references for the TE, these can be included as well. The ARO accession for an antibiotic can be found by searching for the antibiotic name in CARD.
Chemistry: N/A
Library Name: as usual
Other Information: indicate whether the sequence was a perfect (100%), strict, or loose match to the reference sequence for the gene. The Detection Models section of the CARD gene page lists a bitscore cutoff for a match to be considered strict. If the match is not a perfect match, indicate the bitscore. Also, list synonyms from the Synonym(s) line of the the CARD gene page. For example, strict match to reference sequence for ARO:3000596 (bitscore: 450)||Synonyms: ermCD, ermCX.
Capture: yes.
Passenger Genes: Subclass Other
This category is for CDS that have some annotation, but do not fall into any of the specific TnCentral categories.
- Label, /gene, /product: Possibilities are to use the name provided in the Genbank file, a name from the literature about the mobile element, the name of one of the top significant BLAST hits, or a name based on a conserved domain that the protein contains, whichever seems most appropriate. Try to avoid really generic names like “orfX” or “urf1” (urf stands for “uncharacterized reading frame”.) Note that protein names should be succinct. Also, it is OK if the name describes the function of the protein, but it ideally shouldn't refer only to a structural domain. If the domain has some function associated with it, the information could be used in the name. For example, if the domain is generally associated with acetyltransferases, you could name the protein "putative acetyltransferase". However, in the absence of functional clues, "ABC-domain protein" is still probably a better name than "orfX." Importantly, before settling on the name, check to see if there is already an entry in the library with that same name. If there is, check if the sequences have full-length significant similarity and the same domain structure. If so, fine--use the same name. If the name is already used, but the sequences are unrelated, then you need to choose a different name.
- Class = Passenger Gene
- Subclass = Other
- Sequence Family: enter the name(s) and accession(s) of the conserved domain(s) or famil(ies). (e.g., DUF1010 (Pfam:PF06231)).
- Other Information = enter any additional annotation, if any
Passenger Genes: Subclass Hypothetical
This category is for ORFs that have no available annotation (i.e., all good BLAST hits are to hypothetical or uncharacterized proteins, no conserved domains or family memberships):
- If the ORF is < 50 amino acids and/or it overlaps with other better characterized ORFs, do not annotate it or display it in the map.
- Label, /gene, /product: Use the protein_id from the GenBank record (first choice), the locus_tag from the GenBank record (second choice) or the ID of the best scoring significant BLAST hit (third choice).
- Class = Passenger Gene
- Subclass = Hypothetical
Annotation of CDS in Mobile Element Sequence Variants
Sometimes mobile elements are encountered that are minor sequence variants of elements that are already in the TnCentral or ISfinder database (>95% sequence identity, see section on Mobile Element Sequence Variants above). In these cases, the variant is not given a new name (i.e., the name is the same as the sequence variant). This will lead to the CDS and other features of the variant having the same names as the features of reference element even though they might not be identical in sequence to the reference features. To capture these differences, note the % identity to the reference sequence at the nucleotide level in the Other Information field. For example, if a minor sequence variant of IS26 is found in which the DNA sequence of the tnp gene is 99% identical to the reference sequence (in ISfinder), the annotation of the tnp ORF should include Other Information = 99% identical to reference sequence for tnp (IS26) (ISfinder:X00011). If the protein product is severely altered by the sequence variation (e.g., a frameshift or point mutation leads to introduction of a premature stop codon), the feature can be given a different name (e.g., tnp_p) and the nature of the change should be described in the Other Information field. If no % identity is noted, it is assumed to be 100% identical to the reference (although 100% identity can be noted if desired). When possible, the reference sequence of the feature should be stored in the SnapGene library. If the reference sequence is not readily available then a variant sequence can be stored temporarily.
Repeat Element Annotation
This section covers repeat elements such as IRL and IRR.
Feature Label
These features will be named by the name of the repeat element (see below) followed by the mobile element name in parentheses (e.g., IRL (Tn21) or IRR (IS1353))
Feature Type (selected from SnapGene Pull-Down Menu)
repeat_region
Feature Color
Other Annotation (/note field)
The /note field can contain the following sub-fields. Sub-fields should be separated with semi-colons.
- Name = name of the repeat element (e.g., IRL or IRR)
- Associated Element = mobile element with which the repeat is associated
- Library Name = the unique identifier for this sequence in the Custom Feature library. Usually, the Library Name will be based on the first example of the sequence found, so it will have a form that resembles a Feature Label (e.g., IRL (Tn21)), but that format is not mandatory.
- Other Information = miscellaneous information
- Capture = Include Capture = yes for all features to be included in the database.
Annotation of Repeat Elements in Mobile Element Sequence Variants
Annotation of repeat elements in mobile element sequence variants should be handled similarly to the annotation of CDS in sequence variants (see above). The percent identity to the reference sequence should be noted in the Other Information field.
Recombination Site Annotation
This section covers attC sites in Integrons and Tn3 family recombination sites (Res). Res sites also contain subsites with agreed names. Integron attC sites are disrupted by insertion of a gene cassette.Therefore, they should be annotated according to the procedures for split features (annotate the 5’-end and 3’-end as well as a split feature for display on the map).
Feature Label
The label will be the name of the site (see below) followed by the mobile element name in parentheses (e.g., res_site_III (Tn21))
Feature Type (selected from SnapGene Pull-Down Menu)
misc_feature
Feature Color
Other Annotation (/note field)
The /note field can contain the following sub-fields. Sub-fields should be separated with semi-colons.
- Name = name of the recombination site (e.g., res_site_I). Integron attC sites should be named “attC-<name of downstream gene>” (e.g., attC-aadA)
- Associated Element = mobile element with which the recombination sites is associated
- Library Name = the unique identifier for this sequence in the Custom Feature library. Usually, the LibraryName will be based on the first example of the sequence found, so it will have a form that resembles a Feature Label (e.g., res_site_I (Tn21)), but that format is not mandatory.
- Capture = Include Capture = yes for all features to be included in the database.
Annotation of Recombination Sites in Mobile Element Sequence Variants
Annotation of recombination sites in mobile element sequence variants should be handled similarly to the annotation of CDS in sequence variants (see above). The percent identity to the reference sequence
should be noted in the Other Information field.
Field Structure for /note for Different Feature Types
Below is the field structure for /note for different feature types. It is not required to fill in all fields in all cases.
Mobile Elements
Accession = ; Family = ; Group =; Synonyms =; Partial = ; Transposition = ; OtherInformation = ; Hosts: Organism = |Taxonomy = |BacGroup = |MolecularSource = |Region = |Country = |OtherLocInfo = | |First = ; Capture = yes
CDS
AssociatedElement = ; Class = ; Subclass = ; SequenceFamily = ; Target = ; Chemistry = ; LibraryName = ; OtherInformation = ; Capture = yesRepeat Regions and Recombination Sites Name = ; AssociatedElement = ; LibraryName = ; OtherInformation = ; Capture = yes
Repeat Regions and Recombination Sites
Name = ; AssociatedElement = ; LibraryName = ; OtherInformation = ; Capture = yes
Download
The curation guidelines document is also avaliable in PDF file format (click to download it).