[Biopython-dev] NCBI adoption of AGP v2.0 and new qualifiers in GenBank/EMBL

Peter Cock p.j.a.cock at googlemail.com
Fri Jan 20 10:46:18 UTC 2012

Dear all,

I just spotted this via the @NCBI twitter feed,

In addition to the NCBI switch from AGP v1.1 to v2.0, the INSDC have
recently added a new feature type called "assembly_gap", and the
associated qualifiers "gap_type" and "linkage_evidence" to the INSDC
Feature Table Definitions.

Quoting from version 10.0, dated Dec 2011:
> Feature Key           assembly_gap
> Definition            gap between two components of a CON record that is
>                       part of a genome assembly;
> Mandatory qualifiers  /estimated_length=unknown or <integer>
>                       /gap_type="TYPE"
>                       /linkage_evidence="TYPE" (Note: Mandatory only if the
>                       /gap_type is "within scaffold" or "repeat within
>                       scaffold". If there are multiple types of linkage_evidence
>                       they will appear as multiple /linkage_evidence="TYPE"
>                       qualifiers. For all other types of assembly_gap
>                       features, use of the /linkage_evidence qualifier is
>                       invalid.)
> Comment               the location span of the assembly_gap feature for an
>                       unknown gap is 100 bp, with the 100 bp indicated as
>                       100 "n"'s in sequence.

i.e. DDBJ, ENA & GenBank flat-files will start to use the "assembly_gap"
features to display information derived from version 2.0 AGP files from
10th Feb 2012. Probably this will affect the XML variants as well.
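
For anyone wanting to sanity check the new qualifiers, the rule in the
quoted definition could be sketched roughly as below. This is only my
illustration, not Biopython code: the function name check_assembly_gap
is hypothetical, and I am assuming qualifiers arrive as a dict mapping
each qualifier name to a list of string values. The gap_type and
linkage_evidence values in the usage lines are just examples.

```python
import re

# Gap types for which /linkage_evidence is mandatory, per the quoted text.
LINKED_GAP_TYPES = {"within scaffold", "repeat within scaffold"}


def check_assembly_gap(qualifiers):
    """Return a list of problems with an assembly_gap feature's qualifiers.

    `qualifiers` maps qualifier name -> list of string values
    (an assumed layout, chosen for illustration).
    """
    problems = []

    # /estimated_length is mandatory: either "unknown" or an integer.
    lengths = qualifiers.get("estimated_length", [])
    if not lengths:
        problems.append("missing /estimated_length")
    elif any(v != "unknown" and not re.fullmatch(r"[0-9]+", v) for v in lengths):
        problems.append("/estimated_length must be 'unknown' or an integer")

    # /gap_type is mandatory; without it the linkage rule cannot be checked.
    gap_types = qualifiers.get("gap_type", [])
    if not gap_types:
        problems.append("missing /gap_type")
        return problems

    # /linkage_evidence is mandatory for the two scaffold gap types
    # and invalid for every other gap type.
    evidence = qualifiers.get("linkage_evidence", [])
    if gap_types[0] in LINKED_GAP_TYPES:
        if not evidence:
            problems.append("/linkage_evidence is mandatory for this /gap_type")
    elif evidence:
        problems.append("/linkage_evidence is invalid for this /gap_type")
    return problems


ok = check_assembly_gap({
    "estimated_length": ["unknown"],
    "gap_type": ["within scaffold"],
    "linkage_evidence": ["paired-ends"],
})
bad = check_assembly_gap({
    "estimated_length": ["100"],
    "gap_type": ["centromere"],
    "linkage_evidence": ["paired-ends"],
})
```

Here `ok` should come back empty, while `bad` should report the
/linkage_evidence qualifier as invalid for that gap type.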

Unless any of the parsers/writers for GenBank or EMBL flat files use a white
list approach, the new feature key and qualifiers shouldn't cause a problem.
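
To make the white list concern concrete, here is a toy sketch of the
two behaviours. It is not how any of the Bio* parsers actually work:
the function feature_keys, the KNOWN_KEYS set and the sample record
lines are all made up for illustration. The only format detail relied
on is that a GenBank feature key line has five leading spaces before
the key, while qualifier lines are indented further.

```python
# Feature keys a hypothetical strict parser was written to accept
# before "assembly_gap" existed (illustrative, far from complete).
KNOWN_KEYS = {"source", "gene", "CDS", "gap"}

# Made-up fragment of a GenBank-style feature table.
SAMPLE = [
    "     source          1..300",
    '                     /organism="Example organism"',
    "     assembly_gap    101..200",
    "                     /estimated_length=unknown",
    '                     /gap_type="within scaffold"',
    '                     /linkage_evidence="paired-ends"',
]


def feature_keys(lines, whitelist=None):
    """Return the feature keys found in GenBank-style feature lines.

    If `whitelist` is given, any key outside it raises ValueError,
    which mimics a parser that only accepts keys it already knows.
    """
    keys = []
    for line in lines:
        # Key lines: exactly five spaces, then a non-space character.
        if line.startswith("     ") and line[5:6].strip():
            key = line[5:].split()[0]
            if whitelist is not None and key not in whitelist:
                raise ValueError("unknown feature key: %s" % key)
            keys.append(key)
    return keys


# Permissive parsing just passes the new key through.
permissive = feature_keys(SAMPLE)

# White list parsing rejects it until the list is updated.
try:
    feature_keys(SAMPLE, whitelist=KNOWN_KEYS)
    strict_error = None
except ValueError as err:
    strict_error = str(err)
```

A parser of the second kind would need "assembly_gap" (and the two new
qualifier names, if those are checked too) added to its list before
the updated flat files appear.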

