[Biojava-l] Genbank file parser error
Richard Holland
holland at eaglegenomics.com
Thu Jan 29 07:25:10 UTC 2009
Gabrielle Doan posted a solution to this a while back and I believe the
changes have been committed already:
http://www.mail-archive.com/biojava-l@lists.open-bio.org/msg01036.html
How old is the copy of BioJava that you're using? Have you tried
checking out the trunk from Subversion to see if that works?
cheers,
Richard
Mark Schreiber wrote:
> I assume that the downloaded file has the complete sequence in it? Probably
> worth checking that it has the complete sequence block (all 116366104 bp).
>
> - Mark
>
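One way to carry out the check Mark suggests, without involving BioJava, is to count the residues in each record's ORIGIN block directly. The following is a minimal sketch, not part of the original thread: the class name `OriginLengthCheck` and the method `countOriginBases` are hypothetical, and it assumes a standard GenBank flat file in which each record starts with a `LOCUS` line, sequence lines follow `ORIGIN`, and the record ends with `//`. Records that have no ORIGIN block are not reported.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class OriginLengthCheck {

    /**
     * Counts the residues in each record's ORIGIN block and returns the
     * totals keyed by the record name taken from the LOCUS line.
     */
    public static Map<String, Long> countOriginBases(BufferedReader in) throws IOException {
        Map<String, Long> counts = new LinkedHashMap<>();
        String locus = null;
        boolean inOrigin = false;
        long bases = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("LOCUS")) {
                // Second whitespace-separated field of the LOCUS line is the record name.
                String[] fields = line.trim().split("\\s+");
                locus = fields.length > 1 ? fields[1] : "(unnamed)";
            } else if (line.startsWith("ORIGIN")) {
                inOrigin = true;
                bases = 0;
            } else if (line.startsWith("//")) {
                // End of record: store the total if a sequence block was seen.
                if (inOrigin) {
                    counts.put(locus, bases);
                }
                inOrigin = false;
                locus = null;
            } else if (inOrigin) {
                // Sequence lines mix position numbers, spaces and residues;
                // only letters are counted.
                for (int i = 0; i < line.length(); i++) {
                    if (Character.isLetter(line.charAt(i))) {
                        bases++;
                    }
                }
            }
        }
        // A file cut off mid-record has no closing "//"; report what was read.
        if (inOrigin) {
            counts.put(locus, bases);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Default path matches the one used in the quoted code; pass another as args[0].
        String path = args.length > 0 ? args[0] : "/tmp/mm_alt_chr2.gbk";
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            for (Map.Entry<String, Long> e : countOriginBases(in).entrySet()) {
                System.out.println(e.getKey() + "\t" + e.getValue() + " bp");
            }
        }
    }
}
```

If the total reported for NT_039207 matches the 116366104 bp stated in the record, the download is complete and the truncation happens in the parser; a smaller total points to an incomplete file.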
> On Thu, Jan 29, 2009 at 12:51 PM, gang wu <gwu at molbio.mgh.harvard.edu> wrote:
>
>> Hi Everyone,
>>
>> I have a piece of code that parses a Genbank file and retrieves gene sequences and
>> related information. It works well with sequences from Arabidopsis
>> thaliana, C. elegans, and Bos taurus, but it fails on Mus musculus chromosome
>> 2. The contig that the code fails on is the largest one in my tests: contig
>> NT_039207 has 116366104 bp, but the code reports it as cut to 100000020 bp.
>> As a result, some gene coordinates fall out of range. The code is included below. Can
>> anyone offer a suggestion?
>>
>> The Mus musculus Genbank file can be downloaded at :
>> ftp://ftp.ncbi.nih.gov/genomes/M_musculus/CHR_02/mm_alt_chr2.gbk.gz
>>
>> Thanks in advance
>>
>> Gang
>> ==========================================
>> import java.io.BufferedReader;
>> import java.io.FileNotFoundException;
>> import java.io.FileReader;
>> import java.util.Iterator;
>> import java.util.NoSuchElementException;
>>
>> import org.biojava.bio.Annotation;
>> import org.biojava.bio.BioException;
>> import org.biojavax.Namespace;
>> import org.biojavax.RichObjectFactory;
>> import org.biojavax.bio.seq.RichFeature;
>> import org.biojavax.bio.seq.RichSequence;
>> import org.biojavax.bio.seq.RichSequenceIterator;
>>
>> public class TestMus {
>>     public void testMusChr2() throws FileNotFoundException,
>>             NoSuchElementException, BioException {
>>         String fp = "/tmp/mm_alt_chr2.gbk";
>>         System.out.println("File: " + fp);
>>         BufferedReader gReader = new BufferedReader(new FileReader(fp));
>>         Namespace ns = RichObjectFactory.getDefaultNamespace();
>>         RichSequenceIterator seqI =
>>                 RichSequence.IOTools.readGenbankDNA(gReader, ns);
>>         while (seqI.hasNext()) {
>>             RichSequence seq = seqI.nextRichSequence();
>>             // Print the record-level metadata, including the parsed length.
>>             System.out.println("Organism: " + seq.getTaxon().getDisplayName()
>>                     + "\nAccession: " + seq.getAccession()
>>                     + "\nIdentifier: " + seq.getIdentifier()
>>                     + "\nTaxonID: " + seq.getTaxon().getNCBITaxID()
>>                     + "\nDivision: " + seq.getDivision()
>>                     + "\nSeqVersion: " + seq.getSeqVersion()
>>                     + "\nLength: " + seq.length());
>>             System.out.println("2041-2101: " + seq.subStr(2041, 2101));
>>             for (Iterator i = seq.features(); i.hasNext();) {
>>                 RichFeature f = (RichFeature) i.next();
>>                 if (!"gene".equalsIgnoreCase(f.getType())) {
>>                     continue; // only gene features are reported
>>                 }
>>                 int rank = f.getRank();
>>                 int startPos = f.getLocation().getMin();
>>                 int endPos = f.getLocation().getMax();
>>                 int geneLen = endPos - startPos + 1;
>>                 // Fails when the coordinates exceed the parsed sequence length.
>>                 String sequence = seq.subStr(startPos, endPos);
>>                 String strand = String.valueOf(f.getStrand().getToken());
>>                 Annotation ann = f.getAnnotation();
>>
>>                 // The "gene" qualifier doubles as the alternative identifier.
>>                 String alternativeIdentifiers = "";
>>                 if (ann.containsProperty("gene")) {
>>                     alternativeIdentifiers = String.valueOf(ann.getProperty("gene"));
>>                 }
>>                 // Prefer locus_tag as the identifier; fall back to "gene".
>>                 String geneIdentifier = alternativeIdentifiers;
>>                 if (ann.containsProperty("locus_tag")) {
>>                     geneIdentifier = String.valueOf(ann.getProperty("locus_tag"));
>>                 }
>>
>>                 System.out.println(rank + "\t" + geneIdentifier + "\t"
>>                         + alternativeIdentifiers + "\t"
>>                         + startPos + "\t" + endPos + "\t" + geneLen
>>                         + "\t" + strand);
>>             }
>>         }
>>     }
>>
>>     public static void main(String[] args) throws Exception {
>>         TestMus tm = new TestMus();
>>         tm.testMusChr2();
>>     }
>> }
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/