ID Mapping
UniProt provides tab-separated ID mapping files (idmapping.dat) that map UniProt accessions to identifiers in external databases. These files are available for the full database or per organism.
The parse_idmapping function streams these files and yields one IdMapping per line:
from uniprotlib import parse_idmapping
for m in parse_idmapping("idmapping.dat.gz"):
print(m.accession, m.id_type, m.id)
Filtering by database type
Use id_type to extract only the mappings you need. This is the common case — e.g., getting all UniProt-to-NCBI-Gene mappings:
for m in parse_idmapping("idmapping.dat.gz", id_type="GeneID"):
print(m.accession, m.id)
for m in parse_idmapping("idmapping.dat.gz", id_type="RefSeq"):
print(m.accession, m.id)
Non-matching lines are skipped without constructing objects, so filtering is efficient even on multi-GB files.
Common ID types
id_type value |
Database |
|---|---|
GeneID |
NCBI Gene (Entrez Gene) |
RefSeq |
NCBI RefSeq protein |
RefSeq_NT |
NCBI RefSeq nucleotide |
EMBL |
EMBL/GenBank/DDBJ |
EMBL-CDS |
EMBL CDS |
PDB |
Protein Data Bank |
GI |
NCBI GI number |
UniRef100 |
UniRef 100% cluster |
UniRef90 |
UniRef 90% cluster |
UniRef50 |
UniRef 50% cluster |
KEGG |
KEGG |
NCBI_TaxID |
NCBI Taxonomy |
See the UniProt ID mapping README for the full list.
Converting to a dictionary
A common pattern is building a lookup dictionary from the stream:
from collections import defaultdict
from uniprotlib import parse_idmapping
# UniProt accession -> list of Gene IDs
uniprot_to_gene = defaultdict(list)
for m in parse_idmapping("idmapping.dat.gz", id_type="GeneID"):
uniprot_to_gene[m.accession].append(m.id)