ID Mapping

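UniProt provides tab-separated ID mapping files (idmapping.dat) that map UniProt accessions to identifiers in external databases. Each line holds three columns: the UniProt accession, the ID type (the external database), and the identifier in that database. These files are available for the full database or per organism.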

The parse_idmapping function streams these files and yields one IdMapping per line:

from uniprotlib import parse_idmapping

for m in parse_idmapping("idmapping.dat.gz"):
    print(m.accession, m.id_type, m.id)
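
To make the three fields concrete, the sketch below parses a few lines in the idmapping.dat layout (accession, ID type, external ID, separated by tabs). It is a stand-in written for illustration, not uniprotlib's implementation: the Mapping tuple and the parse_lines helper are hypothetical names, and the sample rows are illustrative only.

```python
from typing import Iterable, Iterator, NamedTuple


class Mapping(NamedTuple):
    """Hypothetical stand-in for the IdMapping record (not uniprotlib's class)."""
    accession: str
    id_type: str
    id: str


def parse_lines(lines: Iterable[str]) -> Iterator[Mapping]:
    """Yield one Mapping per non-empty, tab-separated line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        # Split into at most three fields: accession, ID type, external ID.
        accession, id_type, ext_id = line.split("\t", 2)
        yield Mapping(accession, id_type, ext_id)


# Illustrative rows in the idmapping.dat layout.
sample = [
    "P04637\tGeneID\t7157\n",
    "P04637\tRefSeq\tNP_000537.3\n",
    "P04637\tNCBI_TaxID\t9606\n",
]

records = list(parse_lines(sample))
for m in records:
    print(m.accession, m.id_type, m.id)
```

Note that one accession appears on several lines, once per external identifier.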

Filtering by database type

Use id_type to extract only the mappings you need. This is the common case, for example getting all UniProt-to-NCBI-Gene or UniProt-to-RefSeq mappings:

for m in parse_idmapping("idmapping.dat.gz", id_type="GeneID"):
    print(m.accession, m.id)

for m in parse_idmapping("idmapping.dat.gz", id_type="RefSeq"):
    print(m.accession, m.id)

Non-matching lines are skipped without constructing objects, so filtering is efficient even on multi-GB files.
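
One way such a filter can stay cheap is to compare the ID type column before building any record. The sketch below illustrates that idea with the standard library only. It is an assumption about the general technique, not a copy of uniprotlib's internals, and iter_mappings is a hypothetical name.

```python
import gzip
from typing import Iterator, Optional, Tuple


def iter_mappings(
    path: str, id_type: Optional[str] = None
) -> Iterator[Tuple[str, str, str]]:
    """Stream (accession, id_type, id) tuples from a gzipped mapping file."""
    # Text mode decompresses lazily, so the file is never fully loaded.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) != 3:
                continue  # skip blank or malformed lines
            # Check the type column first: a rejected line costs one split
            # and one string comparison, and nothing is yielded for it.
            if id_type is not None and fields[1] != id_type:
                continue
            yield fields[0], fields[1], fields[2]
```

Because the function is a generator, memory use stays flat regardless of file size. Only the lines that pass the filter reach the caller.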

Common ID types

id_type value    Database
-------------    --------
GeneID           NCBI Gene (Entrez Gene)
RefSeq           NCBI RefSeq protein
RefSeq_NT        NCBI RefSeq nucleotide
EMBL             EMBL/GenBank/DDBJ
EMBL-CDS         EMBL CDS
PDB              Protein Data Bank
GI               NCBI GI number
UniRef100        UniRef 100% cluster
UniRef90         UniRef 90% cluster
UniRef50         UniRef 50% cluster
KEGG             KEGG
NCBI_TaxID       NCBI Taxonomy

See the UniProt ID mapping README for the full list.

Converting to a dictionary

A common pattern is building a lookup dictionary from the stream. One accession can map to several identifiers of the same type, so the values are lists:

from collections import defaultdict
from uniprotlib import parse_idmapping

# UniProt accession -> list of Gene IDs
uniprot_to_gene = defaultdict(list)
for m in parse_idmapping("idmapping.dat.gz", id_type="GeneID"):
    uniprot_to_gene[m.accession].append(m.id)
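
The same pattern works in the opposite direction, from external ID back to UniProt accessions. The sketch below is a minimal illustration that takes plain (accession, id) pairs so it runs without any mapping file. The invert helper is a hypothetical name, and the sample pairs are placeholders rather than real mappings.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def invert(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Build external ID -> list of UniProt accessions."""
    lookup: Dict[str, List[str]] = defaultdict(list)
    for accession, ext_id in pairs:
        lookup[ext_id].append(accession)
    # Return a plain dict so a missing key raises KeyError
    # instead of silently creating an empty list.
    return dict(lookup)


# Placeholder (accession, Gene ID) pairs for illustration.
pairs = [("P04637", "7157"), ("A0A000", "7157"), ("Q00001", "1")]
gene_to_uniprot = invert(pairs)
print(gene_to_uniprot["7157"])
```

With uniprotlib, the pairs could presumably be produced by a generator expression such as ((m.accession, m.id) for m in parse_idmapping("idmapping.dat.gz", id_type="GeneID")), which uses only the attributes shown above.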