
API Reference

Parsers

parse_xml

uniprotlib.parse_xml(*paths)

Stream-parse one or more UniProt XML files, yielding UniProtEntry objects.

Accepts plain XML or gzip-compressed files (compression is detected from the .gz extension). Handles both namespace variants (http:// for single-entry web downloads, https:// for bulk FTP dumps). Multiple files are processed sequentially. Entries are yielded one at a time, so memory use stays bounded regardless of file size.

Parameters:

Name Type Description Default
*paths str | Path

One or more file paths (str or Path) to UniProt XML files.

()

Yields:

Type Description
UniProtEntry

UniProtEntry for each <entry> element in the XML.

Raises:

Type Description
ValueError

If no paths are provided.

Example:

from uniprotlib import parse_xml

for entry in parse_xml("uniprot_sprot.xml.gz"):
    print(entry.primary_accession, entry.organism.scientific_name)

parse_idmapping

uniprotlib.parse_idmapping(*paths, id_type=None)

Stream-parse one or more UniProt idmapping.dat files.

Yields one IdMapping per line (one accession–database–id triple). Accepts plain text or gzip-compressed files (auto-detected from .gz extension).

Parameters:

Name Type Description Default
*paths str | Path

One or more file paths (str or Path) to idmapping.dat files.

()
id_type str | None

If set, only yield rows matching this database type, e.g. "GeneID" or "RefSeq". Rows with other types are skipped.

None

Yields:

Type Description
IdMapping

IdMapping for each (matching) line in the file.

Raises:

Type Description
ValueError

If no paths are provided.

Example:

from uniprotlib import parse_idmapping

for m in parse_idmapping("idmapping.dat.gz", id_type="GeneID"):
    print(m.accession, m.id)
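
For readers unfamiliar with the file format: each line of idmapping.dat holds three tab-separated fields (accession, database type, identifier). The sketch below shows how such a file could be streamed and filtered with the standard library. It is an illustration under that assumption, not the library's code; Row and iter_rows are hypothetical names standing in for IdMapping and parse_idmapping.

```python
import gzip
from pathlib import Path
from typing import Iterator, NamedTuple, Optional


class Row(NamedTuple):
    """Hypothetical stand-in for IdMapping."""
    accession: str
    id_type: str
    id: str


def iter_rows(path, id_type: Optional[str] = None) -> Iterator[Row]:
    """Yield one Row per non-empty line, optionally keeping a single id_type."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue  # tolerate blank lines
            accession, db, ident = line.split("\t", 2)
            if id_type is not None and db != id_type:
                continue  # skip rows for other databases
            yield Row(accession, db, ident)
```

Filtering inside the loop means skipped rows never leave the generator, which keeps downstream code simple when only one database is of interest.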

Models

UniProtEntry

uniprotlib.UniProtEntry dataclass

A single UniProtKB entry parsed from XML.

Attributes:

Name Type Description
primary_accession str

Primary accession, e.g. "Q9Y261".

accessions list[str]

All accessions including primary and secondary.

entry_name str

Mnemonic entry name, e.g. "FOXA2_HUMAN".

dataset str

"Swiss-Prot" or "TrEMBL".

protein_name str | None

Recommended full protein name. None if not annotated.

gene Gene | None

Gene names. None if the entry has no gene annotation.

organism Organism

Source organism with taxonomy.

sequence Sequence

Amino acid sequence with metadata.

keywords list[str]

UniProt keywords, e.g. ["Activator", "Nucleus"].

db_references list[DbReference]

Cross-references to external databases.

protein_existence str | None

Protein existence evidence level. One of "evidence at protein level", "evidence at transcript level", "inferred from homology", "predicted", or "uncertain".
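
Several UniProtEntry attributes may be None (protein_name, gene, protein_existence), so code that consumes entries usually needs a guard. The sketch below shows one way to handle that; describe and reviewed_with_protein_evidence are hypothetical helpers, and they rely only on the attribute names listed above, so any object with those attributes works.

```python
def describe(entry) -> str:
    """Build a tab-separated summary line, tolerating missing annotations."""
    name = entry.protein_name or "(unnamed protein)"
    if entry.gene is not None and entry.gene.primary:
        gene = entry.gene.primary
    else:
        gene = "-"
    return f"{entry.primary_accession}\t{gene}\t{name}"


def reviewed_with_protein_evidence(entries):
    """Yield Swiss-Prot entries whose existence is supported at protein level."""
    for entry in entries:
        if (
            entry.dataset == "Swiss-Prot"
            and entry.protein_existence == "evidence at protein level"
        ):
            yield entry
```

Both helpers accept any iterable of entries, so they could be applied directly to the generator returned by parse_xml.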

Organism

uniprotlib.Organism dataclass

Organism annotation from a UniProt entry.

Attributes:

Name Type Description
scientific_name str | None

Scientific name, e.g. "Homo sapiens".

common_name str | None

Vernacular name, e.g. "Human". None if not annotated.

tax_id str | None

NCBI Taxonomy identifier as a string, e.g. "9606".

lineage list[str]

Taxonomic lineage from root to most specific taxon, e.g. ["Eukaryota", ..., "Homo"].
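
Because tax_id is a string and may be None, and lineage is a plain list of taxon names, two small helpers cover the most common Organism queries. This is a sketch with hypothetical function names; it assumes only the attributes documented above.

```python
from typing import Optional


def tax_id_as_int(organism) -> Optional[int]:
    """Convert the string tax_id to an int, passing None through."""
    return int(organism.tax_id) if organism.tax_id is not None else None


def in_clade(organism, clade: str) -> bool:
    """Return True if the named taxon appears anywhere in the lineage."""
    return clade in organism.lineage
```

A filter such as `in_clade(entry.organism, "Mammalia")` matches on the exact taxon name as it appears in the lineage list, so spelling and capitalisation must agree with the source data.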

Gene

uniprotlib.Gene dataclass

Gene names associated with a UniProt entry.

Attributes:

Name Type Description
primary str | None

Primary gene name, e.g. "FOXA2". None if not annotated.

synonyms list[str]

Alternative gene names, e.g. ["HNF3B", "TCF3B"].

ordered_locus_names list[str]

Systematic locus identifiers, e.g. ["b0001"].

orf_names list[str]

Open reading frame identifiers.
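
Gene names are spread over four attributes, and the whole Gene object may be None on an entry. When searching by name, it can help to flatten them into a single list. The helper below is a hypothetical sketch built only on the attributes listed above.

```python
def all_gene_names(gene) -> list[str]:
    """Return every name for a gene, primary first; empty list if gene is None."""
    if gene is None:
        return []
    names = []
    if gene.primary:
        names.append(gene.primary)
    names.extend(gene.synonyms)
    names.extend(gene.ordered_locus_names)
    names.extend(gene.orf_names)
    return names
```

Keeping the primary name first preserves a sensible display order while still letting a caller test membership with `"HNF3B" in all_gene_names(entry.gene)`.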

Sequence

uniprotlib.Sequence dataclass

Protein amino acid sequence.

Attributes:

Name Type Description
value str

Amino acid string (no whitespace), e.g. "MLGAVKMEG...".

length int

Number of amino acids.

mass int

Molecular mass in daltons (Da).

checksum str

CRC64 checksum of the sequence.
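
A common use of the Sequence attributes is exporting FASTA records. The sketch below wraps value at a fixed width and uses length as a consistency check. to_fasta is a hypothetical helper, and the header contains only the accession passed in by the caller; it does not reproduce UniProt's full FASTA header format.

```python
def to_fasta(accession: str, sequence, width: int = 60) -> str:
    """Format a sequence as a FASTA record with fixed-width lines."""
    if sequence.length != len(sequence.value):
        raise ValueError(
            f"{accession}: declared length {sequence.length} "
            f"does not match {len(sequence.value)} residues"
        )
    # Slice the residue string into chunks of at most `width` characters.
    lines = [
        sequence.value[i:i + width]
        for i in range(0, len(sequence.value), width)
    ]
    return f">{accession}\n" + "\n".join(lines) + "\n"
```

With a parsed entry this might be called as `to_fasta(entry.primary_accession, entry.sequence)`.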

DbReference

uniprotlib.DbReference dataclass

Cross-reference to an external database.

Attributes:

Name Type Description
type str

Database name, e.g. "PDB", "RefSeq", "EMBL".

id str

Identifier in that database, e.g. "7YZE".

molecule str | None

Isoform identifier, e.g. "Q9Y261-1". None if not isoform-specific.

properties dict[str, str]

Additional key-value properties, e.g. {"method": "X-ray", "resolution": "1.99 A"}.
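
Since properties holds plain strings, numeric values such as the resolution need parsing before they can be compared. The following sketch collects PDB cross-references and converts a value like "1.99 A" to a float. Both function names are hypothetical, and the "resolution" key is taken from the example above; references without that key map to None.

```python
from typing import Optional


def parse_resolution(raw: Optional[str]) -> Optional[float]:
    """Turn a string such as '1.99 A' into 1.99; return None if unparseable."""
    if raw is None:
        return None
    try:
        return float(raw.split()[0])
    except (ValueError, IndexError):
        return None


def pdb_resolutions(db_references) -> dict:
    """Map each PDB identifier to its resolution as a float, or None."""
    result = {}
    for ref in db_references:
        if ref.type != "PDB":
            continue
        result[ref.id] = parse_resolution(ref.properties.get("resolution"))
    return result
```

Returning None rather than raising keeps structures without a numeric resolution in the result, so the caller can decide whether to drop them.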

IdMapping

uniprotlib.IdMapping dataclass

Single row from a UniProt idmapping.dat file.

Each row maps a UniProt accession to one identifier in an external database.

Attributes:

Name Type Description
accession str

UniProtKB accession, e.g. "Q6GZX4".

id_type str

Database name, e.g. "GeneID", "RefSeq", "EMBL".

id str

Identifier in that database, e.g. "YP_031579.1".
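
Because each IdMapping is a single accession-to-identifier row, an accession with several identifiers appears on several rows. Grouping rows by accession is therefore a frequent first step. The helper below is a hypothetical sketch that relies only on the accession, id_type, and id attributes documented above.

```python
from collections import defaultdict
from typing import Optional


def group_ids(mappings, id_type: Optional[str] = None) -> dict:
    """Collect identifiers per accession, optionally for a single id_type."""
    grouped = defaultdict(list)
    for m in mappings:
        if id_type is not None and m.id_type != id_type:
            continue
        grouped[m.accession].append(m.id)
    return dict(grouped)  # plain dict so missing keys raise KeyError
```

Note that this builds the full dictionary in memory, so for a complete idmapping.dat file it is advisable to narrow the input first, for example by passing id_type to parse_idmapping.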