cool_seq_tool.handlers.seqrepo_access

Wrap SeqRepo to provide additional lookup and identification methods on top of basic dereferencing functions.

class cool_seq_tool.handlers.seqrepo_access.SeqRepoAccess(sr)

Wrapper around the base SeqRepoDataProxy class from VRS-Python that adds lookup and identification methods.

ac_to_chromosome(ac)

Get chromosome for accession.

Parameters:

ac (str) – Accession

Return type:

Tuple[Optional[str], Optional[str]]

Returns:

Chromosome, warning
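
As with the other lookup methods on this page, the return value is a (value, warning) pair. The following is a minimal sketch of one way to consume such a pair. Both `unwrap` and `fake_ac_to_chromosome` are hypothetical names that are not part of cool_seq_tool, and the stub uses toy data so the snippet runs without a local SeqRepo.

```python
from typing import Optional, Tuple, TypeVar

T = TypeVar("T")


def unwrap(result: Tuple[Optional[T], Optional[str]]) -> T:
    """Return the value from a (value, warning) pair, or raise if it is missing."""
    value, warning = result
    if value is None:
        raise ValueError(warning or "no value returned")
    return value


def fake_ac_to_chromosome(ac: str) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical stand-in for SeqRepoAccess.ac_to_chromosome (toy data only)."""
    toy_lookup = {"toy_accession": "7"}
    if ac in toy_lookup:
        return toy_lookup[ac], None
    return None, f"Unable to get chromosome for {ac}"


chromosome = unwrap(fake_ac_to_chromosome("toy_accession"))
```

Converting the warning into an exception is a design choice for callers that prefer to fail fast. Code that wants to collect warnings can inspect the second element of the pair directly instead.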

chromosome_to_acs(chromosome)

Get accessions for a chromosome.

Parameters:

chromosome (str) – Chromosome. Must be one of 1-22, X, or Y

Return type:

Tuple[Optional[List[str]], Optional[str]]

Returns:

Accessions for chromosome (ordered by latest assembly), warning
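
The accepted chromosome values (1-22, X, or Y) can be checked before calling the method. The helper below is a hypothetical sketch and not part of cool_seq_tool. It is deliberately strict because this page does not say whether lowercase values or a "chr" prefix are accepted.

```python
# Assumption: valid inputs are exactly the strings "1".."22", "X", and "Y".
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}


def is_valid_chromosome(chromosome: str) -> bool:
    """Check that a value is one of 1-22, X, or Y (case-sensitive)."""
    return chromosome in VALID_CHROMOSOMES
```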

get_fasta_file(sequence_id, outfile_path)

Retrieve FASTA file containing sequence for requested sequence ID.

>>> from pathlib import Path
>>> from cool_seq_tool.handlers import SeqRepoAccess
>>> from biocommons.seqrepo import SeqRepo
>>> sr = SeqRepoAccess(SeqRepo("/usr/local/share/seqrepo/latest"))
>>> # write to local file tpm3.fasta:
>>> sr.get_fasta_file("NM_002529.3", Path("tpm3.fasta"))

FASTA file headers will include GA4GH sequence digest, Ensembl accession ID, and RefSeq accession ID.

Parameters:
  • sequence_id (str) – accession ID without a namespace prefix, e.g. NM_152263.3

  • outfile_path (Path) – path to save file to

Return type:

None

Returns:

None, but saves sequence data to outfile_path if successful

Raises:

KeyError if SeqRepo doesn’t have sequence data for the given ID
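
Because a missing sequence surfaces as a KeyError, callers may want to guard the call. The sketch below assumes only the documented signature. `try_write_fasta` and the `FastaSource` protocol are hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Protocol


class FastaSource(Protocol):
    """Anything exposing the documented get_fasta_file signature."""

    def get_fasta_file(self, sequence_id: str, outfile_path: Path) -> None: ...


def try_write_fasta(source: FastaSource, sequence_id: str, outfile_path: Path) -> bool:
    """Write the FASTA file; return False if the sequence ID is unknown."""
    try:
        source.get_fasta_file(sequence_id, outfile_path)
    except KeyError:
        # Documented failure mode: no sequence data for the given ID.
        return False
    return True
```

A SeqRepoAccess instance could be passed as `source`. Any object with a matching method works just as well, which makes the wrapper testable without a local SeqRepo.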

get_reference_sequence(ac, start=None, end=None, residue_mode=ResidueMode.RESIDUE)

Get reference sequence for an accession given a start and end position. If start and end are not given, returns the entire reference sequence.

>>> from cool_seq_tool.handlers import SeqRepoAccess
>>> from biocommons.seqrepo import SeqRepo
>>> sr = SeqRepoAccess(SeqRepo("/usr/local/share/seqrepo/latest"))
>>> sr.get_reference_sequence("NM_002529.3", 1, 10)[0]
'TGCAGCTGG'
>>> sr.get_reference_sequence("NP_001341538.1", 1, 10)[0]
'MAALSGGGG'

Parameters:
  • ac (str) – Accession

  • start (Optional[int]) – Start position of the change

  • end (Optional[int]) – End position of the change. If None and start is given, end is assumed to equal start.

  • residue_mode (ResidueMode) – Residue mode for start and end

Return type:

Tuple[str, Optional[str]]

Returns:

Sequence at the requested position (empty string if the accession or positions do not exist), warning if any
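
In the doctest above, start=1 and end=10 return nine characters. One way to reproduce that count on a plain Python string is to shift both positions down by one and take a half-open slice. Whether SeqRepoAccess does this internally is an assumption, so treat the sketch as an illustration of the arithmetic only. `slice_like_doctest` is a hypothetical name.

```python
def slice_like_doctest(sequence: str, start: int, end: int) -> str:
    """Return sequence[start - 1 : end - 1], i.e. end - start characters."""
    # Assumption: both positions are shifted down by one and then used as a
    # half-open Python slice. This is not confirmed library behavior.
    return sequence[start - 1 : end - 1]


toy = "ABCDEFGHIJKL"  # toy string, not real sequence data
print(slice_like_doctest(toy, 1, 10))  # ABCDEFGHI (nine characters)
```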

translate_alias(input_str)

Get aliases for a given input.

Parameters:

input_str (str) – Input to get aliases for

Return type:

Tuple[List[Optional[str]], Optional[str]]

Returns:

List of aliases, warning

translate_identifier(ac, target_namespaces=None)

Return list of identifiers for accession.

>>> from cool_seq_tool.handlers import SeqRepoAccess
>>> from biocommons.seqrepo import SeqRepo
>>> sr = SeqRepoAccess(SeqRepo("/usr/local/share/seqrepo/latest"))
>>> sr.translate_identifier("NM_002529.3")[0]
['MD5:18f0a6e3af9e1bbd8fef1948c7156012', 'NCBI:NM_002529.3', 'refseq:NM_002529.3', 'SEGUID:dEJQBkga9d9VeBHTyTbg6JEtTGQ', 'SHA1:74425006481af5df557811d3c936e0e8912d4c64', 'VMC:GS_RSkww1aYmsMiWbNdNnOTnVDAM3ZWp1uA', 'sha512t24u:RSkww1aYmsMiWbNdNnOTnVDAM3ZWp1uA', 'ga4gh:SQ.RSkww1aYmsMiWbNdNnOTnVDAM3ZWp1uA']
>>> sr.translate_identifier("NM_002529.3", "ga4gh")[0]
['ga4gh:SQ.RSkww1aYmsMiWbNdNnOTnVDAM3ZWp1uA']

Parameters:
  • ac (str) – Identifier accession

  • target_namespaces – The namespace(s) of the identifiers to return

Return type:

Tuple[List[str], Optional[str]]

Returns:

List of identifiers, warning
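
The identifiers in the doctest output share a namespace:value shape, so the effect of a namespace filter can be illustrated on a plain list. The sketch below is not the library's implementation. `filter_by_namespace` is a hypothetical helper, and the input list is copied from the doctest output above.

```python
from typing import Iterable, List, Union

# Identifier list copied from the translate_identifier doctest output.
identifiers = [
    "MD5:18f0a6e3af9e1bbd8fef1948c7156012",
    "NCBI:NM_002529.3",
    "refseq:NM_002529.3",
    "SEGUID:dEJQBkga9d9VeBHTyTbg6JEtTGQ",
    "SHA1:74425006481af5df557811d3c936e0e8912d4c64",
    "VMC:GS_RSkww1aYmsMiWbNdNnOTnVDAM3ZWp1uA",
    "sha512t24u:RSkww1aYmsMiWbNdNnOTnVDAM3ZWp1uA",
    "ga4gh:SQ.RSkww1aYmsMiWbNdNnOTnVDAM3ZWp1uA",
]


def filter_by_namespace(
    identifiers: Iterable[str], namespaces: Union[str, Iterable[str]]
) -> List[str]:
    """Keep identifiers whose prefix matches one of the namespaces (case-sensitive)."""
    if isinstance(namespaces, str):
        namespaces = [namespaces]
    prefixes = tuple(f"{namespace}:" for namespace in namespaces)
    return [identifier for identifier in identifiers if identifier.startswith(prefixes)]


print(filter_by_namespace(identifiers, "ga4gh"))
# ['ga4gh:SQ.RSkww1aYmsMiWbNdNnOTnVDAM3ZWp1uA']
```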