cool_seq_tool.mappers.mane_transcript#

Retrieve MANE transcript from a location on p./c./g. coordinates.

Steps:

  1. Map annotation layer to genome

  2. Liftover to preferred genome (GRCh38). GRCh36 and earlier assemblies are not supported for fetching MANE transcripts.

  3. Select preferred compatible annotation (see transcript compatibility)

  4. Map back to correct annotation layer

In addition to a mapper utility class, this module also defines several vocabulary constraints and data models for coordinate representation.
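The p./c./g. layers above correspond to the RefSeq accession families used throughout this page (NP_ protein, NM_ transcript, NC_ chromosome). A minimal sketch of that correspondence follows. The helper `layer_for_accession` is hypothetical and not part of cool_seq_tool; the methods documented here take the annotation layer as an explicit argument instead of inferring it.

```python
from typing import Optional

# RefSeq accession prefix -> annotation layer letter (p./c./g.)
REFSEQ_PREFIX_TO_LAYER = {"NP_": "p", "NM_": "c", "NC_": "g"}


def layer_for_accession(ac: str) -> Optional[str]:
    """Hypothetical helper: guess the annotation layer from a RefSeq prefix.

    Returns None for anything that is not an NP_/NM_/NC_ accession
    (for example, Ensembl identifiers).
    """
    for prefix, layer in REFSEQ_PREFIX_TO_LAYER.items():
        if ac.upper().startswith(prefix):
            return layer
    return None


for accession in ("NP_004324.2", "NM_004333.6", "NC_000007.13"):
    print(accession, "->", layer_for_accession(accession))
```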

class cool_seq_tool.mappers.mane_transcript.CdnaRepresentation(**data)[source]#

Define object model for coding DNA representation

alt_ac: Optional[str][source]#
coding_end_site: int[source]#
coding_start_site: int[source]#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}[source]#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

class cool_seq_tool.mappers.mane_transcript.DataRepresentation(**data)[source]#

Define object model for final output representation

ensembl: Optional[str][source]#
gene: Optional[str][source]#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}[source]#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

pos: Tuple[int, int][source]#
refseq: str[source]#
status: TranscriptPriority[source]#
strand: Strand[source]#

class cool_seq_tool.mappers.mane_transcript.EndAnnotationLayer(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Define constraints for the end annotation layer. These determine which annotation layer is returned when getting the longest compatible remaining representation.

CDNA = 'AnnotationLayer.CDNA'[source]#
PROTEIN = 'AnnotationLayer.PROTEIN'[source]#
PROTEIN_AND_CDNA = 'p_and_c'[source]#

class cool_seq_tool.mappers.mane_transcript.GenomicRepresentation(**data)[source]#

Define object model for genomic representation

alt_ac: str[source]#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}[source]#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

pos: Tuple[int, int][source]#
refseq: str[source]#
status: TranscriptPriority[source]#

class cool_seq_tool.mappers.mane_transcript.ManeTranscript(seqrepo_access, transcript_mappings, mane_transcript_mappings, uta_db)[source]#

Class for retrieving MANE transcripts.

__init__(seqrepo_access, transcript_mappings, mane_transcript_mappings, uta_db)[source]#

Initialize the ManeTranscript class.

A handful of resources are required for initialization, so when defaults are enough, it’s easiest to let the core CoolSeqTool class handle it for you:

>>> from cool_seq_tool.app import CoolSeqTool
>>> mane_mapper = CoolSeqTool().mane_transcript

Note that most methods are defined as Python coroutines, so they must be called with await or run from an async event loop:

>>> import asyncio
>>> result = asyncio.run(mane_mapper.g_to_grch38("NC_000001.11", 100, 200))
>>> result['ac']
'NC_000001.11'

See the Usage section for more information.
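Because these methods are coroutines, several lookups can also be awaited together inside one event loop instead of issuing one `asyncio.run` call per query. The sketch below illustrates the pattern only. `fake_g_to_grch38` is a hypothetical stand-in (not part of cool_seq_tool) so the snippet runs without SeqRepo or UTA access, and the `"pos"` key in its return value is an assumption.

```python
import asyncio
from typing import Dict, List, Optional, Tuple


async def fake_g_to_grch38(ac: str, start_pos: int, end_pos: int) -> Optional[Dict]:
    """Hypothetical stand-in for ManeTranscript.g_to_grch38; echoes its input."""
    await asyncio.sleep(0)  # yield to the event loop, as a real query would
    return {"ac": ac, "pos": (start_pos, end_pos)}  # "pos" key is assumed


async def lift_many(queries: List[Tuple[str, int, int]]) -> List[Optional[Dict]]:
    """Await all lookups concurrently; results come back in input order."""
    return await asyncio.gather(*(fake_g_to_grch38(*q) for q in queries))


results = asyncio.run(
    lift_many([("NC_000001.11", 100, 200), ("NC_000007.14", 10, 20)])
)
```

With a real mapper, the stand-in would be replaced by `mane_mapper.g_to_grch38`; whether concurrent queries are actually faster depends on the underlying database connection.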

Parameters:
  • seqrepo_access (SeqRepoAccess) – Access to seqrepo queries

  • transcript_mappings (TranscriptMappings) – Access to transcript accession mappings and conversions

  • mane_transcript_mappings (ManeTranscriptMappings) – Access to MANE Transcript accession mapping data

  • uta_db (UtaDatabase) – UtaDatabase instance to give access to query UTA database

async g_to_grch38(ac, start_pos, end_pos)[source]#

Return genomic coordinate on GRCh38 when not given gene context.

Parameters:
  • ac (str) – Genomic accession

  • start_pos (int) – Genomic start position

  • end_pos (int) – Genomic end position

Return type:

Optional[Dict]

Returns:

NC accession, start and end pos on GRCh38 assembly

async g_to_mane_c(ac, start_pos, end_pos, gene=None, residue_mode=ResidueMode.RESIDUE)[source]#

Return MANE Transcript on the c. coordinate.

If a gene argument is provided, lifts over to GRCh38 and then retrieves the MANE cDNA representation.

>>> import asyncio
>>> from cool_seq_tool.app import CoolSeqTool
>>> cst = CoolSeqTool()
>>> result = asyncio.run(cst.mane_transcript.g_to_mane_c(
...     "NC_000007.13",
...     55259515,
...     None,
...     gene="EGFR"
... ))
>>> type(result)
<class 'cool_seq_tool.mappers.mane_transcript.CdnaRepresentation'>
>>> result.status
<TranscriptPriority.MANE_SELECT: 'mane_select'>
>>> del cst

Locating a MANE transcript requires a gene symbol argument – if none is given, this method will only lift over to genomic coordinates on GRCh38.

Parameters:
  • ac (str) – Genomic accession (g. coordinate)

  • start_pos (int) – genomic start position

  • end_pos (int) – genomic end position

  • gene (Optional[str]) – HGNC gene symbol

  • residue_mode (ResidueMode) – Residue mode of the start_pos and end_pos inputs. Returned coordinates are always inter-residue.

Return type:

Union[GenomicRepresentation, CdnaRepresentation, None]

Returns:

MANE transcript data on the c. coordinate if gene is provided; otherwise, GRCh38 genomic data.
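Since the return type is a union, callers typically need to branch on what came back. The sketch below shows one way to do that. The two dataclasses are hypothetical stand-ins that mirror only some of the documented fields (the `refseq` and `pos` fields on the cDNA stand-in are assumptions), and the accession strings are placeholders, not real results.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass
class GenomicStandIn:
    """Hypothetical stand-in with the documented GenomicRepresentation fields."""

    refseq: str
    pos: Tuple[int, int]
    status: str
    alt_ac: str


@dataclass
class CdnaStandIn:
    """Hypothetical stand-in for CdnaRepresentation (refseq/pos are assumed)."""

    refseq: str
    pos: Tuple[int, int]
    status: str
    coding_start_site: int
    coding_end_site: int
    alt_ac: Optional[str] = None


def describe(result: Union[GenomicStandIn, CdnaStandIn, None]) -> str:
    """Summarize which branch of the union was returned."""
    if result is None:
        return "no result"
    if isinstance(result, CdnaStandIn):
        return f"c. coordinates on {result.refseq}"
    return f"g. coordinates on {result.alt_ac}"


# Placeholder values for illustration only
print(describe(GenomicStandIn("NC_EXAMPLE.1", (10, 11), "grch38", "NC_EXAMPLE.1")))
print(describe(CdnaStandIn("NM_EXAMPLE.1", (5, 6), "mane_select", 100, 400)))
print(describe(None))
```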

async get_longest_compatible_transcript(start_pos, end_pos, start_annotation_layer, gene=None, ref=None, residue_mode=ResidueMode.RESIDUE, mane_transcripts=None, alt_ac=None, end_annotation_layer=None)[source]#

Get longest compatible transcript from a gene. See the documentation for the transcript compatibility policy for more information.

>>> import asyncio
>>> from cool_seq_tool.app import CoolSeqTool
>>> from cool_seq_tool.schemas import AnnotationLayer, ResidueMode
>>> mane_mapper = CoolSeqTool().mane_transcript
>>> mane_transcripts = {
...     "ENST00000646891.2",
...     "NM_001374258.1",
...     "NM_004333.6",
...     "ENST00000644969.2",
... }
>>> result = asyncio.run(mane_mapper.get_longest_compatible_transcript(
...     599,
...     599,
...     gene="BRAF",
...     start_annotation_layer=AnnotationLayer.PROTEIN,
...     residue_mode=ResidueMode.INTER_RESIDUE,
...     mane_transcripts=mane_transcripts,
... ))
>>> result.refseq
'NP_001365396.1'

If unable to find a match on GRCh38, this method will then attempt to drop down to GRCh37.


Parameters:
  • start_pos (int) – Start position change

  • end_pos (int) – End position change

  • start_annotation_layer (AnnotationLayer) – Starting annotation layer

  • gene (Optional[str]) – HGNC gene symbol

  • ref (Optional[str]) – Reference at position given during input

  • residue_mode (ResidueMode) – Residue mode for start_pos and end_pos

  • mane_transcripts (Optional[Set]) – MANE transcripts that were already attempted and found not compatible

  • alt_ac (Optional[str]) – Genomic accession

  • end_annotation_layer (Optional[EndAnnotationLayer]) – The end annotation layer. If not provided, will be set to EndAnnotationLayer.PROTEIN if start_annotation_layer == AnnotationLayer.PROTEIN, EndAnnotationLayer.CDNA otherwise
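The default described for end_annotation_layer can be sketched as follows. `StartLayer` and `EndLayer` are hypothetical stand-in enums (their values are assumptions, not the library's), and `resolve_end_layer` is an illustrative helper rather than a cool_seq_tool function.

```python
from enum import Enum
from typing import Optional


class StartLayer(str, Enum):
    """Hypothetical stand-in for AnnotationLayer; values are assumed."""

    PROTEIN = "p"
    CDNA = "c"
    GENOMIC = "g"


class EndLayer(str, Enum):
    """Hypothetical stand-in for EndAnnotationLayer; values are assumed."""

    PROTEIN = "protein"
    CDNA = "cdna"
    PROTEIN_AND_CDNA = "p_and_c"


def resolve_end_layer(start: StartLayer, end: Optional[EndLayer] = None) -> EndLayer:
    """Apply the documented default when no end layer is given."""
    if end is not None:
        return end  # an explicit choice always wins
    if start == StartLayer.PROTEIN:
        return EndLayer.PROTEIN
    return EndLayer.CDNA


print(resolve_end_layer(StartLayer.PROTEIN))
print(resolve_end_layer(StartLayer.GENOMIC))
```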

Return type:

Union[DataRepresentation, CdnaRepresentation, ProteinAndCdnaRepresentation, None]

Returns:

Data for longest compatible transcript

static get_mane_c_pos_change(mane_tx_genomic_data, coding_start_site)[source]#

Get the position change on the MANE c. coordinate.

Parameters:
  • mane_tx_genomic_data (Dict) – MANE transcript and genomic data

  • coding_start_site (int) – Coding start site

Return type:

Tuple[int, int]

Returns:

cDNA pos start, cDNA pos end

async get_mane_transcript(ac, start_pos, end_pos, start_annotation_layer, gene=None, ref=None, try_longest_compatible=False, residue_mode=ResidueMode.RESIDUE)[source]#

Return MANE transcript.

>>> from cool_seq_tool.app import CoolSeqTool
>>> from cool_seq_tool.schemas import AnnotationLayer, ResidueMode
>>> import asyncio
>>> mane_mapper = CoolSeqTool().mane_transcript
>>> result = asyncio.run(mane_mapper.get_mane_transcript(
...     "NP_004324.2",
...     599,
...     599,
...     AnnotationLayer.PROTEIN,
...     residue_mode=ResidueMode.INTER_RESIDUE,
... ))
>>> result.gene, result.refseq, result.status
('BRAF', 'NP_004324.2', <TranscriptPriority.MANE_SELECT: 'mane_select'>)

Parameters:
  • ac (str) – Accession

  • start_pos (int) – Start position change

  • end_pos (int) – End position change

  • start_annotation_layer (AnnotationLayer) – Starting annotation layer.

  • gene (Optional[str]) – HGNC gene symbol

  • ref (Optional[str]) – Reference at position given during input

  • try_longest_compatible (bool) – If True, fall back to the longest compatible remaining transcript when the MANE transcript is not compatible.

  • residue_mode (ResidueMode) – Residue mode of the start_pos and end_pos inputs. Returned coordinates are always inter-residue.

Return type:

Union[DataRepresentation, CdnaRepresentation, None]

Returns:

MANE data, or data for the longest compatible transcript, if validation checks pass; otherwise None. Coordinates are returned as inter-residue.

async grch38_to_mane_c_p(alt_ac, start_pos, end_pos, gene=None, residue_mode=ResidueMode.RESIDUE, try_longest_compatible=False)[source]#

Given a GRCh38 genomic representation, return the cDNA and protein representation.

Will try MANE Select and then MANE Plus Clinical. If neither is found and try_longest_compatible is set to true, will also try to find the longest compatible remaining representation.
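The fallback order described above can be sketched as a simple priority walk. `Priority` is a hypothetical stand-in for TranscriptPriority: only the `'mane_select'` value appears on this page, and the other two values are assumptions. `pick_transcript` is an illustrative helper, not a cool_seq_tool function, and the transcript names are placeholders.

```python
from enum import Enum
from typing import Dict, Optional


class Priority(str, Enum):
    """Hypothetical stand-in for TranscriptPriority."""

    MANE_SELECT = "mane_select"
    MANE_PLUS_CLINICAL = "mane_plus_clinical"  # assumed value
    LONGEST_COMPATIBLE_REMAINING = "longest_compatible_remaining"  # assumed value


def pick_transcript(
    candidates: Dict[Priority, Optional[str]],
    try_longest_compatible: bool = False,
) -> Optional[str]:
    """Return the first available candidate in priority order, else None."""
    order = [Priority.MANE_SELECT, Priority.MANE_PLUS_CLINICAL]
    if try_longest_compatible:
        # Only consulted when the caller opts in
        order.append(Priority.LONGEST_COMPATIBLE_REMAINING)
    for priority in order:
        hit = candidates.get(priority)
        if hit is not None:
            return hit
    return None


# Placeholder transcript names for illustration only
only_longest = {Priority.LONGEST_COMPATIBLE_REMAINING: "TX_LONGEST"}
print(pick_transcript(only_longest))
print(pick_transcript(only_longest, try_longest_compatible=True))
```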

Parameters:
  • alt_ac (str) – Genomic RefSeq accession on GRCh38

  • start_pos (int) – Start position

  • end_pos (int) – End position

  • gene (Optional[str]) – HGNC gene symbol

  • residue_mode (ResidueMode) – Residue mode of the start_pos and end_pos inputs. Returned coordinates are always inter-residue.

  • try_longest_compatible (bool) – If True, fall back to the longest compatible remaining transcript when the MANE transcript(s) are not compatible.

Return type:

Optional[Dict]

Returns:

If successful, the cDNA and protein representation for the MANE transcript, or for the longest compatible remaining transcript if try_longest_compatible is True. Coordinates are inter-residue.

class cool_seq_tool.mappers.mane_transcript.ProteinAndCdnaRepresentation(**data)[source]#

Define object model for protein and cDNA representation

cdna: CdnaRepresentation[source]#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}[source]#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

protein: DataRepresentation[source]#