Usage

Cool-Seq-Tool provides easy access to, and useful operations on, a selection of important genomic resources. Modules are divided into three groups:

  • Data sources, for basic acquisition and setup for a data source via Python

  • Data handlers, for additional operations on top of existing sources

  • Data mappers, for functions that incorporate multiple sources/handlers to produce output

The core CoolSeqTool class encapsulates all of their functions and can be used for easy initialization and access:

>>> from cool_seq_tool import CoolSeqTool
>>> cst = CoolSeqTool()
>>> cst.seqrepo_access.translate_alias("NM_002529.3")[0][-1]
'ga4gh:SQ.RSkww1aYmsMiWbNdNnOTnVDAM3ZWp1uA'
>>> cst.transcript_mappings.ensembl_protein_for_gene_symbol["BRAF"][0]
'ENSP00000419060'
>>> import asyncio
>>> asyncio.run(cst.uta_db.get_ac_from_gene("BRAF"))
['NC_000007.14', 'NC_000007.13']

Descriptions and examples of functions can be found in the API Reference section.

Note

Many component classes in Cool-Seq-Tool, including UtaDatabase, ExonGenomicCoordsMapper, and ManeTranscript, define public methods as async. Calling such a method returns a coroutine rather than a result, so inside another async function the call must be awaited:

from cool_seq_tool import CoolSeqTool

async def do_thing():
    mane_mapper = CoolSeqTool().mane_transcript
    result = mane_mapper.g_to_grch38("NC_000001.11", 100, 200)
    print(type(result))
    # <class 'coroutine'>
    awaited_result = await result
    print(awaited_result)
    # {'ac': 'NC_000001.11', 'pos': (100, 200)}

In a REPL, asyncio.run() can be used to call coroutines outside of functions. Many of our docstring examples will use this pattern.

>>> import asyncio
>>> from cool_seq_tool import CoolSeqTool
>>> mane_mapper = CoolSeqTool().mane_transcript
>>> result = asyncio.run(mane_mapper.g_to_grch38("NC_000001.11", 100, 200))
>>> print(result)
{'ac': 'NC_000001.11', 'pos': (100, 200)}

See the asyncio module documentation for more information.
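
The coroutine-versus-result distinction described in this note can be reproduced without a database connection. The following is a minimal sketch, not part of Cool-Seq-Tool: StubMapper is a hypothetical stand-in whose g_to_grch38 method simply echoes its arguments, and lookup_many is an illustrative helper that awaits several calls with asyncio.gather. Whether concurrent calls are appropriate for the real classes is not claimed here.

```python
import asyncio
import inspect


class StubMapper:
    """Hypothetical stand-in for an async component such as ManeTranscript."""

    async def g_to_grch38(self, ac: str, start: int, end: int) -> dict:
        # Yield control once, as a real database round trip would.
        await asyncio.sleep(0)
        return {"ac": ac, "pos": (start, end)}


async def lookup_many(mapper, queries):
    """Await several lookups from inside one coroutine."""
    coros = [mapper.g_to_grch38(ac, start, end) for ac, start, end in queries]
    # gather returns results in the same order as the input coroutines.
    return await asyncio.gather(*coros)


def run_demo():
    mapper = StubMapper()
    # Calling an async method without await only creates a coroutine object.
    pending = mapper.g_to_grch38("NC_000001.11", 100, 200)
    is_coro = inspect.iscoroutine(pending)
    # asyncio.run drives the coroutine to completion from synchronous code.
    single = asyncio.run(pending)
    many = asyncio.run(
        lookup_many(mapper, [("NC_000001.11", 100, 200), ("NC_000007.14", 5, 10)])
    )
    return is_coro, single, many


if __name__ == "__main__":
    print(run_demo())
```

Swapping StubMapper for a real component would require a configured UTA database and SeqRepo installation.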

Environment configuration

Individual classes will accept arguments upon initialization to set parameters regarding data sources. In general, these parameters are also configurable via environment variables, e.g. in a cloud deployment.
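
As a rough illustration of the environment-variable route, the sketch below sets two of the variables described in this section from within Python. The configure_environment helper is hypothetical and not part of Cool-Seq-Tool, and the values shown are the documented defaults. It is assumed here that the variables may be read as early as import time, so the conservative choice is to prepare the environment before importing the package.

```python
import os


def configure_environment(overrides: dict[str, str]) -> dict[str, str]:
    """Apply settings to os.environ without clobbering values already present.

    Hypothetical helper for illustration; returns the values actually in effect.
    """
    applied = {}
    for name, value in overrides.items():
        # setdefault keeps a value supplied by the deployment (e.g. a container).
        applied[name] = os.environ.setdefault(name, value)
    return applied


settings = configure_environment(
    {
        "SEQREPO_ROOT_DIR": "/usr/local/share/seqrepo/latest",
        "UTA_DB_URL": "postgresql://anonymous@localhost:5432/uta/uta_20241220",
    }
)

# Import only after the environment is prepared (assumption: values may be
# read at import time).
# from cool_seq_tool import CoolSeqTool
# cst = CoolSeqTool()
```

In a cloud deployment the same variables would more commonly be set in the service or container configuration, in which case the helper leaves them untouched.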

| Variable | Description |
| --- | --- |
| LRG_REFSEQGENE_PATH | Path to LRG_RefSeqGene file. Used in TranscriptMappings to provide mappings between gene symbols and RefSeq/Ensembl transcript accessions. If not defined, uses wags-tails to fetch the latest version, downloading it from the NCBI server if necessary. |
| TRANSCRIPT_MAPPINGS_PATH | Path to transcript mapping file generated from Ensembl BioMart. Used in TranscriptMappings. If not defined, uses a copy of the file that is bundled within the Cool-Seq-Tool installation. See the contributor instructions for information on manually rebuilding it. |
| MANE_SUMMARY_PATH | Path to MANE Summary file. Used in ManeTranscriptMappings to provide MANE transcript annotations. If not defined, uses wags-tails to fetch the latest version, downloading it from the NCBI server if necessary. |
| SEQREPO_ROOT_DIR | Path to SeqRepo directory (i.e. contains aliases.sqlite3 database file, and sequences directory). Used by SeqRepoAccess. If not defined, defaults to /usr/local/share/seqrepo/latest. |
| UTA_DB_URL | A libpq connection string, i.e. of the form postgresql://<user>:<password>@<host>:<port>/<database>/<schema>, used by the UtaDatabase class. By default, it is set to postgresql://anonymous@localhost:5432/uta/uta_20241220. |
| LIFTOVER_CHAIN_37_TO_38 | A path to a chainfile for lifting from GRCh37 to GRCh38. Used by the LiftOver class as input to agct. If not provided, agct will fetch it automatically from UCSC. |
| LIFTOVER_CHAIN_38_TO_37 | A path to a chainfile for lifting from GRCh38 to GRCh37. Used by the LiftOver class as input to agct. If not provided, agct will fetch it automatically from UCSC. |
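
To check a UTA_DB_URL value before handing it to the library, the documented form postgresql://<user>:<password>@<host>:<port>/<database>/<schema> can be taken apart with the standard library. This is a sketch only: split_uta_url is a hypothetical helper, not a Cool-Seq-Tool function, and it makes no claim about how UtaDatabase itself parses the string.

```python
from urllib.parse import urlsplit


def split_uta_url(url: str) -> dict:
    """Split a UTA-style connection URL into its components (illustrative only)."""
    parts = urlsplit(url)
    # The path has the form "/<database>/<schema>"; the schema may be absent.
    database, _, schema = parts.path.lstrip("/").partition("/")
    return {
        "user": parts.username,
        "password": parts.password,  # None when the URL carries no password
        "host": parts.hostname,
        "port": parts.port,
        "database": database,
        "schema": schema or None,
    }


if __name__ == "__main__":
    print(split_uta_url("postgresql://anonymous@localhost:5432/uta/uta_20241220"))
```

A check like this can surface a missing schema segment or a mistyped port early, before any connection attempt is made.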