Usage#

Cool-Seq-Tool provides easy access to, and useful operations on, a selection of important genomic resources. Modules are divided into three groups:

  • Data sources, for basic acquisition and setup for a data source via Python

  • Data handlers, for additional operations on top of existing sources

  • Data mappers, for functions that incorporate multiple sources/handlers to produce output

The core CoolSeqTool class encapsulates all of their functions and can be used for easy initialization and access:

>>> from cool_seq_tool import CoolSeqTool
>>> cst = CoolSeqTool()
>>> cst.seqrepo_access.translate_alias("NM_002529.3")[0][-1]
'ga4gh:SQ.RSkww1aYmsMiWbNdNnOTnVDAM3ZWp1uA'
>>> cst.transcript_mappings.ensembl_protein_for_gene_symbol["BRAF"][0]
'ENSP00000419060'
>>> await cst.uta_db.get_ac_from_gene("BRAF")
['NC_000007.14', 'NC_000007.13']

Descriptions and examples of functions can be found in the API Reference section.

Note

Many component classes in CoolSeqTool, including UtaDatabase, ExonGenomicCoordsMapper, and ManeTranscript, define public methods as async. This means that, when used inside another function, they must be called with await:

from cool_seq_tool import CoolSeqTool

async def do_thing():
    mane_mapper = CoolSeqTool().mane_transcript
    result = mane_mapper.g_to_grch38("NC_000001.11", 100, 200)
    print(type(result))
    # <class 'coroutine'>
    awaited_result = await result
    print(awaited_result)
    # {'ac': 'NC_000001.11', 'pos': (100, 200)}

In a REPL, asyncio.run() can be used to call coroutines outside of functions. Many of our docstring examples will use this pattern.

>>> import asyncio
>>> from cool_seq_tool import cool_seq_tool
>>> mane_mapper = CoolSeqTool().mane_transcript
>>> result = asyncio.run(mane_mapper.g_to_grch38("NC_000001.11", 100, 200))
>>> print(result)
{'ac': 'NC_000001.11', 'pos': (100, 200)}

See the asyncio module documentation for more information.

Environment configuration#

Individual classes will accept arguments upon initialization to set parameters regarding data sources. In general, these parameters are also configurable via environment variables, e.g. in a cloud deployment.

Variable

Description

LRG_REFSEQGENE_PATH

Path to LRG_RefSeqGene file. Used in TranscriptMappings to provide mappings between gene symbols and RefSeq/Ensembl transcript accessions. If not defined, uses wags-tails to fetch the latest version, downloading it from the NCBI server if necessary.

TRANSCRIPT_MAPPINGS_PATH

Path to transcript mapping file generated from Ensembl BioMart. Used in TranscriptMappings. If not defined, uses a copy of the file that is bundled within the Cool-Seq-Tool installation. See the contributor instructions for information on manually rebuilding it.

MANE_SUMMARY_PATH

Path to MANE Summary file. Used in ManeTranscriptMappings to provide MANE transcript annotations. If not defined, uses wags-tails to fetch the latest version, downloading it from the NCBI server if necessary.

SEQREPO_ROOT_DIR

Path to SeqRepo directory (i.e. contains aliases.sqlite3 database file, and sequences directory). Used by SeqRepoAccess. If not defined, defaults to /usr/local/share/seqrepo/latest.

UTA_DB_URL

A libpq connection string, i.e. of the form postgresql://<user>:<password>@<host>:<port>/<database>/<schema>, used by the UtaDatabase class. By default, it is set to postgresql://uta_admin:uta@localhost:5432/uta/uta_20210129b.

LIFTOVER_CHAIN_37_TO_38

A path to a chainfile for lifting from GRCh37 to GRCh38. Used by the LiftOver class as input to agct. If not provided, agct will fetch it automatically from UCSC.

LIFTOVER_CHAIN_38_TO_37

A path to a chainfile for lifting from GRCh38 to GRCh37. Used by the LiftOver class as input to agct. If not provided, agct will fetch it automatically from UCSC.