Usage

Cool-Seq-Tool provides easy access to, and useful operations on, a selection of important genomic resources. Modules are divided into three groups:

Data sources, for basic acquisition and setup for a data source via Python
Data handlers, for additional operations on top of existing sources
Data mappers, for functions that incorporate multiple sources/handlers to produce output

The core CoolSeqTool class encapsulates all of their functions and can be used for easy initialization and access:

>>> from cool_seq_tool import CoolSeqTool
>>> cst = CoolSeqTool()
>>> cst.seqrepo_access.translate_alias("NM_002529.3")[0][-1]
'ga4gh:SQ.RSkww1aYmsMiWbNdNnOTnVDAM3ZWp1uA'
>>> cst.transcript_mappings.ensembl_protein_for_gene_symbol["BRAF"][0]
'ENSP00000419060'
>>> await cst.uta_db.get_ac_from_gene("BRAF")
['NC_000007.14', 'NC_000007.13']

Descriptions and examples of functions can be found in the API Reference section.

Note

Many component classes in CoolSeqTool, including UtaDatabase, ExonGenomicCoordsMapper, and ManeTranscript, define public methods as async. Calling one of these methods returns a coroutine rather than a result, so inside another async function the call must be awaited with await:

from cool_seq_tool import CoolSeqTool

async def do_thing():
    mane_mapper = CoolSeqTool().mane_transcript
    result = mane_mapper.g_to_grch38("NC_000001.11", 100, 200)
    print(type(result))
    # <class 'coroutine'>
    awaited_result = await result
    print(awaited_result)
    # {'ac': 'NC_000001.11', 'pos': (100, 200)}

In a REPL, asyncio.run() can be used to call coroutines outside of functions. Many of our docstring examples will use this pattern.

>>> import asyncio
>>> from cool_seq_tool import CoolSeqTool
>>> mane_mapper = CoolSeqTool().mane_transcript
>>> result = asyncio.run(mane_mapper.g_to_grch38("NC_000001.11", 100, 200))
>>> print(result)
{'ac': 'NC_000001.11', 'pos': (100, 200)}

See the asyncio module documentation for more information.
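
The same pattern can be tried without installing any data sources. The sketch below uses only the standard library; `fake_lookup` is a hypothetical stand-in for an async method and is not part of Cool-Seq-Tool. It shows that calling an async function yields a coroutine, that awaiting it yields the result, and that `asyncio.gather()` can await several calls together.

```python
import asyncio
import inspect


async def fake_lookup(ac: str, start: int, end: int) -> dict:
    """Hypothetical stand-in for an async method; not a Cool-Seq-Tool API."""
    await asyncio.sleep(0)  # yield control, as a real database query would
    return {"ac": ac, "pos": (start, end)}


async def main() -> list:
    pending = fake_lookup("NC_000001.11", 100, 200)
    # Calling an async function only creates a coroutine object.
    assert inspect.iscoroutine(pending)
    first = await pending  # awaiting runs it and returns the result
    # gather() awaits several coroutines and returns results in call order.
    rest = await asyncio.gather(
        fake_lookup("NC_000007.14", 1, 10),
        fake_lookup("NC_000007.13", 1, 10),
    )
    return [first, *rest]


# Outside of any function (e.g. in a script), asyncio.run() drives the coroutine.
results = asyncio.run(main())
print(results)
```

Note that `asyncio.run()` starts a new event loop, so it is meant for top-level code and cannot be called from inside a coroutine that is already running.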

Environment configuration

Individual classes accept arguments upon initialization that configure their data sources. In general, the same parameters can also be set via environment variables, which is useful in, e.g., a cloud deployment.

Variable	Description
`LRG_REFSEQGENE_PATH`	Path to LRG_RefSeqGene file. Used in `TranscriptMappings` to provide mappings between gene symbols and RefSeq/Ensembl transcript accessions. If not defined, uses wags-tails to fetch the latest version, downloading it from the NCBI server if necessary.
`TRANSCRIPT_MAPPINGS_PATH`	Path to transcript mapping file generated from Ensembl BioMart. Used in `TranscriptMappings`. If not defined, uses a copy of the file that is bundled within the Cool-Seq-Tool installation. See the contributor instructions for information on manually rebuilding it.
`MANE_SUMMARY_PATH`	Path to MANE Summary file. Used in `ManeTranscriptMappings` to provide MANE transcript annotations. If not defined, uses wags-tails to fetch the latest version, downloading it from the NCBI server if necessary.
`SEQREPO_ROOT_DIR`	Path to SeqRepo directory (i.e. contains `aliases.sqlite3` database file, and `sequences` directory). Used by `SeqRepoAccess`. If not defined, defaults to `/usr/local/share/seqrepo/latest`.
`UTA_DB_URL`	A libpq connection string, i.e. of the form `postgresql://<user>:<password>@<host>:<port>/<database>/<schema>`, used by the `UtaDatabase` class. By default, it is set to `postgresql://uta_admin:uta@localhost:5432/uta/uta_20210129b`.
`LIFTOVER_CHAIN_37_TO_38`	A path to a chainfile for lifting from GRCh37 to GRCh38. Used by the `LiftOver` class as input to agct. If not provided, agct will fetch it automatically from UCSC.
`LIFTOVER_CHAIN_38_TO_37`	A path to a chainfile for lifting from GRCh38 to GRCh37. Used by the `LiftOver` class as input to agct. If not provided, agct will fetch it automatically from UCSC.
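
As a rough illustration of the table above, the following sketch sets two of the variables from Python and splits a `UTA_DB_URL` value into the components named in its description. It uses only the standard library. The helper `split_uta_db_url` is hypothetical and is not how Cool-Seq-Tool itself parses the value; the sketch also assumes that the variables should be set before the relevant classes are initialized.

```python
import os
from urllib.parse import urlsplit

# Default values quoted in the table above.
DEFAULT_UTA_DB_URL = "postgresql://uta_admin:uta@localhost:5432/uta/uta_20210129b"
DEFAULT_SEQREPO_ROOT_DIR = "/usr/local/share/seqrepo/latest"


def split_uta_db_url(url: str) -> dict:
    """Hypothetical helper: split <user>:<password>@<host>:<port>/<database>/<schema>."""
    parts = urlsplit(url)
    # The path looks like "/<database>/<schema>".
    database, schema = parts.path.lstrip("/").split("/", 1)
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": database,
        "schema": schema,
    }


# setdefault() keeps any value already provided by the deployment environment.
os.environ.setdefault("UTA_DB_URL", DEFAULT_UTA_DB_URL)
os.environ.setdefault("SEQREPO_ROOT_DIR", DEFAULT_SEQREPO_ROOT_DIR)

print(split_uta_db_url(os.environ["UTA_DB_URL"])["schema"])
```

Checking the pieces this way can help catch a missing schema segment before a database connection is attempted.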