Genomic Regions

Genomic Regions Modules

These modules contain functions and classes for working with genomic regions. They provide utilities for handling and analyzing genomic coordinates, such as calculating overlaps, extracting sequences, and performing various genomic operations.

  • GRegion is a single region.

  • GRegions is a collection of many GRegion objects.

  • GRegionsSet is a set of many GRegions which represent different genomic elements.

class genomkit.regions.gregion.GRegion(sequence: str, start: int, end: int, orientation: str = '.', name: str = '', score: float = 0, data: list = [])

GRegion module

This module contains functions and classes for working with a genomic region. It provides utilities for handling and analyzing a single genomic coordinate.

bed_entry(data: bool = False)

Export regions in BED format

Parameters:

data (bool, optional) – Define whether to export extra data, defaults to False

Returns:

A string in BED format

Return type:

str
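
As a rough illustration of what such a BED line looks like, the sketch below joins the standard BED columns with tabs. The helper format_bed_line, its argument names, and the choice to append extra data as trailing columns are assumptions made for illustration; this is not GenomKit's implementation.

```python
def format_bed_line(sequence, start, end, name="", score=0, orientation=".",
                    data=None, include_data=False):
    """Hypothetical helper: build one tab-separated BED line."""
    # The first six BED columns: chrom, start, end, name, score, strand.
    fields = [sequence, str(start), str(end), name, str(score), orientation]
    # Assumption: extra data, if requested, is appended as additional columns.
    if include_data and data:
        fields.extend(str(item) for item in data)
    return "\t".join(fields)


line = format_bed_line("chr1", 100, 200, name="peak1", score=5, orientation="+")
print(line)
```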

distance(region)

Return the distance between two GRegions. If overlapping, return 0; if on different chromosomes, return None.

Parameters:

region (GRegion) – A given GRegion object

Returns:

distance

Return type:

int or None if distance is not available
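
The three cases described above (overlap, same chromosome, different chromosomes) can be sketched with plain tuples. The function interval_distance and the half-open (sequence, start, end) representation are assumptions for illustration; how the library measures the gap between adjacent regions may differ.

```python
def interval_distance(a, b):
    """Hypothetical sketch: distance between two (sequence, start, end) tuples.

    Coordinates are assumed to be half-open, as in BED files.
    """
    seq_a, start_a, end_a = a
    seq_b, start_b, end_b = b
    if seq_a != seq_b:
        # Different chromosomes: no meaningful distance.
        return None
    if start_a < end_b and start_b < end_a:
        # The intervals overlap.
        return 0
    # Gap between the end of the left interval and the start of the right one.
    return max(start_a, start_b) - min(end_a, end_b)
```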

extend(upstream: int = 0, downstream: int = 0, strandness: bool = False, inplace: bool = True)

Extend the GRegion by the given extension length.

Parameters:
  • upstream (int) – Define how many bp to extend toward upstream direction.

  • downstream (int) – Define how many bp to extend toward downstream direction.

  • strandness (bool) – Define whether strandness is considered.

  • inplace (bool) – Define whether this operation will be applied on the same object (True) or return a new object.

Returns:

None or a GRegion object

extend_fold(upstream: float = 0.0, downstream: float = 0.0, strandness: bool = False, inplace: bool = True)

Extend the GRegion by the given extension length, expressed as a percentage of the region length.

Parameters:
  • upstream (float) – Define the percentage of the region length to extend toward upstream direction.

  • downstream (float) – Define the percentage of the region length to extend toward downstream direction.

  • strandness (bool) – Define whether strandness is considered.

  • inplace (bool) – Define whether this operation will be applied on the same object (True) or return a new object.

Returns:

None or a GRegion object
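
To make the roles of upstream, downstream, and strandness concrete for both extend and extend_fold, here is a minimal sketch on bare coordinates. The function names, the clamping of the start at 0, and the reading of the "percentage" as a fraction of the region length (0.5 meaning 50%) are all assumptions, not confirmed behavior of the library.

```python
def extend_interval(start, end, upstream=0, downstream=0,
                    orientation=".", strandness=False):
    """Hypothetical sketch of a strand-aware extension."""
    if strandness and orientation == "-":
        # On the reverse strand, upstream lies toward larger coordinates.
        new_start, new_end = start - downstream, end + upstream
    else:
        new_start, new_end = start - upstream, end + downstream
    # Assumption: coordinates are never allowed to drop below 0.
    return max(new_start, 0), new_end


def extend_interval_fold(start, end, upstream=0.0, downstream=0.0,
                         orientation=".", strandness=False):
    """Hypothetical sketch: extension given as a fraction of the region length."""
    length = end - start
    return extend_interval(start, end,
                           int(length * upstream), int(length * downstream),
                           orientation, strandness)
```

Negative values shrink the interval in this sketch, since they are simply added with the opposite sign.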

overlap(region, strandness=False)

Return True if the GRegion overlaps with the given region, otherwise False.

Parameters:
  • region (GRegion) – A given GRegion object

  • strandness (bool) – Define whether strandness is considered.

Returns:

True or False

Return type:

bool
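
An overlap test of this kind usually reduces to two comparisons. The sketch below assumes half-open coordinates and represents each region as a (sequence, start, end, orientation) tuple; the function name and the rule that strandness requires identical orientations are illustrative assumptions.

```python
def intervals_overlap(a, b, strandness=False):
    """Hypothetical sketch: a and b are (sequence, start, end, orientation)."""
    if a[0] != b[0]:
        # Regions on different sequences never overlap.
        return False
    if strandness and a[3] != b[3]:
        # Assumption: with strandness, both regions must share the orientation.
        return False
    # Half-open intervals overlap when each starts before the other ends.
    return a[1] < b[2] and b[1] < a[2]
```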

resize(extend_upstream: int, extend_downstream: int, center='mid_point')

Return a resized GRegion according to the defined center and extension.

Parameters:
  • extend_upstream (int) – Define extension length toward upstream

  • extend_downstream (int) – Define extension length toward downstream

  • center (str, optional) – Define the new center, defaults to “mid_point”, other options are “5prime” and “3prime”

Returns:

A resized GRegion

Return type:

GRegion
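
The effect of the center option can be sketched as picking an anchor coordinate and extending from it. The function below is a hypothetical illustration on bare coordinates; in particular, the way "5prime" and "3prime" map to start and end on the reverse strand is an assumption.

```python
def resize_interval(start, end, extend_upstream, extend_downstream,
                    center="mid_point", orientation="."):
    """Hypothetical sketch: rebuild an interval around a chosen anchor."""
    reverse = orientation == "-"
    if center == "mid_point":
        anchor = (start + end) // 2
    elif center == "5prime":
        # Assumption: the 5' end is the end coordinate on the reverse strand.
        anchor = end if reverse else start
    elif center == "3prime":
        anchor = start if reverse else end
    else:
        raise ValueError(f"unknown center: {center!r}")
    if reverse:
        # Upstream points toward larger coordinates on the reverse strand.
        return max(anchor - extend_downstream, 0), anchor + extend_upstream
    return max(anchor - extend_upstream, 0), anchor + extend_downstream
```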

class genomkit.regions.gregions.GRegions(name: str = '', load: str = '')

GRegions module

This module contains functions and classes for working with a collection of genomic regions. It provides utilities for handling and analyzing the interactions of many genomic coordinates.

add(region)

Append a GRegion at the end of the elements of GRegions.

Parameters:

region (GRegion) – A GRegion

close_regions(target, max_dis=10000)

Return a new GRegions containing the region(s) of target that are closest to any region in self. If there is an intersection, return False.

Parameters:
  • target (GRegions) – The GRegions to compare with

  • max_dis (int, optional) – maximum distance, defaults to 10000

Returns:

Close regions

Return type:

GRegions

cluster(max_distance)

Cluster the regions with a certain distance and return a new GRegions.

Parameters:

max_distance (int) – Maximal distance for combining

Returns:

Combined regions

Return type:

GRegions

self           ----           ----            ----
                  ----             ----                 ----
Result(d=1)    -------        ---------       ----      ----
Result(d=10)   ---------------------------------------------
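
The behavior in the diagram can be reproduced with a single pass over sorted intervals. This is a minimal sketch on (sequence, start, end) tuples; the function name and the choice to combine regions whose gap is at most max_distance are assumptions, not the library's code.

```python
def cluster_intervals(intervals, max_distance):
    """Hypothetical sketch: combine intervals separated by at most max_distance."""
    clustered = []
    # Sorting groups intervals by sequence, then by start coordinate.
    for seq, start, end in sorted(intervals):
        if (clustered and clustered[-1][0] == seq
                and start - clustered[-1][2] <= max_distance):
            # Close enough to the previous cluster: grow it.
            last = clustered[-1]
            clustered[-1] = (seq, last[1], max(last[2], end))
        else:
            clustered.append((seq, start, end))
    return clustered


regions = [("chr1", 0, 10), ("chr1", 12, 20), ("chr1", 100, 110)]
print(cluster_intervals(regions, 5))
print(cluster_intervals(regions, 100))
```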

extend(upstream: int = 0, downstream: int = 0, strandness: bool = False, inplace: bool = True)

Perform the extend step for every element. The extension length can also be negative, which shrinks the regions.

Parameters:
  • upstream (int) – Define how many bp to extend toward upstream direction.

  • downstream (int) – Define how many bp to extend toward downstream direction.

  • strandness (bool) – Define whether strandness is considered.

  • inplace (bool) – Define whether this operation will be applied on the same object (True) or return a new object.

Returns:

None or a GRegions object

extend_fold(upstream: float = 0.0, downstream: float = 0.0, strandness: bool = False, inplace: bool = True)

Perform the extend_fold step for every element. The extension can also be negative, which shrinks the regions.

Parameters:
  • upstream (float) – Define the percentage of the region length to extend toward upstream direction.

  • downstream (float) – Define the percentage of the region length to extend toward downstream direction.

  • strandness (bool) – Define whether strandness is considered.

  • inplace (bool) – Define whether this operation will be applied on the same object (True) or return a new object.

Returns:

None or a GRegions object

filter_by_names(names, inplace=False)

Filter the elements by the given list of names

Parameters:
  • names (list) – A list of names for filtering

  • inplace (bool, defaults to False) – Define whether this operation will be applied on the same object (True) or return a new object.

Returns:

A GRegions with filtered regions

Return type:

GRegions

filter_by_score(larger_than=None, smaller_than=None, inplace=False)

Filter the elements by the given range of scores

Parameters:
  • larger_than (float) – Define the minimal cutoff. Only regions with a score larger than this value are kept.

  • smaller_than (float) – Define the maximal cutoff. Only regions with a score smaller than this value are kept.

  • inplace (bool, defaults to False) – Define whether this operation will be applied on the same object (True) or return a new object.

Returns:

A GRegions with filtered regions

Return type:

GRegions
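
The two cutoffs can be read as an optional lower and upper bound on the score. The sketch below works on plain dictionaries with a "score" key; the function name, the strict inequalities, and the rule that both cutoffs must hold when both are given are assumptions for illustration.

```python
def filter_scores(regions, larger_than=None, smaller_than=None):
    """Hypothetical sketch: keep regions whose score lies inside the cutoffs."""
    kept = []
    for region in regions:
        score = region["score"]
        # A cutoff set to None is treated as "no limit on this side".
        if larger_than is not None and not score > larger_than:
            continue
        if smaller_than is not None and not score < smaller_than:
            continue
        kept.append(region)
    return kept


peaks = [{"name": "a", "score": 1.0},
         {"name": "b", "score": 5.0},
         {"name": "c", "score": 9.0}]
print([p["name"] for p in filter_scores(peaks, larger_than=2, smaller_than=8)])
```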

get_GSequences(FASTA_file)

Return a GSequences object according to the loci on the given reference FASTA file.

Parameters:

FASTA_file (str) – Path to the FASTA file

Returns:

A GSequences.

Return type:

GSequences

get_names(unique: bool = False)

Return a list of all region names. If a region's name is None, the region string is returned instead.

Returns:

A list of all regions’ names.

Return type:

list

get_sequences(unique: bool = False)

Return all chromosomes.

Parameters:

unique (bool) – Only the unique names.

Returns:

A list of all chromosomes.

Return type:

list

intersect(target, mode: str = 'OVERLAP', rm_duplicates: bool = False)

Return a GRegions for the intersections between the two given GRegions objects. There are three modes for overlapping:

mode = “OVERLAP”

Return a new GRegions including only the overlapping regions with target GRegions.

Note

This mode will merge the regions.

self       ----------              ------
target            ----------                    ----
Result            ---

mode = “ORIGINAL”

Return the regions of the original GRegions (self) which have any intersection with the target GRegions.

self       ----------              ------
target          ----------                    ----
Result     ----------

mode = “COMP_INCL”

Return the region(s) of the original GRegions (self) which are ‘completely’ included by the target GRegions.

self        -------------             ------
target              ----------      ---------------       ----
Result                                ------

Parameters:
  • target (GRegions) – A target GRegions for finding overlaps.

  • mode (str) – The mode should be one of the following: “OVERLAP”, “ORIGINAL”, or “COMP_INCL”.

  • rm_duplicates (bool) – Define whether to remove duplicates.

Returns:

A GRegions.

Return type:

GRegions
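
The difference between the three modes is easier to see side by side. The sketch below applies each mode to (sequence, start, end) tuples with half-open coordinates. It is a hypothetical illustration, not the library's implementation, and it does not merge the regions first as the note for "OVERLAP" describes.

```python
def intersect_intervals(self_regions, target_regions, mode="OVERLAP"):
    """Hypothetical sketch of the three intersection modes."""
    if mode not in ("OVERLAP", "ORIGINAL", "COMP_INCL"):
        raise ValueError(f"unknown mode: {mode!r}")
    result = []
    for seq, start, end in self_regions:
        for t_seq, t_start, t_end in target_regions:
            if seq != t_seq or not (start < t_end and t_start < end):
                continue  # no overlap with this target region
            if mode == "OVERLAP":
                # Keep only the shared part of the two regions.
                result.append((seq, max(start, t_start), min(end, t_end)))
            elif mode == "ORIGINAL":
                # Keep the whole self region once it touches any target.
                result.append((seq, start, end))
                break
            elif t_start <= start and end <= t_end:
                # COMP_INCL: keep self regions fully inside a target region.
                result.append((seq, start, end))
                break
    return result


own = [("chr1", 10, 20), ("chr1", 40, 46)]
other = [("chr1", 17, 27), ("chr1", 60, 64)]
print(intersect_intervals(own, other, "OVERLAP"))
print(intersect_intervals(own, other, "ORIGINAL"))
```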

load(filename: str)

Load a BED file into the GRegions.

Parameters:

filename (str) – Path to the BED file

merge(by_name: bool = False, strandness: bool = False, inplace: bool = False)

Merge the regions within the GRegions object.

Parameters:
  • by_name (bool) – Define whether to merge regions by name. If True, only the regions with the same name are merged.

  • strandness (bool) – Define whether to merge the regions according to strandness.

  • inplace (bool) – Define whether this operation will be applied on the same object (True) or return a new object.

Returns:

None or a GRegions.

Return type:

GRegions

remove_duplicates(sort: bool = True)

Remove any duplicate regions (sorted, by default).

resize(extend_upstream: int, extend_downstream: int, center='mid_point', inplace=True)

Resize the regions according to the defined center and extension.

Parameters:
  • extend_upstream (int) – Define extension length toward upstream

  • extend_downstream (int) – Define extension length toward downstream

  • center (str, optional) – Define the new center, defaults to “mid_point”, other options are “5prime” and “3prime”

  • inplace (bool) – Define whether this operation will be applied on the same object (True) or return a new object.

Returns:

None or the resized GRegions

Return type:

GRegions

sampling(size: int, seed: int | None = None)

Return a random sample of the elements with the given sample size.

Parameters:
  • size (int) – Sampling number

  • seed (int, optional) – Seed for randomness, defaults to None

Returns:

Sampling regions

Return type:

GRegions

sort(key=None, reverse: bool = False)

Sort elements by criteria defined by a GRegion.

Parameters:
  • key (str) – The key used for comparison.

  • reverse (bool) – Reverse the sorting result.

split(ratio: float, size: int | None = None, seed: int | None = None)

Split the elements into two GRegions with the defined sizes.

Parameters:
  • ratio (float) – Define the ratio for splitting

  • size (int, optional) – Define the size of the first GRegions, defaults to None

  • seed (int, optional) – Seed for randomness, defaults to None

Returns:

Two GRegions

Return type:

GRegions
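
How ratio, size, and seed interact can be sketched with the standard random module. The function below is a hypothetical illustration: the names, the shuffling strategy, and the rule that size takes precedence over ratio when both are given are assumptions.

```python
import random


def split_regions(regions, ratio, size=None, seed=None):
    """Hypothetical sketch: split a list of regions into two random parts."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    shuffled = list(regions)   # copy so the input is left untouched
    rng.shuffle(shuffled)
    # Assumption: an explicit size overrides the ratio.
    first_size = size if size is not None else int(len(shuffled) * ratio)
    return shuffled[:first_size], shuffled[first_size:]


first, second = split_regions(list(range(8)), ratio=0.25, seed=1)
print(len(first), len(second))
```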

subtract(regions, whole_region: bool = False, merge: bool = True, exact: bool = False, inplace: bool = True)

Subtract regions from the self regions.

Parameters:
  • regions (GRegions) – The GRegions to subtract

  • whole_region (bool, defaults to False) – Subtract the whole region rather than only the overlapping part.

  • merge (bool, defaults to True) – Merge the regions before subtracting.

  • exact (bool, defaults to False) – Only regions which match a given region exactly are subtracted. If True, whole_region and merge are ignored, and the returned GRegions is sorted and does not contain duplicates.

  • inplace (bool, defaults to True) – Define whether this operation will be applied on the same object (True) or return a new object.

Returns:

Remaining regions of self after subtraction

Return type:

GRegions

self     ----------              ------
regions         ----------                    ----
Result   -------                 ------
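
The diagram corresponds to cutting the overlapping part out of each self region. The sketch below shows that idea, together with the whole_region variant, on (sequence, start, end) tuples. It is a hypothetical illustration under half-open coordinates and leaves out the merge and exact options.

```python
def subtract_intervals(self_regions, regions, whole_region=False):
    """Hypothetical sketch: remove the parts of self covered by regions."""
    result = []
    for seq, start, end in self_regions:
        pieces = [(start, end)]
        for r_seq, r_start, r_end in regions:
            if r_seq != seq:
                continue
            remaining = []
            for p_start, p_end in pieces:
                if r_end <= p_start or p_end <= r_start:
                    remaining.append((p_start, p_end))  # untouched piece
                    continue
                if whole_region:
                    continue  # any overlap drops the entire piece
                if p_start < r_start:
                    remaining.append((p_start, r_start))  # part on the left
                if r_end < p_end:
                    remaining.append((r_end, p_end))      # part on the right
            pieces = remaining
        result.extend((seq, s, e) for s, e in pieces)
    return result


own = [("chr1", 10, 20), ("chr1", 40, 46)]
other = [("chr1", 17, 27), ("chr1", 60, 64)]
print(subtract_intervals(own, other))
```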

total_coverage()

Return the total coverage (bp) of all the regions.

Returns:

Total coverage (bp)

Return type:

int
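
In the simplest reading, the coverage is the sum of the region lengths. The sketch below assumes half-open (sequence, start, end) tuples and counts overlapping bases once per region; whether the library merges overlapping regions before summing is not stated here, so treat this as an assumption.

```python
def total_length(regions):
    """Hypothetical sketch: sum of region lengths in bp (overlaps counted twice)."""
    return sum(end - start for _seq, start, end in regions)


print(total_length([("chr1", 0, 10), ("chr2", 5, 25)]))
```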

write(filename: str, data: bool = False)

Write a BED file.

Parameters:
  • filename (str) – Path to the BED file

  • data (bool) – Export extra data or not, defaults to False

class genomkit.regions.gregions_set.GRegionsSet(name: str = '', load_dict=None)

GRegionsSet module

This module contains functions and classes for working with a collection of multiple GRegions.

add(name: str, regions)

Add a GRegions object into this set.

Parameters:
  • name (str) – Define the name

  • regions (GRegions) – The GRegions to add

combine()

Return a GRegions by combining all regions.

Returns:

GRegions

Return type:

GRegions

count_overlaps(query_set, percentage: bool = False)

Return a pandas dataframe of the numbers of overlapping regions between the reference GRegionsSet (self) and the query GRegionsSet.

Parameters:
  • query_set (GRegionsSet) – Query GRegionsSet

  • percentage (bool, optional) – Convert the contingency table into percentage. The sum per row (reference) is 100%, defaults to False

Returns:

Matrix of numbers of overlaps

Return type:

dataframe
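
The shape of such a contingency table can be sketched without pandas, using nested dictionaries keyed by reference name and then query name. The function below is a hypothetical illustration: it counts the reference regions that overlap at least one query region, which is an assumption about how overlaps are counted, and with percentage=True it rescales each row to sum to 100.

```python
def count_overlap_table(reference_set, query_set, percentage=False):
    """Hypothetical sketch: both sets map a name to (sequence, start, end) tuples."""
    table = {}
    for ref_name, ref_regions in reference_set.items():
        row = {}
        for query_name, query_regions in query_set.items():
            # Count reference regions hitting at least one query region.
            row[query_name] = sum(
                1 for r in ref_regions
                if any(r[0] == q[0] and r[1] < q[2] and q[1] < r[2]
                       for q in query_regions)
            )
        if percentage:
            total = sum(row.values())
            # Rows without any overlap stay at 0 to avoid dividing by zero.
            row = {k: (100.0 * v / total if total else 0.0)
                   for k, v in row.items()}
        table[ref_name] = row
    return table


reference = {"promoters": [("chr1", 0, 10), ("chr1", 20, 30)]}
query = {"peaksA": [("chr1", 5, 8)], "peaksB": [("chr1", 0, 50)]}
print(count_overlap_table(reference, query))
```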

get_lengths()

Return a list with the number of regions in each GRegions.

Returns:

A list of region numbers

Return type:

list

get_names()

Return the names of all GRegions.

Returns:

Names

Return type:

list

load_pattern(pattern)

Load multiple BED files with a regex pattern.

Parameters:

pattern (str) – Regex pattern

subtract(regions, whole_region: bool = False, merge: bool = True, exact: bool = False)

Perform an in-place subtraction on every GRegions in the set.

Parameters:
  • regions (GRegions) – The GRegions to subtract

  • whole_region (bool, defaults to False) – Subtract the whole region rather than only the overlapping part.

  • merge (bool, defaults to True) – Merge the regions before subtracting.

  • exact (bool, defaults to False) – Only regions which match a given region exactly are subtracted. If True, whole_region and merge are ignored, and the returned GRegions is sorted and does not contain duplicates.