Genomic Regions
Genomic Regions Modules
These modules contain functions and classes for working with genomic regions. They provide utilities for handling and analyzing genomic coordinates, such as calculating overlaps, extracting sequences, and performing various genomic operations.
GRegion is a single region.
GRegions is a collection of many GRegion objects.
GRegionsSet is a set of many GRegions which represent different genomic elements.
- class genomkit.regions.gregion.GRegion(sequence: str, start: int, end: int, orientation: str = '.', name: str = '', score: float = 0, data: list = [])
GRegion module
This module contains functions and classes for working with a genomic region. It provides utilities for handling and analyzing a single genomic coordinate.
- distance(region)
Return the distance between two GRegions. If overlapping, return 0; if on different chromosomes, return None.
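The rule above can be sketched on plain tuples. This is an illustration only, not genomkit's implementation: the `Region` tuple and the `region_distance` helper are hypothetical names, and half-open coordinates (start inclusive, end exclusive, as in BED) are assumed.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    """Hypothetical stand-in for a GRegion: half-open interval [start, end)."""
    sequence: str
    start: int
    end: int


def region_distance(a: Region, b: Region) -> Optional[int]:
    """Return the gap in bp between two regions, 0 if they overlap or touch,
    or None if they lie on different sequences."""
    if a.sequence != b.sequence:
        return None
    # Positive when there is a gap, zero or negative when the regions touch or overlap.
    gap = max(a.start, b.start) - min(a.end, b.end)
    return max(gap, 0)


print(region_distance(Region("chr1", 100, 200), Region("chr1", 150, 300)))  # 0
print(region_distance(Region("chr1", 100, 200), Region("chr1", 250, 300)))  # 50
print(region_distance(Region("chr1", 100, 200), Region("chr2", 100, 200)))  # None
```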
- extend(upstream: int = 0, downstream: int = 0, strandness: bool = False, inplace: bool = True)
Extend GRegion region with the given extension length.
- Parameters:
upstream (int) – Define how many bp to extend toward upstream direction.
downstream (int) – Define how many bp to extend toward downstream direction.
strandness (bool) – Define whether strandness is considered.
inplace (bool) – Define whether this operation will be applied on the same object (True) or return a new object (False).
- Returns:
None or a GRegion object
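How strandness might change the direction of the extension can be sketched as follows. The function `extend_interval` is a hypothetical helper, not part of genomkit, and two behaviors are assumptions: on the '-' strand upstream points toward larger coordinates, and the start is clamped at 0.

```python
def extend_interval(start, end, orientation=".", upstream=0, downstream=0,
                    strandness=False):
    """Return the (start, end) pair extended by the given lengths in bp.

    Assumption: with strandness=True and orientation '-', upstream points
    toward larger coordinates, so the two lengths swap sides.
    """
    if strandness and orientation == "-":
        new_start, new_end = start - downstream, end + upstream
    else:
        new_start, new_end = start - upstream, end + downstream
    # Assumption: coordinates are clamped at 0 rather than going negative.
    return max(new_start, 0), new_end


print(extend_interval(1000, 2000, "+", upstream=100, downstream=50))
print(extend_interval(1000, 2000, "-", upstream=100, downstream=50,
                      strandness=True))
```

Negative values shrink the interval under the same arithmetic.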
- extend_fold(upstream: float = 0.0, downstream: float = 0.0, strandness: bool = False, inplace: bool = True)
Extend GRegion region with the given extension length in percentage.
- Parameters:
upstream (float) – Define the percentage of the region length to extend toward upstream direction.
downstream (float) – Define the percentage of the region length to extend toward downstream direction.
strandness (bool) – Define whether strandness is considered.
inplace (bool) – Define whether this operation will be applied on the same object (True) or return a new object (False).
- Returns:
None or a GRegion object
- overlap(region, strandness=False)
Return True, if GRegion overlaps with the given region, else False.
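A minimal sketch of such an overlap test on plain tuples is shown here. The helper `overlaps` is hypothetical; it assumes half-open coordinates and assumes that strandness=True additionally requires both regions to share the same orientation.

```python
def overlaps(a, b, strandness=False):
    """a and b are (sequence, start, end, orientation) tuples, half-open."""
    same_sequence = a[0] == b[0]
    # Two half-open intervals intersect when each starts before the other ends.
    intersecting = a[1] < b[2] and b[1] < a[2]
    # Assumption: strandness=True means the orientations must match.
    same_strand = a[3] == b[3] if strandness else True
    return same_sequence and intersecting and same_strand


plus = ("chr1", 100, 200, "+")
minus = ("chr1", 150, 250, "-")
print(overlaps(plus, minus))                   # True
print(overlaps(plus, minus, strandness=True))  # False
```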
- class genomkit.regions.gregions.GRegions(name: str = '', load: str = '')
GRegions module
This module contains functions and classes for working with a collection of genomic regions. It provides utilities for handling and analyzing the interactions of many genomic coordinates.
- add(region)
Append a GRegion at the end of the elements of GRegions.
- Parameters:
region (GRegion) – A GRegion
- close_regions(target, max_dis=10000)
Return a new GRegions containing the region(s) of target that are closest to any region in self. If the two intersect, return False.
- cluster(max_distance)
Cluster the regions with a certain distance and return a new GRegions.
- Parameters:
max_distance (int) – Maximal distance for combining
- Returns:
Combined regions
- Return type:
GRegions
self          ---- ---- ---- ---- ---- ----
Result(d=1)   ------- --------- ---- ----
Result(d=10)  ---------------------------------------------
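Clustering by distance can be sketched as a sort followed by a single sweep. The helper `cluster_intervals` below is hypothetical and works on plain tuples; treating a gap equal to max_distance as close enough to combine is an assumption.

```python
def cluster_intervals(intervals, max_distance):
    """Combine (sequence, start, end) tuples whose gap is <= max_distance."""
    clustered = []
    for seq, start, end in sorted(intervals):
        previous = clustered[-1] if clustered else None
        if previous and previous[0] == seq and start - previous[2] <= max_distance:
            # Close enough to the previous cluster: grow it.
            clustered[-1] = (seq, previous[1], max(previous[2], end))
        else:
            clustered.append((seq, start, end))
    return clustered


regions = [("chr1", 0, 10), ("chr1", 11, 20), ("chr1", 40, 50)]
print(cluster_intervals(regions, 1))   # [('chr1', 0, 20), ('chr1', 40, 50)]
print(cluster_intervals(regions, 20))  # [('chr1', 0, 50)]
```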
- extend(upstream: int = 0, downstream: int = 0, strandness: bool = False, inplace: bool = True)
Perform the extend step for every element. The extension length can also be negative, which shrinks the regions.
- Parameters:
upstream (int) – Define how many bp to extend toward upstream direction.
downstream (int) – Define how many bp to extend toward downstream direction.
strandness (bool) – Define whether strandness is considered.
inplace (bool) – Define whether this operation will be applied on the same object (True) or return a new object.
- Returns:
None or a GRegions object
- extend_fold(upstream: float = 0.0, downstream: float = 0.0, strandness: bool = False, inplace: bool = True)
Perform the extend_fold step for every element. The extension percentage can also be negative, which shrinks the regions.
- Parameters:
upstream (float) – Define the percentage of the region length to extend toward upstream direction.
downstream (float) – Define the percentage of the region length to extend toward downstream direction.
strandness (bool) – Define whether strandness is considered.
inplace (bool) – Define whether this operation will be applied on the same object (True) or return a new object.
- Returns:
None
- filter_by_names(names, inplace=False)
Filter the elements by the given list of names
- filter_by_score(larger_than=None, smaller_than=None, inplace=False)
Filter the elements by the given range of scores
- Parameters:
larger_than (float) – Define the minimal cutoff. Only regions with a score larger than this value are kept.
smaller_than (float) – Define the maximal cutoff. Only regions with a score smaller than this value are kept.
inplace (bool, default to False) – Define whether this operation will be applied on the same object (True) or return a new object.
- Returns:
A GRegions with filtered regions
- Return type:
GRegions
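The two cutoffs can be read as an open score range. The sketch below uses a hypothetical helper, `filter_scores`, on (name, score) pairs; strict comparisons are assumed from the words "larger than" and "smaller than".

```python
def filter_scores(regions, larger_than=None, smaller_than=None):
    """Keep (name, score) pairs whose score lies inside the given open range."""
    kept = []
    for name, score in regions:
        if larger_than is not None and not score > larger_than:
            continue
        if smaller_than is not None and not score < smaller_than:
            continue
        kept.append((name, score))
    return kept


scored = [("a", 1.0), ("b", 5.0), ("c", 9.0)]
print(filter_scores(scored, larger_than=2))                  # [('b', 5.0), ('c', 9.0)]
print(filter_scores(scored, larger_than=2, smaller_than=6))  # [('b', 5.0)]
```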
- get_GSequences(FASTA_file)
Return a GSequences object according to the loci on the given reference FASTA file.
- Parameters:
FASTA_file (str) – Path to the FASTA file
- Returns:
A GSequences.
- Return type:
GSequences
- get_names(unique: bool = False)
Return a list of all region names. If a name is None, the region string is returned instead.
- Returns:
A list of all regions’ names.
- Return type:
list
- intersect(target, mode: str = 'OVERLAP', rm_duplicates: bool = False)
Return a GRegions for the intersections between the two given GRegions objects. There are three modes for overlapping:
mode = “OVERLAP”
Return a new GRegions including only the overlapping regions with target GRegions.
Note
It will merge the regions.
self   ---------- ------
target ---------- ----
Result ---
mode = “ORIGINAL”
Return the regions of the original GRegions (self) which have any intersection with the target GRegions.
self   ---------- ------
target ---------- ----
Result ----------
mode = “COMP_INCL”
Return the region(s) of the original GRegions (self) which are completely included in the target GRegions.
self   ------------- ------
target ---------- --------------- ----
Result ------
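The three modes can be sketched on (start, end) tuples from a single sequence. The functions `merge_sorted` and `intersect_intervals` are hypothetical helpers, not genomkit's implementation; half-open coordinates are assumed, and touching pieces are merged in OVERLAP mode.

```python
def merge_sorted(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect_intervals(self_regions, target_regions, mode="OVERLAP"):
    """Sketch of the three intersection modes on half-open intervals."""
    if mode == "OVERLAP":
        # Only the shared parts, merged afterwards.
        pieces = [
            (max(s, ts), min(e, te))
            for s, e in self_regions
            for ts, te in target_regions
            if max(s, ts) < min(e, te)
        ]
        return merge_sorted(pieces)
    if mode == "ORIGINAL":
        # Whole self regions that touch any target region.
        return [
            (s, e) for s, e in self_regions
            if any(s < te and ts < e for ts, te in target_regions)
        ]
    if mode == "COMP_INCL":
        # Whole self regions that lie entirely inside one target region.
        return [
            (s, e) for s, e in self_regions
            if any(ts <= s and e <= te for ts, te in target_regions)
        ]
    raise ValueError(f"unknown mode: {mode}")


own = [(0, 10), (20, 26)]
target = [(5, 15), (18, 30)]
print(intersect_intervals(own, target, "OVERLAP"))    # [(5, 10), (20, 26)]
print(intersect_intervals(own, target, "ORIGINAL"))   # [(0, 10), (20, 26)]
print(intersect_intervals(own, target, "COMP_INCL"))  # [(20, 26)]
```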
- load(filename: str)
Load a BED file into the GRegions.
- Parameters:
filename (str) – Path to the BED file
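For orientation, the sketch below shows how the first six BED columns (chrom, start, end, name, score, strand) map onto the GRegion constructor arguments listed above. The parser `parse_bed` is a hypothetical standalone helper, not the library's loader; it assumes tab-separated fields and a numeric score column.

```python
import io


def parse_bed(handle):
    """Parse BED lines into dictionaries keyed like the GRegion arguments."""
    regions = []
    for line in handle:
        line = line.strip()
        # Skip blank lines, comments, and UCSC track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        regions.append({
            "sequence": fields[0],
            "start": int(fields[1]),
            "end": int(fields[2]),
            "name": fields[3] if len(fields) > 3 else "",
            # Assumption: the score column, when present, is numeric.
            "score": float(fields[4]) if len(fields) > 4 else 0.0,
            "orientation": fields[5] if len(fields) > 5 else ".",
        })
    return regions


bed_text = "track name=demo\nchr1\t100\t200\tpeak1\t5.5\t+\nchr2\t300\t400\n"
for region in parse_bed(io.StringIO(bed_text)):
    print(region)
```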
- merge(by_name: bool = False, strandness: bool = False, inplace: bool = False)
Merge the regions within the GRegions object.
- Parameters:
by_name (bool) – Define whether to merge regions by name. If True, only the regions with the same name are merged.
strandness (bool) – Define whether to merge the regions according to strandness.
inplace (bool) – Define whether this operation will be applied on the same object (True) or return a new object.
- Returns:
None or a GRegions.
- Return type:
GRegions
- resize(extend_upstream: int, extend_downstream: int, center='mid_point', inplace=True)
Resize the regions according to the defined center and extension.
- Parameters:
extend_upstream (int) – Define extension length toward upstream
extend_downstream (int) – Define extension length toward downstream
center (str, optional) – Define the new center, defaults to “mid_point”, other options are “5prime” and “3prime”
inplace (bool) – Define whether this operation will be applied on the same object (True) or return a new object.
- Returns:
A resized GRegions
- Return type:
GRegions
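One possible reading of the center options is sketched below. The function `resize_interval` is hypothetical, and the semantics are assumptions: the new region spans from extend_upstream before the chosen center to extend_downstream after it, and "5prime" and "3prime" follow the orientation of the region.

```python
def resize_interval(start, end, orientation, extend_upstream, extend_downstream,
                    center="mid_point"):
    """Return a new (start, end) pair placed around the chosen center."""
    if center == "mid_point":
        anchor = (start + end) // 2
    elif center == "5prime":
        anchor = end if orientation == "-" else start
    elif center == "3prime":
        anchor = start if orientation == "-" else end
    else:
        raise ValueError(f"unknown center: {center}")
    # Assumption: on the '-' strand, upstream points toward larger coordinates.
    if orientation == "-":
        new_start, new_end = anchor - extend_downstream, anchor + extend_upstream
    else:
        new_start, new_end = anchor - extend_upstream, anchor + extend_downstream
    return max(new_start, 0), new_end


print(resize_interval(1000, 2000, "+", 100, 200))            # (1400, 1700)
print(resize_interval(1000, 2000, "+", 100, 200, "5prime"))  # (900, 1200)
print(resize_interval(1000, 2000, "-", 100, 200, "5prime"))  # (1800, 2100)
```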
- sampling(size: int, seed: int | None = None)
Return a random sample of the elements, where size is the number of elements to draw.
- split(ratio: float, size: int | None = None, seed: int | None = None)
Split the elements into two GRegions with the defined sizes.
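A seeded, ratio-based split can be sketched with the standard library. The helper `split_items` is hypothetical; it assumes that ratio is the fraction of elements placed in the first part, and it leaves out the size parameter because the source does not describe it.

```python
import random


def split_items(items, ratio, seed=None):
    """Shuffle a copy of items reproducibly and cut it at the given ratio."""
    rng = random.Random(seed)  # a private generator keeps the global state untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


first, second = split_items(range(10), ratio=0.5, seed=42)
print(len(first), len(second))  # 5 5
```

Passing the same seed yields the same two parts, which is the usual reason for exposing a seed argument.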
- subtract(regions, whole_region: bool = False, merge: bool = True, exact: bool = False, inplace: bool = True)
Subtract regions from the self regions.
- Parameters:
regions (GRegions) – The GRegions to subtract from self
whole_region (bool, default to False) – Subtract the whole region rather than only the overlapping part
merge (bool, default to True) – Merge the regions before subtracting
exact (bool, default to False) – Only regions which match exactly with a given region are subtracted. If True, whole_region and merge are completely ignored and the returned GRegions is sorted and does not contain duplicates.
inplace (bool, default to True) – Define whether this operation will be applied on the same object (True) or return a new object.
- Returns:
Remaining regions of self after subtraction
- Return type:
GRegions
self    ---------- ------
regions ---------- ----
Result  ------- ------
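Partial subtraction, the default behavior described above, can be sketched on (start, end) tuples from one sequence. The helper `subtract_intervals` is hypothetical and does not model the whole_region, merge, or exact options; half-open coordinates are assumed.

```python
def subtract_intervals(self_regions, regions):
    """Remove the parts of self_regions that are covered by regions."""
    remaining = []
    subtractors = sorted(regions)
    for start, end in sorted(self_regions):
        cursor = start  # left edge of the part not yet examined
        for r_start, r_end in subtractors:
            if r_end <= cursor or r_start >= end:
                continue  # no overlap with the unexamined part
            if r_start > cursor:
                remaining.append((cursor, r_start))
            cursor = max(cursor, r_end)
            if cursor >= end:
                break
        if cursor < end:
            remaining.append((cursor, end))
    return remaining


print(subtract_intervals([(0, 10), (20, 30)], [(5, 25)]))      # [(0, 5), (25, 30)]
print(subtract_intervals([(0, 100)], [(10, 20), (30, 40)]))    # [(0, 10), (20, 30), (40, 100)]
```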
- total_coverage()
Return the total coverage (bp) of all the regions.
- Returns:
Total coverage (bp)
- Return type:
int
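The source does not say whether bases shared by overlapping regions are counted once or several times. The sketch below shows both readings with two hypothetical helpers on (start, end) tuples, assuming half-open coordinates.

```python
def total_length(intervals):
    """Sum of interval lengths; overlapping bases are counted more than once."""
    return sum(end - start for start, end in intervals)


def covered_bases(intervals):
    """Number of distinct positions covered; overlapping bases are counted once."""
    covered = 0
    current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start >= current_end:
            covered += end - start
            current_end = end
        elif end > current_end:
            # Only the part extending past the previous interval is new.
            covered += end - current_end
            current_end = end
    return covered


intervals = [(0, 10), (5, 15), (20, 25)]
print(total_length(intervals))   # 25
print(covered_bases(intervals))  # 20
```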
- class genomkit.regions.gregions_set.GRegionsSet(name: str = '', load_dict=None)
GRegionsSet module
This module contains functions and classes for working with a collection of multiple GRegions.
- count_overlaps(query_set, percentage: bool = False)
Return a pandas dataframe of the numbers of overlapping regions between the reference GRegionsSet (self) and the query GRegionsSet.
- Parameters:
query_set (GRegionsSet) – Query GRegionsSet
percentage (bool, optional) – Convert the contingency table into percentage. The sum per row (reference) is 100%, defaults to False
- Returns:
Matrix of numbers of overlaps
- Return type:
dataframe
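The shape of such a contingency table can be sketched with nested dictionaries in place of a pandas dataframe. The helper `overlap_table` is hypothetical, and it assumes that each cell counts the reference regions overlapping at least one query region; the library may count differently.

```python
def overlap_table(reference_set, query_set, percentage=False):
    """Both arguments map a label to a list of half-open (start, end) intervals."""
    table = {}
    for ref_name, ref_regions in reference_set.items():
        row = {}
        for query_name, query_regions in query_set.items():
            row[query_name] = sum(
                1 for s, e in ref_regions
                if any(s < qe and qs < e for qs, qe in query_regions)
            )
        if percentage:
            # Normalize so that each reference row sums to 100.
            total = sum(row.values())
            row = {k: (100.0 * v / total if total else 0.0) for k, v in row.items()}
        table[ref_name] = row
    return table


reference = {"promoters": [(0, 10), (20, 30), (40, 50)], "enhancers": [(100, 110)]}
query = {"peaksA": [(5, 25)], "peaksB": [(45, 105)]}
print(overlap_table(reference, query))
print(overlap_table(reference, query, percentage=True))
```

The outer keys correspond to the rows (reference) and the inner keys to the columns (query) of the dataframe described above.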
- get_lengths()
Return a list of the number of regions in each GRegions.
- Returns:
A list of region counts
- Return type:
list
- load_pattern(pattern)
Load multiple BED files with a regex pattern.
- Parameters:
pattern (str) – Regex pattern
- subtract(regions, whole_region: bool = False, merge: bool = True, exact: bool = False)
Perform an in-place subtraction on every GRegions in the set.
- Parameters:
regions (GRegions) – The GRegions to subtract from each GRegions in the set
whole_region (bool, default to False) – Subtract the whole region rather than only the overlapping part
merge (bool, default to True) – Merge the regions before subtracting
exact (bool, default to False) – Only regions which match exactly with a given region are subtracted. If True, whole_region and merge are completely ignored and the returned GRegions is sorted and does not contain duplicates.