Starting with a GTF file
Note
These tutorials are still under active development.
Because GAnnotation is able to handle both GTF and GFF, you can replace the GTF file in the tutorials below with the GFF file. Here we show only GTFs as examples.
Get all the genes from a GTF file
GTF_hg38 is the path to the hg38 GTF file for annotation.
from genomkit import GAnnotation
gtf = GAnnotation(file_path=GTF_hg38, file_format="gtf")
genes = gtf.get_regions(element_type="gene")
genes.write(filename="hg38_genes.bed")
Extract exon, intron, and intergenetic regions in BED format from a GTF file
GTF_hg38 is the path to the hg38 GTF file for annotation. Now you want to generate 3 BED files as below:
hg38_exons.bed
hg38_introns.bed
hg38_intergenic_regions.bed
from genomkit import GRegions
from genomkit import GAnnotation
gtf = GAnnotation(file_path=GTF_hg38, file_format="gtf")
genes = gtf.get_regions(element_type="gene")
exons = gtf.get_regions(element_type="exon")
introns = genes.subtract(exons, inplace=False)
chromosomes = GRegions(name="chromosomes")
chromosomes.get_chromosomes(organism="hg38")
intergenic_regions = chromosomes.subtract(genes, inplace=False)
exons.write(filename="hg38_exons.bed")
introns.write(filename="hg38_introns.bed")
intergenic_regions.write(filename="hg38_intergenic_regions.bed")
Get all promoter regions in BED format from a GTF file
GTF_hg38 is the path to the hg38 GTF file for annotation. Now you want to generate 3 BED files as below:
from genomkit import GAnnotation
gtf = GAnnotation(file_path=GTF_hg38, file_format="gtf")
genes = gtf.get_regions(element_type="gene")
promoters = genes.resize(extend_upstream=2000,
extend_downstream=0,
center="5prime", inplace=False)
promoters.write(filename="hg38_promoters.bed")
Extract the genes by their biotypes from a GTF file
GTF_hg38 is the path to the hg38 GTF file for annotation. Now you want to generate BED files for the biotypes below:
protein_coding
lncRNA
snRNA
miRNA
from genomkit import GAnnotation
gtf = GAnnotation(file_path=GTF_hg38, file_format="gtf")
target_biotypes = ["protein_coding", "lncRNA", "snRNA", "miRNA"]
for biotype in target_biotypes:
genes = gtf.get_regions(element_type="gene",
attribute="gene_type", value=biotype)
genes.write(filename="hg38_genes_"+biotype+".bed")