Why I recommend preserve ENSGID in data analysis

I am currently using Seurat for my scRNA-seq and snRNA-seq analysis (10X), and everything seems to be working fine. However, when I tried to create a Scanpy object from a Cellbender-corrected h5 result file, similar to how I create objects in Seurat, I stumbled upon an issue where there are duplicated gene names. I'm unsure about the root cause of this situation.

with h5py.File(file_path, 'r') as f:

    gene_names = [x.decode('utf-8') for x in f['matrix']['features']['name'][:]]
    gene_ids = [x.decode('utf-8') for x in f['matrix']['features']['id'][:]]
    

    expression_matrix = f['matrix']['data'][:]
    indices = f['matrix']['indices'][:]
    indptr = f['matrix']['indptr'][:]
    shape = f['matrix']['shape'][:]
    

    from scipy.sparse import csc_matrix
    expression_matrix = csc_matrix((expression_matrix, indices, indptr), shape=shape)

# Check duplicated gene names in detail
duplicated_genes = pd.Series(gene_names)[pd.Series(gene_names).duplicated(keep=False)]
print(f"Total number of genes: {len(gene_names)}")
print(f"Number of duplicated gene names: {len(duplicated_genes.unique())}")
print("\nDuplicated gene names and their counts:")
print(duplicated_genes.value_counts().head())

Total number of genes: 55487
Number of duplicated gene names: 114

Duplicated gene names and their counts:
Ccl27a     3
Ccl21b     3
Il11ra2    3
Gm35558    3
Pakap      3
Name: count, dtype: int64

After extracting the duplicated genes, everything became clear. The issue was caused by a single gene name corresponding to multiple ENSGIDs.

Subsequently, I noticed that during the creation of objects in Seurat, these duplicated genes were appended with suffixes like ".1".

Considering that I will need to perform integration and enrichment analysis using ENSGIDs later, I believe it is better to retain ENSGIDs throughout the analysis. Ideally, both gene names and ENSGIDs should be preserved. It seems that in Python/Scanpy, this can be achieved during object creation using the code below. I am now looking for a solution to achieve the same in R/Seurat.

adata = anndata.AnnData(X=expression_matrix.T)
adata.var_names = gene_names
adata.var['gene_ids'] = gene_ids

Why I recommend preserve ENSGID in data analysis

GetNF | The Easy Way to Install Nerd Fonts

qs | A Faster and More Efficient Alternative to RDS in R

Comments NOTHING

Cancel Reply