Compass Pseudobulking#
The runtime of Compass may become prohibitively long for large-scale single-cell RNA-seq datasets. To mitigate this issue, we suggest that users perform pseudobulking on the data before running Compass. This notebook demonstrates how pseudobulking can be performed and runs Compass on pseudobulked data.
The dataset used in this tutorial is from Smillie, C. S. et al. Intra- and inter-cellular rewiring of the human colon during ulcerative colitis. Cell 178, 714–730.e22 (2019) (link). You can download the expression data from the Single Cell Portal.
The files you need are:
Metadata
all.meta2.txt
Epithelial data
Epi.genes.tsv
Epi.barcodes2.tsv
gene_sorted-Epi.matrix.mtx
Stromal data
Fib.genes.tsv
Fib.barcodes2.tsv
gene_sorted-Fib.matrix.mtx
Immune data
Imm.genes.tsv
Imm.barcodes2.tsv
gene_sorted-Imm.matrix.mtx
Also download the cell_subsets.txt file from the GitHub repo provided by this study.
Place the aforementioned files in the same directory as this notebook.
Requirements#
All of the required python packages can be installed using a cell at the start of the notebook. To install the required python packages, you can uncomment the “install_reqs()” call.
def install_reqs():
!pip install pandas
!pip install matplotlib
!pip install numpy
!pip install scipy
!pip install anndata
!pip install scanpy
#install_reqs()
Pseudobulking#
This section of the tutorial will demonstrate how to perform pseudobulking on a large scRNA-seq dataset. If you already have an AnnData object that you can process directly, you may move on to this section.
import pandas as pd
import numpy as np
from scipy.io import mmread
from scipy import sparse
from scipy.sparse import csr_matrix, vstack
import os
import matplotlib.pyplot as plt
import scipy.stats
import anndata as ad
import scanpy as sc
DATA_DIR = '/data/yosef2/scratch/users/charleschien101/ibd_data'
We read in the metadata file here. We will be performing pseudobulking based on the Health, Sample, and Cluster columns.
meta = pd.read_csv(os.path.join(DATA_DIR, 'all.meta2.txt'), sep='\t')
meta
| NAME | Cluster | nGene | nUMI | Subject | Health | Location | Sample | |
|---|---|---|---|---|---|---|---|---|
| 0 | TYPE | group | numeric | numeric | group | group | group | group |
| 1 | N7.EpiA.AAACATACACACTG | TA 1 | 328 | 891 | N7 | Non-inflamed | Epi | N7.EpiA |
| 2 | N7.EpiA.AAACCGTGCATCAG | TA 1 | 257 | 663 | N7 | Non-inflamed | Epi | N7.EpiA |
| 3 | N7.EpiA.AAACGCACAATCGC | TA 2 | 300 | 639 | N7 | Non-inflamed | Epi | N7.EpiA |
| 4 | N7.EpiA.AAAGATCTAACCGT | Enterocyte Progenitors | 250 | 649 | N7 | Non-inflamed | Epi | N7.EpiA |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 365488 | N110.LPB.TTTGGTTAGGATGGTC | Macrophages | 635 | 1366 | N110 | Inflamed | LP | N110.LPB |
| 365489 | N110.LPB.TTTGGTTCACCTCGTT | Plasma | 610 | 2730 | N110 | Inflamed | LP | N110.LPB |
| 365490 | N110.LPB.TTTGGTTTCGGAAACG | Macrophages | 859 | 1979 | N110 | Inflamed | LP | N110.LPB |
| 365491 | N110.LPB.TTTGTCAGTTGACGTT | Macrophages | 965 | 2696 | N110 | Inflamed | LP | N110.LPB |
| 365492 | N110.LPB.TTTGTCATCTGACCTC | CD69+ Mast | 559 | 1156 | N110 | Inflamed | LP | N110.LPB |
365493 rows × 8 columns
The Cluster column corresponds to various cell types. Since the cell types provided in the Cluster column is rather fine-grained, we used the more coarse-grained cell types provided in the cell_subsets.txt file.
cell_subsets = pd.read_csv(os.path.join(DATA_DIR, 'cell_subsets.txt'), sep='\t', names=['Cluster', 'CellType'], header=None)
cell_subsets.head()
| Cluster | CellType | |
|---|---|---|
| 0 | Stem | Epithelial |
| 1 | TA 1 | Epithelial |
| 2 | TA 2 | Epithelial |
| 3 | Cycling TA | Epithelial |
| 4 | Immature Enterocytes 1 | Epithelial |
The dataset used in this notebook separately stores expression data for epithelial cells, stromal cells, and immune cells. Here we read in the barcodes (cell names), gene names, and expression matrix.
epi_barcodes = pd.read_csv(os.path.join(DATA_DIR, 'Epi.barcodes2.tsv'), sep='\t', names=['barcodes'], header=None)
fib_barcodes = pd.read_csv(os.path.join(DATA_DIR, 'Fib.barcodes2.tsv'), sep='\t', names=['barcodes'], header=None)
imm_barcodes = pd.read_csv(os.path.join(DATA_DIR, 'Imm.barcodes2.tsv'), sep='\t', names=['barcodes'], header=None)
epi_genes = pd.read_csv(os.path.join(DATA_DIR, 'Epi.genes.tsv'), sep='\t', names=['genes'], header=None)
fib_genes = pd.read_csv(os.path.join(DATA_DIR, 'Fib.genes.tsv'), sep='\t', names=['genes'], header=None)
imm_genes = pd.read_csv(os.path.join(DATA_DIR, 'Imm.genes.tsv'), sep='\t', names=['genes'], header=None)
epi_matrix = (mmread(os.path.join(DATA_DIR, 'gene_sorted-Epi.matrix.mtx')))
epi_matrix
<20028x123006 sparse matrix of type '<class 'numpy.int64'>'
with 174423911 stored elements in COOrdinate format>
fib_matrix = (mmread(os.path.join(DATA_DIR, 'gene_sorted-Fib.matrix.mtx')))
fib_matrix
<19076x31872 sparse matrix of type '<class 'numpy.int64'>'
with 39087919 stored elements in COOrdinate format>
imm_matrix = (mmread(os.path.join(DATA_DIR, 'gene_sorted-Imm.matrix.mtx')))
imm_matrix
<20529x210614 sparse matrix of type '<class 'numpy.int64'>'
with 173255714 stored elements in COOrdinate format>
Since the three expression matrices contain different numbers of genes, we need to fill in the missing gene expression data (with zeros) to be able to integrate the three datasets together. We also create AnnData objects from the expression data for easier data manipulation. If you are not familiar with AnnData objects, you can refer to the AnnData documentation.
all_genes = list(set.union(set(epi_genes['genes']), set(fib_genes['genes']), set(imm_genes['genes'])))
len(all_genes)
21784
epi_matrix_zeros = csr_matrix(np.zeros([len(all_genes) - epi_matrix.shape[0], epi_matrix.shape[1]]), dtype=np.int64)
epi_matrix_combined = vstack((epi_matrix, epi_matrix_zeros))
epi_matrix_combined
<21784x123006 sparse matrix of type '<class 'numpy.int64'>'
with 174423911 stored elements in COOrdinate format>
epi_remaining_genes = list(set(all_genes) - set(epi_genes['genes']))
epi_adata = ad.AnnData(epi_matrix_combined.tocsr().transpose())
epi_adata.obs_names = list(epi_barcodes['barcodes'])
epi_adata.var_names = list(epi_genes['genes']) + epi_remaining_genes
epi_adata = epi_adata[:, epi_adata.var_names.sort_values()]
epi_adata
View of AnnData object with n_obs × n_vars = 123006 × 21784
fib_matrix_zeros = csr_matrix(np.zeros([len(all_genes) - fib_matrix.shape[0], fib_matrix.shape[1]]), dtype=np.int64)
fib_matrix_combined = vstack((fib_matrix, fib_matrix_zeros))
fib_matrix_combined
<21784x31872 sparse matrix of type '<class 'numpy.int64'>'
with 39087919 stored elements in COOrdinate format>
fib_remaining_genes = list(set(all_genes) - set(fib_genes['genes']))
fib_adata = ad.AnnData(fib_matrix_combined.tocsr().transpose())
fib_adata.obs_names = list(fib_barcodes['barcodes'])
fib_adata.var_names = list(fib_genes['genes']) + fib_remaining_genes
fib_adata = fib_adata[:, fib_adata.var_names.sort_values()]
fib_adata
View of AnnData object with n_obs × n_vars = 31872 × 21784
imm_matrix_zeros = csr_matrix(np.zeros([len(all_genes) - imm_matrix.shape[0], imm_matrix.shape[1]]), dtype=np.int64)
imm_matrix_combined = vstack((imm_matrix, imm_matrix_zeros))
imm_matrix_combined
<21784x210614 sparse matrix of type '<class 'numpy.int64'>'
with 173255714 stored elements in COOrdinate format>
imm_remaining_genes = list(set(all_genes) - set(imm_genes['genes']))
imm_adata = ad.AnnData(imm_matrix_combined.tocsr().transpose())
imm_adata.obs_names = list(imm_barcodes['barcodes'])
imm_adata.var_names = list(imm_genes['genes']) + imm_remaining_genes
imm_adata = imm_adata[:, imm_adata.var_names.sort_values()]
imm_adata
View of AnnData object with n_obs × n_vars = 210614 × 21784
epi_var_names = list(epi_adata.var_names)
fib_var_names = list(fib_adata.var_names)
imm_var_names = list(imm_adata.var_names)
assert epi_var_names == fib_var_names
assert epi_var_names == imm_var_names
After ensuring that all three expression matrices contain the same genes, we concatenate the three AnnData objects together.
adata = ad.concat([epi_adata, fib_adata, imm_adata])
adata
AnnData object with n_obs × n_vars = 365492 × 21784
We confirm that the cells in the AnnData object are arranged in the same order as in the metadata file. This allows us to easily integrate metadata into the AnnData object.
assert list(adata.obs_names) == list(meta['NAME'][1:])
adata.obs = meta.iloc[1:, ][['NAME', 'Cluster', 'Subject', 'Health', 'Location', 'Sample']].reset_index(drop=True)
adata.obs
| NAME | Cluster | Subject | Health | Location | Sample | |
|---|---|---|---|---|---|---|
| 0 | N7.EpiA.AAACATACACACTG | TA 1 | N7 | Non-inflamed | Epi | N7.EpiA |
| 1 | N7.EpiA.AAACCGTGCATCAG | TA 1 | N7 | Non-inflamed | Epi | N7.EpiA |
| 2 | N7.EpiA.AAACGCACAATCGC | TA 2 | N7 | Non-inflamed | Epi | N7.EpiA |
| 3 | N7.EpiA.AAAGATCTAACCGT | Enterocyte Progenitors | N7 | Non-inflamed | Epi | N7.EpiA |
| 4 | N7.EpiA.AAAGATCTAGGCGA | Enterocyte Progenitors | N7 | Non-inflamed | Epi | N7.EpiA |
| ... | ... | ... | ... | ... | ... | ... |
| 365487 | N110.LPB.TTTGGTTAGGATGGTC | Macrophages | N110 | Inflamed | LP | N110.LPB |
| 365488 | N110.LPB.TTTGGTTCACCTCGTT | Plasma | N110 | Inflamed | LP | N110.LPB |
| 365489 | N110.LPB.TTTGGTTTCGGAAACG | Macrophages | N110 | Inflamed | LP | N110.LPB |
| 365490 | N110.LPB.TTTGTCAGTTGACGTT | Macrophages | N110 | Inflamed | LP | N110.LPB |
| 365491 | N110.LPB.TTTGTCATCTGACCTC | CD69+ Mast | N110 | Inflamed | LP | N110.LPB |
365492 rows × 6 columns
We also add another CellType column that maps the fine-grained cell types to coarse-grained cell types used for pseudobulking.
adata.obs = pd.merge(adata.obs, cell_subsets, on='Cluster')
adata.obs
| NAME | Cluster | Subject | Health | Location | Sample | CellType | |
|---|---|---|---|---|---|---|---|
| 0 | N7.EpiA.AAACATACACACTG | TA 1 | N7 | Non-inflamed | Epi | N7.EpiA | Epithelial |
| 1 | N7.EpiA.AAACCGTGCATCAG | TA 1 | N7 | Non-inflamed | Epi | N7.EpiA | Epithelial |
| 2 | N7.EpiA.AAACGCACAATCGC | TA 2 | N7 | Non-inflamed | Epi | N7.EpiA | Epithelial |
| 3 | N7.EpiA.AAAGATCTAACCGT | Enterocyte Progenitors | N7 | Non-inflamed | Epi | N7.EpiA | Epithelial |
| 4 | N7.EpiA.AAAGATCTAGGCGA | Enterocyte Progenitors | N7 | Non-inflamed | Epi | N7.EpiA | Epithelial |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 365487 | N110.LPB.TTTGGTTAGGATGGTC | Macrophages | N110 | Inflamed | LP | N110.LPB | Myeloid |
| 365488 | N110.LPB.TTTGGTTCACCTCGTT | Plasma | N110 | Inflamed | LP | N110.LPB | B_cells |
| 365489 | N110.LPB.TTTGGTTTCGGAAACG | Macrophages | N110 | Inflamed | LP | N110.LPB | Myeloid |
| 365490 | N110.LPB.TTTGTCAGTTGACGTT | Macrophages | N110 | Inflamed | LP | N110.LPB | Myeloid |
| 365491 | N110.LPB.TTTGTCATCTGACCTC | CD69+ Mast | N110 | Inflamed | LP | N110.LPB | Myeloid |
365492 rows × 7 columns
for i in range(len(adata.obs)):
assert adata.obs.iloc[i, :]['Subject'] in adata.obs.iloc[i, :]['Sample']
Since the samples are named according to subject, we can perform pseudobulking directly on cell types, samples and health conditions without having to separately group expression data by subject.
Note that in this tutorial we are performing pseudobulking on coarse-grained cell types, but depending on the experiment conducted and the hypotheses you would like to test, you might want to consider subsetting the expression data of certain fine-grained cell types and performing pseudobulking on them.
health_conditions = list(adata.obs['Health'].value_counts().index)
samples = list(adata.obs['Sample'].value_counts().index)
cell_types = list(adata.obs['CellType'].value_counts().index)
pseudobulk_dict = {}
for sample in samples:
for health_condition in health_conditions:
for cell_type in cell_types:
adata_subset = adata[(adata.obs['Sample'] == sample) & (adata.obs['Health'] == health_condition) & (adata.obs['CellType'] == cell_type)]
if len(adata_subset) == 0:
continue
gene_expression = adata_subset.X
summed_gene_expression = gene_expression.sum(axis=0)
pseudobulk_dict[f'{sample}_{health_condition}_{cell_type}'] = np.asarray(summed_gene_expression).reshape(-1)
pseudobulk_data = pd.DataFrame.from_dict(pseudobulk_dict)
pseudobulk_data.index = adata.var_names
pseudobulk_data
| N58.LPB1_Inflamed_Epithelial | N58.LPB1_Inflamed_B_cells | N58.LPB1_Inflamed_T_cells | N58.LPB1_Inflamed_Myeloid | N58.LPB1_Inflamed_Fibroblasts | N58.LPB1_Inflamed_Endothelial | N58.LPB1_Inflamed_Glia | N111.LPB1_Inflamed_Epithelial | N111.LPB1_Inflamed_B_cells | N111.LPB1_Inflamed_T_cells | ... | N52.EpiA2b_Non-inflamed_Epithelial | N52.EpiA2b_Non-inflamed_T_cells | N52.EpiA2b_Non-inflamed_Myeloid | N52.EpiA2a_Non-inflamed_Epithelial | N52.EpiA2a_Non-inflamed_T_cells | N52.EpiA2a_Non-inflamed_Myeloid | N58.EpiB2_Inflamed_Epithelial | N58.EpiB2_Inflamed_T_cells | N49.EpiA_Non-inflamed_Epithelial | N49.EpiA_Non-inflamed_Myeloid | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7SK | 5 | 55 | 2 | 3 | 1 | 1 | 0 | 0 | 71 | 4 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A1BG | 2 | 13 | 0 | 1 | 1 | 1 | 0 | 0 | 12 | 8 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A1BG-AS1 | 2 | 43 | 0 | 8 | 8 | 1 | 0 | 2 | 60 | 20 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A1CF | 52 | 1 | 0 | 0 | 0 | 0 | 0 | 20 | 0 | 0 | ... | 33 | 0 | 0 | 27 | 0 | 0 | 0 | 0 | 1 | 0 |
| A2M | 32 | 52 | 4 | 95 | 1262 | 613 | 4 | 10 | 71 | 34 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| hsa-mir-5571 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| hsa-mir-6080 | 29 | 11 | 0 | 4 | 6 | 5 | 1 | 17 | 24 | 16 | ... | 15 | 0 | 0 | 12 | 0 | 0 | 0 | 0 | 0 | 0 |
| hsa-mir-8072 | 11 | 94 | 2 | 6 | 14 | 2 | 0 | 8 | 117 | 10 | ... | 4 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| snoU109 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| snoU13 | 2 | 13 | 0 | 0 | 1 | 1 | 0 | 1 | 9 | 4 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
21784 rows × 722 columns
pseudobulk_adata = ad.AnnData(pseudobulk_data.transpose())
pseudobulk_adata
AnnData object with n_obs × n_vars = 722 × 21784
As noted here, COMPASS takes in normalized gene expression data. We do so using the normalization method provided by Scanpy.
pseudobulk_adata.layers['counts'] = pseudobulk_adata.X.copy() # preserve counts
# normalize counts per cell
sc.pp.normalize_total(pseudobulk_adata, target_sum=1e4)
pseudobulk_adata.layers['norm_counts'] = pseudobulk_adata.X.copy()
pseudobulk_adata.layers['norm_counts']
array([[0.0114797 , 0.00459188, 0.00459188, ..., 0.02525535, 0. ,
0.00459188],
[0.03890552, 0.00919585, 0.03041704, ..., 0.06649307, 0. ,
0.00919585],
[0.05796077, 0. , 0. , ..., 0.05796077, 0. ,
0. ],
...,
[0. , 0. , 0. , ..., 0. , 0. ,
0. ],
[0. , 0. , 0. , ..., 0. , 0. ,
0. ],
[0. , 0. , 0. , ..., 0. , 0. ,
0. ]], dtype=float32)
Note that we need to transpose the DataFrame constructed from the AnnData object since COMPASS takes in a genes x cells matrix.
normalized_pseudobulk_data = pd.DataFrame(pseudobulk_adata.layers['norm_counts'],
index=pseudobulk_adata.obs_names,
columns=pseudobulk_adata.var_names).transpose()
normalized_pseudobulk_data
| N58.LPB1_Inflamed_Epithelial | N58.LPB1_Inflamed_B_cells | N58.LPB1_Inflamed_T_cells | N58.LPB1_Inflamed_Myeloid | N58.LPB1_Inflamed_Fibroblasts | N58.LPB1_Inflamed_Endothelial | N58.LPB1_Inflamed_Glia | N111.LPB1_Inflamed_Epithelial | N111.LPB1_Inflamed_B_cells | N111.LPB1_Inflamed_T_cells | ... | N52.EpiA2b_Non-inflamed_Epithelial | N52.EpiA2b_Non-inflamed_T_cells | N52.EpiA2b_Non-inflamed_Myeloid | N52.EpiA2a_Non-inflamed_Epithelial | N52.EpiA2a_Non-inflamed_T_cells | N52.EpiA2a_Non-inflamed_Myeloid | N58.EpiB2_Inflamed_Epithelial | N58.EpiB2_Inflamed_T_cells | N49.EpiA_Non-inflamed_Epithelial | N49.EpiA_Non-inflamed_Myeloid | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7SK | 0.011480 | 0.038906 | 0.057961 | 0.026216 | 0.004709 | 0.014418 | 0.000000 | 0.000000 | 0.040688 | 0.007955 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
| A1BG | 0.004592 | 0.009196 | 0.000000 | 0.008739 | 0.004709 | 0.014418 | 0.000000 | 0.000000 | 0.006877 | 0.015909 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
| A1BG-AS1 | 0.004592 | 0.030417 | 0.000000 | 0.069911 | 0.037673 | 0.014418 | 0.000000 | 0.009021 | 0.034384 | 0.039773 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
| A1CF | 0.119389 | 0.000707 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.090212 | 0.000000 | 0.000000 | ... | 0.259618 | 0.0 | 0.0 | 0.213429 | 0.0 | 0.0 | 0.0 | 0.0 | 0.068964 | 0.0 |
| A2M | 0.073470 | 0.036783 | 0.115922 | 0.830189 | 5.942971 | 8.838011 | 2.314681 | 0.045106 | 0.040688 | 0.067614 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| hsa-mir-5571 | 0.000000 | 0.003537 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
| hsa-mir-6080 | 0.066582 | 0.007781 | 0.000000 | 0.034955 | 0.028255 | 0.072088 | 0.578670 | 0.076680 | 0.013754 | 0.031818 | ... | 0.118008 | 0.0 | 0.0 | 0.094858 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
| hsa-mir-8072 | 0.025255 | 0.066493 | 0.057961 | 0.052433 | 0.065928 | 0.028835 | 0.000000 | 0.036085 | 0.067050 | 0.019886 | ... | 0.031469 | 0.0 | 0.0 | 0.031619 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
| snoU109 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
| snoU13 | 0.004592 | 0.009196 | 0.000000 | 0.000000 | 0.004709 | 0.014418 | 0.000000 | 0.004511 | 0.005158 | 0.007955 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
21784 rows × 722 columns
normalized_pseudobulk_data.to_csv('ibd_expression_data.tsv', sep='\t')
Run COMPASS#
Now that we’ve performed pseudobulking, we can move on to run COMPASS on our pseudobulked dataset. This should take around 2-3 hours on 50 processors, although the runtime may vary depending on the system.
if os.path.exists('compass_results') == False:
os.mkdir('compass_results')
notebook_directory = os.getcwd()
os.chdir('/home/eecs/charleschien101/Compass')
os.system(f'python -m compass.main \
--data {notebook_directory}/ibd_expression_data.tsv \
--model RECON2_mat --species homo_sapiens --media default-media \
--output-dir {notebook_directory}/compass_results --num-processes 50 \
--select-reactions /home/eecs/charleschien101/Compass/compass/Resources/Metabolic\ Models/RECON2_mat/model/core_reactions.txt')
Cache for model and media already built
Evaluating Reaction Penalties...
Processing 722 samples using 50 processes
Progress bar will update once the first sample is finished
Parse COMPASS results#
To put things in perspective, running COMPASS on just the subset of epithelial cells in healthy patients of the IBD dataset (around 50,000 samples) takes approximately 1 week. When dealing with large-scale datasets, pseudobulking provides a clear advantage in terms of runtime, albeit at the expense of accuracy. In this section of the tutorial we compare the accuracy of results generated using pseudobulked data against the true results.
full_compass_results = pd.read_csv('/data/yosef2/scratch/users/charleschien101/test_turbo_compass_ibd/full_compass/reactions.tsv', sep='\t', index_col=0)
full_compass_results
| N10.EpiA.AAACATACAACCAC | N10.EpiA.AAACATACAGGCGA | N10.EpiA.AAACATACCACTAG | N10.EpiA.AAACATACCCTTTA | N10.EpiA.AAACATACTGCAAC | N10.EpiA.AAACATTGTCCAGA | N10.EpiA.AAACCGTGTCACGA | N10.EpiA.AAACGCACAGCTCA | N10.EpiA.AAACGCACCTACGA | N10.EpiA.AAACGCACCTTGCC | ... | N46.LPB.TGTGGTAGTCTGATTG | N46.LPB.TGTGTTTCAAGGCTCC | N46.LPB.TGTGTTTGTCCAGTGC | N46.LPB.TTAGGCAGTCGGGTCT | N46.LPB.TTGCGTCTCACCTCGT | N46.LPB.TTGCGTCTCAGCTCGG | N46.LPB.TTTGGTTTCAATCACG | N46.LPB.TTTGTCAAGTGTCTCA | N46.LPB.TTTGTCACACACTGCG | N46.LPB.TTTGTCATCTGTGCAA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13DAMPPOX_pos | 57855.385565 | 53540.040251 | 60383.686726 | 55606.682922 | 54643.181945 | 57076.496828 | 59779.499151 | 56867.966581 | 51758.645322 | 58702.171987 | ... | 57451.641482 | 51942.850536 | 50622.822572 | 58210.616649 | 55838.119014 | 52735.172090 | 56491.623389 | 50884.852104 | 57798.525670 | 59061.428351 |
| 24_25VITD2Hm_pos | 14.226549 | 13.705917 | 14.223131 | 13.976815 | 13.465809 | 13.742612 | 14.401317 | 13.858325 | 13.469690 | 14.219898 | ... | 13.969429 | 13.513077 | 13.395237 | 13.713766 | 13.816277 | 13.601905 | 14.017772 | 13.115538 | 13.924589 | 14.197245 |
| 24_25VITD3Hm_pos | 85596.932064 | 81527.063439 | 89380.927981 | 84567.036282 | 82939.465090 | 84773.376379 | 89973.880496 | 87420.111465 | 78403.676584 | 87178.144357 | ... | 85839.111382 | 81414.916101 | 79839.028138 | 87416.351415 | 85071.183232 | 80384.005976 | 82658.531461 | 78924.417430 | 88073.461489 | 88950.191146 |
| 25HVITD3c_pos | 84003.542293 | 80591.980324 | 88479.585549 | 83548.765195 | 81932.648697 | 84185.303284 | 88232.951360 | 85736.267470 | 77029.935910 | 86166.818782 | ... | 84687.625302 | 80341.110087 | 78739.941830 | 86798.608288 | 83836.534566 | 79293.432073 | 81735.659515 | 77825.730557 | 87271.851363 | 87341.467812 |
| 25HVITD3tin_pos | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | ... | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| XYLK_pos | 14285.584118 | 12604.589330 | 15363.725454 | 13649.736824 | 13856.550675 | 14201.644892 | 15217.604903 | 14894.645338 | 12248.068571 | 14991.047457 | ... | 14526.227651 | 13366.546337 | 12034.360348 | 14454.883894 | 14855.031146 | 13012.086231 | 14054.094722 | 13085.385614 | 15728.964226 | 14265.409850 |
| XYLUR_neg | 3528.721199 | 2122.048918 | 2824.192987 | 2667.753557 | 1692.309736 | 3029.564452 | 3537.493243 | 3137.896702 | 2625.510477 | 3231.854120 | ... | 1799.166564 | 2043.221613 | 2170.574572 | 3571.497166 | 2757.338510 | 2130.712431 | 2645.565549 | 2705.662200 | 3177.222631 | 3535.833133 |
| XYLUR_pos | 3800.000000 | 2122.048918 | 3074.417763 | 2667.753557 | 2423.610209 | 3029.564452 | 3800.000000 | 3800.000000 | 2679.956841 | 3146.449156 | ... | 1799.166564 | 2446.203768 | 2340.207683 | 3800.000000 | 3278.669255 | 2516.051438 | 2645.565549 | 3196.419928 | 3177.222631 | 3800.000000 |
| XYLt_neg | 1329.848580 | 1319.417808 | 1415.845544 | 1378.119179 | 1305.989797 | 1345.128020 | 1386.968880 | 1291.539796 | 1270.162781 | 1386.403095 | ... | 1374.369848 | 1314.500673 | 1284.486703 | 1367.109233 | 1263.640187 | 1306.715634 | 1327.753417 | 1227.197406 | 1399.660123 | 1425.123760 |
| XYLt_pos | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
2321 rows × 50258 columns
pseudobulk_results = pd.read_csv('/data/yosef2/scratch/users/charleschien101/test_pseudobulk/reactions.tsv', sep='\t', index_col=0)
pseudobulk_results
| N58.LPB1_Inflamed_Epithelial | N58.LPB1_Inflamed_B_cells | N58.LPB1_Inflamed_T_cells | N58.LPB1_Inflamed_Myeloid | N58.LPB1_Inflamed_Fibroblasts | N58.LPB1_Inflamed_Endothelial | N58.LPB1_Inflamed_Glia | N111.LPB1_Inflamed_Epithelial | N111.LPB1_Inflamed_B_cells | N111.LPB1_Inflamed_T_cells | ... | N52.EpiA2b_Non-inflamed_Epithelial | N52.EpiA2b_Non-inflamed_T_cells | N52.EpiA2b_Non-inflamed_Myeloid | N52.EpiA2a_Non-inflamed_Epithelial | N52.EpiA2a_Non-inflamed_T_cells | N52.EpiA2a_Non-inflamed_Myeloid | N58.EpiB2_Inflamed_Epithelial | N58.EpiB2_Inflamed_T_cells | N49.EpiA_Non-inflamed_Epithelial | N49.EpiA_Non-inflamed_Myeloid | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13DAMPPOX_pos | 49954.234254 | 55387.683662 | 54572.634704 | 52767.750703 | 52490.838421 | 52060.605878 | 53784.468299 | 49437.018820 | 52405.655557 | 53793.992131 | ... | 49848.043351 | 57328.136101 | 55429.211530 | 49915.452036 | 57903.182004 | 55947.945175 | 51811.748630 | 60275.637302 | 49542.055989 | 58549.425050 |
| 24_25VITD2Hm_pos | 13.464853 | 14.126539 | 14.023334 | 13.819175 | 13.879333 | 13.828841 | 13.825165 | 13.421621 | 13.791743 | 14.017318 | ... | 13.458125 | 14.036519 | 13.959093 | 13.462110 | 14.006164 | 14.028159 | 13.484712 | 14.424876 | 13.205290 | 14.387264 |
| 24_25VITD3Hm_pos | 77120.089436 | 83816.201783 | 82833.923463 | 80874.806621 | 80537.798690 | 79389.260164 | 83472.353529 | 76339.810402 | 81568.268602 | 82374.248585 | ... | 77291.364242 | 83515.216203 | 84439.784693 | 77416.918560 | 83568.202506 | 84489.677245 | 79976.881278 | 89145.273961 | 78268.023754 | 85571.958996 |
| 25HVITD3c_pos | 75633.989004 | 82385.420150 | 81289.795371 | 79259.289258 | 79117.093928 | 77786.589317 | 82142.843942 | 74911.266351 | 79986.965202 | 80793.604223 | ... | 75961.428109 | 82379.768476 | 83743.636238 | 76100.288426 | 82624.705927 | 83733.874969 | 78545.700273 | 87993.627853 | 76665.749651 | 84130.837613 |
| 25HVITD3tin_pos | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | ... | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| XYLK_pos | 12118.762031 | 13917.227516 | 13375.529939 | 12952.395353 | 12756.160657 | 12706.940298 | 13661.519211 | 12197.786692 | 12687.007632 | 12984.798456 | ... | 12524.019304 | 13714.611863 | 13978.750586 | 12523.019007 | 13720.606099 | 13975.015167 | 12497.851923 | 15333.983649 | 12642.711910 | 13933.554471 |
| XYLUR_neg | 2331.836903 | 2665.335682 | 2680.906234 | 2488.652979 | 2235.660378 | 2418.453592 | 2191.699957 | 2247.085833 | 2549.530933 | 2512.771006 | ... | 2205.177494 | 2852.800709 | 2385.969865 | 2217.171451 | 2860.863213 | 2382.916742 | 2327.505126 | 3800.000000 | 2348.296845 | 3583.709793 |
| XYLUR_pos | 2481.503033 | 2678.523386 | 2830.842746 | 2518.433610 | 2784.152103 | 2808.868955 | 2668.202897 | 2491.221834 | 2623.773706 | 2764.020500 | ... | 2685.095954 | 2852.800709 | 2385.969865 | 2689.050199 | 2860.863213 | 2382.916742 | 2832.472272 | 3800.000000 | 2676.325344 | 3198.834896 |
| XYLt_neg | 1267.150935 | 1335.123149 | 1324.270558 | 1293.681380 | 1286.364376 | 1290.507136 | 1362.843997 | 1255.876859 | 1320.498073 | 1336.465403 | ... | 1251.144638 | 1366.725298 | 1366.949157 | 1255.505012 | 1366.589844 | 1366.337552 | 1292.777263 | 1436.321062 | 1189.018510 | 1384.056095 |
| XYLt_pos | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
2321 rows × 722 columns
We now group the output of a full COMPASS run by sample name. Note that all single-cell samples are from epithelial cells of healthy patients.
healthy_sample_names = list(meta[meta['Health'] == 'Healthy']['Sample'].value_counts().index)
healthy_sample_names.sort()
healthy_sample_results = []
for sample in healthy_sample_names:
sample_results = full_compass_results.loc[:, [sample in col_name for col_name in full_compass_results.columns]]
healthy_sample_results.append(sample_results.mean(axis=1))
healthy_sample_results = pd.concat(healthy_sample_results, axis=1)
healthy_sample_results.columns = healthy_sample_names
healthy_sample_results
| N10.EpiA | N10.EpiB | N10.LPA | N10.LPB | N11.EpiA | N11.EpiB | N11.LPA | N11.LPB | N13.EpiA | N13.EpiB | ... | N46.LPA | N46.LPB | N51.EpiA | N51.EpiB | N51.LPA | N51.LPB | N8.EpiA | N8.EpiB | N8.LPA | N8.LPB | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13DAMPPOX_pos | 55578.651407 | 55470.906657 | 55570.125999 | 54937.676044 | 56409.997348 | 55792.982908 | 57867.482346 | 56196.925440 | 57330.371846 | 57669.291929 | ... | 55563.618092 | 55969.642611 | 57521.978818 | 57387.069151 | 56167.925784 | 55667.192442 | 56302.202412 | 56113.470813 | 55867.806369 | 55173.565896 |
| 24_25VITD2Hm_pos | 13.794118 | 13.786143 | 13.906611 | 13.767245 | 13.839933 | 13.785692 | 13.905707 | 13.939658 | 13.987922 | 14.019336 | ... | 13.796592 | 13.902185 | 13.994058 | 13.953505 | 13.952344 | 13.945846 | 13.891757 | 13.801656 | 14.064391 | 13.673429 |
| 24_25VITD3Hm_pos | 84117.775411 | 84111.531511 | 83949.984523 | 83232.923065 | 85452.174782 | 84469.698507 | 86750.746832 | 85967.878973 | 86270.784953 | 86650.814947 | ... | 83848.265999 | 84483.224151 | 86674.963094 | 86555.553001 | 84287.645195 | 84185.205660 | 84952.776311 | 84918.778293 | 85545.018142 | 83921.851513 |
| 25HVITD3c_pos | 82883.973050 | 82887.339652 | 82777.774394 | 82098.506352 | 84148.548547 | 83191.415597 | 85636.908326 | 84467.929709 | 84993.473551 | 85398.494273 | ... | 82617.503449 | 83277.654509 | 85417.810845 | 85255.642011 | 83188.298978 | 82968.844782 | 83735.737702 | 83712.568804 | 84260.273090 | 82625.072640 |
| 25HVITD3tin_pos | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | ... | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| XYLK_pos | 14010.427866 | 13986.919476 | 13785.947249 | 13636.181052 | 14474.763705 | 14222.559889 | 14943.910622 | 14417.083814 | 14726.182271 | 14851.123014 | ... | 14088.346260 | 14217.350621 | 14881.704324 | 14864.353245 | 14135.708826 | 14071.461771 | 14254.983728 | 14253.001346 | 14223.174490 | 14107.211083 |
| XYLUR_neg | 2602.564929 | 2592.879784 | 2600.958901 | 2524.247729 | 2891.252037 | 2789.764038 | 2684.096337 | 2533.061741 | 2967.033166 | 2993.576352 | ... | 2820.973648 | 2830.751876 | 3057.997362 | 3095.029403 | 2820.646688 | 2728.318388 | 2711.289020 | 2778.516430 | 2869.199850 | 2726.186203 |
| XYLUR_pos | 2793.201920 | 2793.894955 | 2846.047038 | 2730.287980 | 3087.597109 | 2998.724721 | 3103.006133 | 3145.686305 | 3176.637878 | 3207.594953 | ... | 3037.752325 | 3022.542954 | 3269.881138 | 3308.776530 | 2966.110408 | 2952.381184 | 2956.205529 | 3023.554944 | 3034.593677 | 2926.450454 |
| XYLt_neg | 1325.945409 | 1325.073590 | 1345.795887 | 1335.052432 | 1329.232259 | 1319.555192 | 1391.727385 | 1347.302068 | 1367.081516 | 1370.724405 | ... | 1335.126862 | 1353.444657 | 1370.810047 | 1363.184043 | 1359.446151 | 1357.661669 | 1345.801000 | 1338.053292 | 1374.938106 | 1287.444906 |
| XYLt_pos | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
2321 rows × 48 columns
We now turn to the COMPASS reaction scores generated from pseudobulked data and subset reaction scores from bulk samples that belong to epithelial cells of healthy patients.
healthy_pseudobulk_results = pseudobulk_results.loc[:, [('Healthy' in col_name) and ('Epithelial' in col_name) for col_name in pseudobulk_results.columns]]
healthy_pseudobulk_results = healthy_pseudobulk_results.reindex(sorted(healthy_pseudobulk_results.columns), axis=1)
healthy_pseudobulk_results
| N10.EpiA_Healthy_Epithelial | N10.EpiB_Healthy_Epithelial | N10.LPA_Healthy_Epithelial | N10.LPB_Healthy_Epithelial | N11.EpiA_Healthy_Epithelial | N11.EpiB_Healthy_Epithelial | N11.LPA_Healthy_Epithelial | N11.LPB_Healthy_Epithelial | N13.EpiA_Healthy_Epithelial | N13.EpiB_Healthy_Epithelial | ... | N46.LPA_Healthy_Epithelial | N46.LPB_Healthy_Epithelial | N51.EpiA_Healthy_Epithelial | N51.EpiB_Healthy_Epithelial | N51.LPA_Healthy_Epithelial | N51.LPB_Healthy_Epithelial | N8.EpiA_Healthy_Epithelial | N8.EpiB_Healthy_Epithelial | N8.LPA_Healthy_Epithelial | N8.LPB_Healthy_Epithelial | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13DAMPPOX_pos | 49343.703873 | 49366.543477 | 50041.539245 | 48896.101991 | 49915.199936 | 49544.521493 | 55252.822187 | 51724.697713 | 49292.800450 | 49725.463232 | ... | 49615.464446 | 49458.331646 | 49065.168090 | 48956.253915 | 50802.412675 | 50125.492275 | 49978.430510 | 50117.124899 | 52563.082215 | 50243.282087 |
| 24_25VITD2Hm_pos | 13.191137 | 13.202484 | 13.468908 | 13.305291 | 13.190822 | 13.171305 | 13.495858 | 13.373125 | 13.208837 | 13.242782 | ... | 13.231858 | 13.317501 | 13.110893 | 13.072062 | 13.540608 | 13.556481 | 13.269303 | 13.145874 | 13.691254 | 13.207760 |
| 24_25VITD3Hm_pos | 76853.774149 | 77066.630672 | 77339.140491 | 76737.705259 | 77427.698122 | 76820.455562 | 83737.866095 | 80858.519301 | 77016.536850 | 77171.895298 | ... | 77074.676834 | 77193.646905 | 76459.056213 | 76240.398839 | 78151.207576 | 77887.207890 | 77695.783513 | 77712.041731 | 81827.291025 | 78678.487240 |
| 25HVITD3c_pos | 75378.549906 | 75639.551367 | 76022.411801 | 75443.360618 | 75893.512459 | 75347.874830 | 82949.178039 | 79438.997588 | 75541.606542 | 75916.749105 | ... | 75700.927805 | 75831.984663 | 75214.726969 | 74966.300360 | 76907.534234 | 76506.953508 | 76310.966205 | 76373.965194 | 80478.975063 | 77357.180360 |
| 25HVITD3tin_pos | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | ... | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 | 1900.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| XYLK_pos | 12057.939948 | 12080.690301 | 12004.401733 | 12057.076275 | 12208.845150 | 12184.657229 | 14292.701145 | 12979.042115 | 12207.686013 | 12317.468238 | ... | 12320.942860 | 12163.425350 | 12205.258151 | 12176.458619 | 12555.580720 | 12176.649491 | 12277.383367 | 12195.708797 | 13120.942342 | 12656.104737 |
| XYLUR_neg | 2000.222456 | 2029.411271 | 2087.799752 | 2069.266719 | 2111.078437 | 2070.863760 | 1893.915551 | 1891.415234 | 2167.688474 | 2140.014923 | ... | 2341.635492 | 2219.151900 | 2105.509925 | 2114.245069 | 2334.183506 | 2320.801652 | 2065.203448 | 2064.882459 | 2311.095617 | 2262.123789 |
| XYLUR_pos | 2520.287707 | 2534.215450 | 2598.169991 | 2481.268481 | 2590.286666 | 2582.026895 | 2484.274771 | 2515.142731 | 2636.772176 | 2672.278004 | ... | 2726.333630 | 2592.140867 | 2615.184949 | 2614.880058 | 2582.766166 | 2618.800524 | 2576.986465 | 2574.977737 | 2439.604600 | 2620.344193 |
| XYLt_neg | 1169.780792 | 1176.271699 | 1266.960914 | 1232.795134 | 1168.668379 | 1169.676019 | 1354.200067 | 1231.614154 | 1182.765693 | 1187.261424 | ... | 1196.303447 | 1232.240624 | 1190.140670 | 1186.010375 | 1279.905771 | 1282.794785 | 1184.830820 | 1181.022482 | 1310.780251 | 1178.624288 |
| XYLt_pos | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
2321 rows × 48 columns
We ensure that the column names of the selected single cell samples and pseudobulk samples align correctly so that the comparison is fair.
for col_1, col_2 in zip(healthy_sample_results.columns, healthy_pseudobulk_results.columns):
assert col_1 in col_2
We define some helper functions that allow us to robustly compute the Spearman correlation.
NONCONSTANT_COLUMN_THRESHOLD = 1e-3
def are_equal(u: np.array, v: np.array) -> bool:
r"""
The business logic for determining whether two columns are equal lives here.
"""
return np.max(np.abs(u - v)) < NONCONSTANT_COLUMN_THRESHOLD
def is_nonconstant(v: np.array) -> bool:
r"""
The business logic for determining whether a feature is nonconstant lives here.
:param v: 1D array
:return: True if the column is considered nonconstant.
"""
return np.abs(np.nanmax(v) - np.nanmin(v)) > NONCONSTANT_COLUMN_THRESHOLD
def is_constant(v: np.array) -> bool:
r"""
The business logic for determining whether a feature is constant lives here.
:param v: 1D array
:return: True if the column is considered constant.
"""
return not is_nonconstant(v)
def compute_spearman_r2(
y_true: np.array,
y_pred: np.array) -> float:
r"""
Standard Spearman R2 correlation: a number between -1 and 1.
Note that:
- predicted columns that are almost equal to the ground truth are declared Spearman R2 1.0
- predicted columns that are not almost equal yet ground truth is constant, have Spearman R2 0.0.
- for all other columns, the usual Spearman R2 is computed
"""
assert(len(y_true) == len(y_pred))
# Border case 0: there are no true labels or predictions. Return 0.0 (random)
if len(y_true) == 0:
return 0.0
# Border case 1: predictions are very close to ground truth
if are_equal(y_true, y_pred):
return 1.0
# Border case 2: predictions are not very close to ground truth, yet ground truth is constant
if not are_equal(y_true, y_pred) and is_constant(y_true):
return 0.0
# Usual computation:
# Border case: all true values or predicted values are constant (but the other is not)
if np.std(y_true) == 0.0: # pragma: no cover
raise ValueError("This should never happen!")
if np.std(y_pred) == 0.0:
assert(np.std(y_true) != 0.0)
return 0.0
spearman_r2 = scipy.stats.spearmanr(y_true, y_pred).correlation
if np.isnan(spearman_r2): # pragma: no cover
return 0.0
return spearman_r2
def compute_spearman_r2s(
X_true: np.array,
X_pred: np.array)\
-> np.array:
r"""
Computes Spearman R2 accross all columns.
Note that:
- predicted columns that are almost equal to the ground truth are declared Spearman R2 1.0
- predicted columns that are not almost equal yet ground truth is constant, have Spearman R2 0.0.
- for all other columns, the usual Spearman R2 is computed
"""
nrows, ncols = X_true.shape
spearman_r2s = np.zeros(shape=(ncols))
for c in range(ncols):
y_true = X_true[:, c].flatten()
y_pred = X_pred[:, c].flatten()
spearman_r2 = compute_spearman_r2(y_true, y_pred)
spearman_r2s[c] = spearman_r2
return spearman_r2s
As shown below, Compass scores generated from pseudobulked samples achieve a median Spearman correlation of 0.53 when compared to the ground truth Compass scores generated from single-cell samples. This demonstrates that pseudobulking does indeed yield significant speedups (3 hours v.s. 1 week!) while maintaining high accuracy.
spearman_r2s = compute_spearman_r2s(healthy_sample_results.transpose().to_numpy(), healthy_pseudobulk_results.transpose().to_numpy())
print(f'Median Spearman correlation: {np.median(spearman_r2s)}')
plt.violinplot(spearman_r2s, quantiles=[0.5])
plt.show()
Median Spearman correlation: 0.5326747720364742