Compass Pseudobulking#

The runtime of Compass may become prohibitively long for large-scale single-cell RNA-seq datasets. To mitigate this issue, we suggest that users perform pseudobulking on the data before running Compass. This notebook demonstrates how pseudobulking can be performed and runs Compass on pseudobulked data.

The dataset used in this tutorial is from Smillie, C. S. et al. Intra- and inter-cellular rewiring of the human colon during ulcerative colitis. Cell 178, 714–730.e22 (2019) (link). You can download the expression data from the Single Cell Portal.

The files you need are:

Metadata

all.meta2.txt

Epithelial data

Epi.genes.tsv
Epi.barcodes2.tsv
gene_sorted-Epi.matrix.mtx

Stromal data

Fib.genes.tsv
Fib.barcodes2.tsv
gene_sorted-Fib.matrix.mtx

Immune data

Imm.genes.tsv
Imm.barcodes2.tsv
gene_sorted-Imm.matrix.mtx

Also download the cell_subsets.txt file from the GitHub repo provided by this study.

Place the aforementioned files in the same directory as this notebook.

Requirements#

All of the required python packages can be installed using a cell at the start of the notebook. To install the required python packages, you can uncomment the “install_reqs()” call.

def install_reqs():
  !pip install pandas
  !pip install matplotlib
  !pip install numpy
  !pip install scipy
  !pip install anndata
  !pip install scanpy
#install_reqs()

Pseudobulking#

This section of the tutorial will demonstrate how to perform pseudobulking on a large scRNA-seq dataset. If you already have an AnnData object that you can process directly, you may move on to this section.

import pandas as pd
import numpy as np
from scipy.io import mmread
from scipy import sparse
from scipy.sparse import csr_matrix, vstack
import os
import matplotlib.pyplot as plt
import scipy.stats

import anndata as ad
import scanpy as sc

DATA_DIR = '/data/yosef2/scratch/users/charleschien101/ibd_data'

We read in the metadata file here. We will be performing pseudobulking based on the Health, Sample, and Cluster columns.

meta = pd.read_csv(os.path.join(DATA_DIR, 'all.meta2.txt'), sep='\t')
meta

	NAME	Cluster	nGene	nUMI	Subject	Health	Location	Sample
0	TYPE	group	numeric	numeric	group	group	group	group
1	N7.EpiA.AAACATACACACTG	TA 1	328	891	N7	Non-inflamed	Epi	N7.EpiA
2	N7.EpiA.AAACCGTGCATCAG	TA 1	257	663	N7	Non-inflamed	Epi	N7.EpiA
3	N7.EpiA.AAACGCACAATCGC	TA 2	300	639	N7	Non-inflamed	Epi	N7.EpiA
4	N7.EpiA.AAAGATCTAACCGT	Enterocyte Progenitors	250	649	N7	Non-inflamed	Epi	N7.EpiA
...	...	...	...	...	...	...	...	...
365488	N110.LPB.TTTGGTTAGGATGGTC	Macrophages	635	1366	N110	Inflamed	LP	N110.LPB
365489	N110.LPB.TTTGGTTCACCTCGTT	Plasma	610	2730	N110	Inflamed	LP	N110.LPB
365490	N110.LPB.TTTGGTTTCGGAAACG	Macrophages	859	1979	N110	Inflamed	LP	N110.LPB
365491	N110.LPB.TTTGTCAGTTGACGTT	Macrophages	965	2696	N110	Inflamed	LP	N110.LPB
365492	N110.LPB.TTTGTCATCTGACCTC	CD69+ Mast	559	1156	N110	Inflamed	LP	N110.LPB

365493 rows × 8 columns

The Cluster column corresponds to various cell types. Since the cell types provided in the Cluster column is rather fine-grained, we used the more coarse-grained cell types provided in the cell_subsets.txt file.

cell_subsets = pd.read_csv(os.path.join(DATA_DIR, 'cell_subsets.txt'), sep='\t', names=['Cluster', 'CellType'], header=None)
cell_subsets.head()

	Cluster	CellType
0	Stem	Epithelial
1	TA 1	Epithelial
2	TA 2	Epithelial
3	Cycling TA	Epithelial
4	Immature Enterocytes 1	Epithelial

The dataset used in this notebook separately stores expression data for epithelial cells, stromal cells, and immune cells. Here we read in the barcodes (cell names), gene names, and expression matrix.

epi_barcodes = pd.read_csv(os.path.join(DATA_DIR, 'Epi.barcodes2.tsv'), sep='\t', names=['barcodes'], header=None)
fib_barcodes = pd.read_csv(os.path.join(DATA_DIR, 'Fib.barcodes2.tsv'), sep='\t', names=['barcodes'], header=None)
imm_barcodes = pd.read_csv(os.path.join(DATA_DIR, 'Imm.barcodes2.tsv'), sep='\t', names=['barcodes'], header=None)

epi_genes = pd.read_csv(os.path.join(DATA_DIR, 'Epi.genes.tsv'), sep='\t', names=['genes'], header=None)
fib_genes = pd.read_csv(os.path.join(DATA_DIR, 'Fib.genes.tsv'), sep='\t', names=['genes'], header=None)
imm_genes = pd.read_csv(os.path.join(DATA_DIR, 'Imm.genes.tsv'), sep='\t', names=['genes'], header=None)

epi_matrix = (mmread(os.path.join(DATA_DIR, 'gene_sorted-Epi.matrix.mtx')))
epi_matrix

<20028x123006 sparse matrix of type '<class 'numpy.int64'>'
	with 174423911 stored elements in COOrdinate format>

fib_matrix = (mmread(os.path.join(DATA_DIR, 'gene_sorted-Fib.matrix.mtx')))
fib_matrix

<19076x31872 sparse matrix of type '<class 'numpy.int64'>'
	with 39087919 stored elements in COOrdinate format>

imm_matrix = (mmread(os.path.join(DATA_DIR, 'gene_sorted-Imm.matrix.mtx')))
imm_matrix

<20529x210614 sparse matrix of type '<class 'numpy.int64'>'
	with 173255714 stored elements in COOrdinate format>

Since the three expression matrices contain different numbers of genes, we need to fill in the missing gene expression data (with zeros) to be able to integrate the three datasets together. We also create AnnData objects from the expression data for easier data manipulation. If you are not familiar with AnnData objects, you can refer to the AnnData documentation.

all_genes = list(set.union(set(epi_genes['genes']), set(fib_genes['genes']), set(imm_genes['genes'])))
len(all_genes)

epi_matrix_zeros = csr_matrix(np.zeros([len(all_genes) - epi_matrix.shape[0], epi_matrix.shape[1]]), dtype=np.int64)
epi_matrix_combined = vstack((epi_matrix, epi_matrix_zeros))
epi_matrix_combined

<21784x123006 sparse matrix of type '<class 'numpy.int64'>'
	with 174423911 stored elements in COOrdinate format>

epi_remaining_genes = list(set(all_genes) - set(epi_genes['genes']))

epi_adata = ad.AnnData(epi_matrix_combined.tocsr().transpose())
epi_adata.obs_names = list(epi_barcodes['barcodes'])
epi_adata.var_names = list(epi_genes['genes']) + epi_remaining_genes
epi_adata = epi_adata[:, epi_adata.var_names.sort_values()]
epi_adata

View of AnnData object with n_obs × n_vars = 123006 × 21784

fib_matrix_zeros = csr_matrix(np.zeros([len(all_genes) - fib_matrix.shape[0], fib_matrix.shape[1]]), dtype=np.int64)
fib_matrix_combined = vstack((fib_matrix, fib_matrix_zeros))
fib_matrix_combined

<21784x31872 sparse matrix of type '<class 'numpy.int64'>'
	with 39087919 stored elements in COOrdinate format>

fib_remaining_genes = list(set(all_genes) - set(fib_genes['genes']))

fib_adata = ad.AnnData(fib_matrix_combined.tocsr().transpose())
fib_adata.obs_names = list(fib_barcodes['barcodes'])
fib_adata.var_names = list(fib_genes['genes']) + fib_remaining_genes
fib_adata = fib_adata[:, fib_adata.var_names.sort_values()]
fib_adata

View of AnnData object with n_obs × n_vars = 31872 × 21784

imm_matrix_zeros = csr_matrix(np.zeros([len(all_genes) - imm_matrix.shape[0], imm_matrix.shape[1]]), dtype=np.int64)
imm_matrix_combined = vstack((imm_matrix, imm_matrix_zeros))
imm_matrix_combined

<21784x210614 sparse matrix of type '<class 'numpy.int64'>'
	with 173255714 stored elements in COOrdinate format>

imm_remaining_genes = list(set(all_genes) - set(imm_genes['genes']))

imm_adata = ad.AnnData(imm_matrix_combined.tocsr().transpose())
imm_adata.obs_names = list(imm_barcodes['barcodes'])
imm_adata.var_names = list(imm_genes['genes']) + imm_remaining_genes
imm_adata = imm_adata[:, imm_adata.var_names.sort_values()]
imm_adata

View of AnnData object with n_obs × n_vars = 210614 × 21784

epi_var_names = list(epi_adata.var_names)
fib_var_names = list(fib_adata.var_names)
imm_var_names = list(imm_adata.var_names)

assert epi_var_names == fib_var_names
assert epi_var_names == imm_var_names

After ensuring that all three expression matrices contain the same genes, we concatenate the three AnnData objects together.

adata = ad.concat([epi_adata, fib_adata, imm_adata])
adata

AnnData object with n_obs × n_vars = 365492 × 21784

We confirm that the cells in the AnnData object are arranged in the same order as in the metadata file. This allows us to easily integrate metadata into the AnnData object.

assert list(adata.obs_names) == list(meta['NAME'][1:])

adata.obs = meta.iloc[1:, ][['NAME', 'Cluster', 'Subject', 'Health', 'Location', 'Sample']].reset_index(drop=True)
adata.obs

	NAME	Cluster	Subject	Health	Location	Sample
0	N7.EpiA.AAACATACACACTG	TA 1	N7	Non-inflamed	Epi	N7.EpiA
1	N7.EpiA.AAACCGTGCATCAG	TA 1	N7	Non-inflamed	Epi	N7.EpiA
2	N7.EpiA.AAACGCACAATCGC	TA 2	N7	Non-inflamed	Epi	N7.EpiA
3	N7.EpiA.AAAGATCTAACCGT	Enterocyte Progenitors	N7	Non-inflamed	Epi	N7.EpiA
4	N7.EpiA.AAAGATCTAGGCGA	Enterocyte Progenitors	N7	Non-inflamed	Epi	N7.EpiA
...	...	...	...	...	...	...
365487	N110.LPB.TTTGGTTAGGATGGTC	Macrophages	N110	Inflamed	LP	N110.LPB
365488	N110.LPB.TTTGGTTCACCTCGTT	Plasma	N110	Inflamed	LP	N110.LPB
365489	N110.LPB.TTTGGTTTCGGAAACG	Macrophages	N110	Inflamed	LP	N110.LPB
365490	N110.LPB.TTTGTCAGTTGACGTT	Macrophages	N110	Inflamed	LP	N110.LPB
365491	N110.LPB.TTTGTCATCTGACCTC	CD69+ Mast	N110	Inflamed	LP	N110.LPB

365492 rows × 6 columns

We also add another CellType column that maps the fine-grained cell types to coarse-grained cell types used for pseudobulking.

adata.obs = pd.merge(adata.obs, cell_subsets, on='Cluster')
adata.obs

	NAME	Cluster	Subject	Health	Location	Sample	CellType
0	N7.EpiA.AAACATACACACTG	TA 1	N7	Non-inflamed	Epi	N7.EpiA	Epithelial
1	N7.EpiA.AAACCGTGCATCAG	TA 1	N7	Non-inflamed	Epi	N7.EpiA	Epithelial
2	N7.EpiA.AAACGCACAATCGC	TA 2	N7	Non-inflamed	Epi	N7.EpiA	Epithelial
3	N7.EpiA.AAAGATCTAACCGT	Enterocyte Progenitors	N7	Non-inflamed	Epi	N7.EpiA	Epithelial
4	N7.EpiA.AAAGATCTAGGCGA	Enterocyte Progenitors	N7	Non-inflamed	Epi	N7.EpiA	Epithelial
...	...	...	...	...	...	...	...
365487	N110.LPB.TTTGGTTAGGATGGTC	Macrophages	N110	Inflamed	LP	N110.LPB	Myeloid
365488	N110.LPB.TTTGGTTCACCTCGTT	Plasma	N110	Inflamed	LP	N110.LPB	B_cells
365489	N110.LPB.TTTGGTTTCGGAAACG	Macrophages	N110	Inflamed	LP	N110.LPB	Myeloid
365490	N110.LPB.TTTGTCAGTTGACGTT	Macrophages	N110	Inflamed	LP	N110.LPB	Myeloid
365491	N110.LPB.TTTGTCATCTGACCTC	CD69+ Mast	N110	Inflamed	LP	N110.LPB	Myeloid

365492 rows × 7 columns

for i in range(len(adata.obs)):
    assert adata.obs.iloc[i, :]['Subject'] in adata.obs.iloc[i, :]['Sample']

Since the samples are named according to subject, we can perform pseudobulking directly on cell types, samples and health conditions without having to separately group expression data by subject.

Note that in this tutorial we are performing pseudobulking on coarse-grained cell types, but depending on the experiment conducted and the hypotheses you would like to test, you might want to consider subsetting the expression data of certain fine-grained cell types and performing pseudobulking on them.

health_conditions = list(adata.obs['Health'].value_counts().index)
samples = list(adata.obs['Sample'].value_counts().index)
cell_types = list(adata.obs['CellType'].value_counts().index)

pseudobulk_dict = {}

for sample in samples:
    for health_condition in health_conditions:
        for cell_type in cell_types:
            adata_subset = adata[(adata.obs['Sample'] == sample) & (adata.obs['Health'] == health_condition) & (adata.obs['CellType'] == cell_type)]
            if len(adata_subset) == 0:
                continue

            gene_expression = adata_subset.X
            summed_gene_expression = gene_expression.sum(axis=0)

            pseudobulk_dict[f'{sample}_{health_condition}_{cell_type}'] = np.asarray(summed_gene_expression).reshape(-1)

pseudobulk_data = pd.DataFrame.from_dict(pseudobulk_dict)
pseudobulk_data.index = adata.var_names
pseudobulk_data

	N58.LPB1_Inflamed_Epithelial	N58.LPB1_Inflamed_B_cells	N58.LPB1_Inflamed_T_cells	N58.LPB1_Inflamed_Myeloid	N58.LPB1_Inflamed_Fibroblasts	N58.LPB1_Inflamed_Endothelial	N58.LPB1_Inflamed_Glia	N111.LPB1_Inflamed_Epithelial	N111.LPB1_Inflamed_B_cells	N111.LPB1_Inflamed_T_cells	...	N52.EpiA2b_Non-inflamed_Epithelial	N52.EpiA2b_Non-inflamed_T_cells	N52.EpiA2b_Non-inflamed_Myeloid	N52.EpiA2a_Non-inflamed_Epithelial	N52.EpiA2a_Non-inflamed_T_cells	N52.EpiA2a_Non-inflamed_Myeloid	N58.EpiB2_Inflamed_Epithelial	N58.EpiB2_Inflamed_T_cells	N49.EpiA_Non-inflamed_Epithelial	N49.EpiA_Non-inflamed_Myeloid
7SK	5	55	2	3	1	1	0	0	71	4	...	0	0	0	0	0	0	0	0	0	0
A1BG	2	13	0	1	1	1	0	0	12	8	...	0	0	0	0	0	0	0	0	0	0
A1BG-AS1	2	43	0	8	8	1	0	2	60	20	...	0	0	0	0	0	0	0	0	0	0
A1CF	52	1	0	0	0	0	0	20	0	0	...	33	0	0	27	0	0	0	0	1	0
A2M	32	52	4	95	1262	613	4	10	71	34	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
hsa-mir-5571	0	5	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
hsa-mir-6080	29	11	0	4	6	5	1	17	24	16	...	15	0	0	12	0	0	0	0	0	0
hsa-mir-8072	11	94	2	6	14	2	0	8	117	10	...	4	0	0	4	0	0	0	0	0	0
snoU109	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
snoU13	2	13	0	0	1	1	0	1	9	4	...	0	0	0	0	0	0	0	0	0	0

21784 rows × 722 columns

pseudobulk_adata = ad.AnnData(pseudobulk_data.transpose())
pseudobulk_adata

AnnData object with n_obs × n_vars = 722 × 21784

As noted here, COMPASS takes in normalized gene expression data. We do so using the normalization method provided by Scanpy.

pseudobulk_adata.layers['counts'] = pseudobulk_adata.X.copy()  # preserve counts
# normalize counts per cell
sc.pp.normalize_total(pseudobulk_adata, target_sum=1e4)
pseudobulk_adata.layers['norm_counts'] = pseudobulk_adata.X.copy()
pseudobulk_adata.layers['norm_counts']

array([[0.0114797 , 0.00459188, 0.00459188, ..., 0.02525535, 0.        ,
        0.00459188],
       [0.03890552, 0.00919585, 0.03041704, ..., 0.06649307, 0.        ,
        0.00919585],
       [0.05796077, 0.        , 0.        , ..., 0.05796077, 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]], dtype=float32)

Note that we need to transpose the DataFrame constructed from the AnnData object since COMPASS takes in a genes x cells matrix.

normalized_pseudobulk_data = pd.DataFrame(pseudobulk_adata.layers['norm_counts'],
                                          index=pseudobulk_adata.obs_names,
                                          columns=pseudobulk_adata.var_names).transpose()
normalized_pseudobulk_data

	N58.LPB1_Inflamed_Epithelial	N58.LPB1_Inflamed_B_cells	N58.LPB1_Inflamed_T_cells	N58.LPB1_Inflamed_Myeloid	N58.LPB1_Inflamed_Fibroblasts	N58.LPB1_Inflamed_Endothelial	N58.LPB1_Inflamed_Glia	N111.LPB1_Inflamed_Epithelial	N111.LPB1_Inflamed_B_cells	N111.LPB1_Inflamed_T_cells	...	N52.EpiA2b_Non-inflamed_Epithelial	N52.EpiA2b_Non-inflamed_T_cells	N52.EpiA2b_Non-inflamed_Myeloid	N52.EpiA2a_Non-inflamed_Epithelial	N52.EpiA2a_Non-inflamed_T_cells	N52.EpiA2a_Non-inflamed_Myeloid	N58.EpiB2_Inflamed_Epithelial	N58.EpiB2_Inflamed_T_cells	N49.EpiA_Non-inflamed_Epithelial	N49.EpiA_Non-inflamed_Myeloid
7SK	0.011480	0.038906	0.057961	0.026216	0.004709	0.014418	0.000000	0.000000	0.040688	0.007955	...	0.000000	0.0	0.0	0.000000	0.0	0.0	0.0	0.0	0.000000	0.0
A1BG	0.004592	0.009196	0.000000	0.008739	0.004709	0.014418	0.000000	0.000000	0.006877	0.015909	...	0.000000	0.0	0.0	0.000000	0.0	0.0	0.0	0.0	0.000000	0.0
A1BG-AS1	0.004592	0.030417	0.000000	0.069911	0.037673	0.014418	0.000000	0.009021	0.034384	0.039773	...	0.000000	0.0	0.0	0.000000	0.0	0.0	0.0	0.0	0.000000	0.0
A1CF	0.119389	0.000707	0.000000	0.000000	0.000000	0.000000	0.000000	0.090212	0.000000	0.000000	...	0.259618	0.0	0.0	0.213429	0.0	0.0	0.0	0.0	0.068964	0.0
A2M	0.073470	0.036783	0.115922	0.830189	5.942971	8.838011	2.314681	0.045106	0.040688	0.067614	...	0.000000	0.0	0.0	0.000000	0.0	0.0	0.0	0.0	0.000000	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
hsa-mir-5571	0.000000	0.003537	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.0	0.0	0.000000	0.0	0.0	0.0	0.0	0.000000	0.0
hsa-mir-6080	0.066582	0.007781	0.000000	0.034955	0.028255	0.072088	0.578670	0.076680	0.013754	0.031818	...	0.118008	0.0	0.0	0.094858	0.0	0.0	0.0	0.0	0.000000	0.0
hsa-mir-8072	0.025255	0.066493	0.057961	0.052433	0.065928	0.028835	0.000000	0.036085	0.067050	0.019886	...	0.031469	0.0	0.0	0.031619	0.0	0.0	0.0	0.0	0.000000	0.0
snoU109	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.0	0.0	0.000000	0.0	0.0	0.0	0.0	0.000000	0.0
snoU13	0.004592	0.009196	0.000000	0.000000	0.004709	0.014418	0.000000	0.004511	0.005158	0.007955	...	0.000000	0.0	0.0	0.000000	0.0	0.0	0.0	0.0	0.000000	0.0

21784 rows × 722 columns

normalized_pseudobulk_data.to_csv('ibd_expression_data.tsv', sep='\t')

Run COMPASS#

Now that we’ve performed pseudobulking, we can move on to run COMPASS on our pseudobulked dataset. This should take around 2-3 hours on 50 processors, although the runtime may vary depending on the system.

if os.path.exists('compass_results') == False:
    os.mkdir('compass_results')

notebook_directory = os.getcwd()
os.chdir('/home/eecs/charleschien101/Compass')
os.system(f'python -m compass.main \
          --data {notebook_directory}/ibd_expression_data.tsv \
          --model RECON2_mat --species homo_sapiens --media default-media \
          --output-dir {notebook_directory}/compass_results --num-processes 50 \
          --select-reactions /home/eecs/charleschien101/Compass/compass/Resources/Metabolic\ Models/RECON2_mat/model/core_reactions.txt')

Cache for model and media already built
Evaluating Reaction Penalties...
Processing 722 samples using 50 processes
Progress bar will update once the first sample is finished

Parse COMPASS results#

To put things in perspective, running COMPASS on just the subset of epithelial cells in healthy patients of the IBD dataset (around 50,000 samples) takes approximately 1 week. When dealing with large-scale datasets, pseudobulking provides a clear advantage in terms of runtime, albeit at the expense of accuracy. In this section of the tutorial we compare the accuracy of results generated using pseudobulked data against the true results.

full_compass_results = pd.read_csv('/data/yosef2/scratch/users/charleschien101/test_turbo_compass_ibd/full_compass/reactions.tsv', sep='\t', index_col=0)
full_compass_results

	N10.EpiA.AAACATACAACCAC	N10.EpiA.AAACATACAGGCGA	N10.EpiA.AAACATACCACTAG	N10.EpiA.AAACATACCCTTTA	N10.EpiA.AAACATACTGCAAC	N10.EpiA.AAACATTGTCCAGA	N10.EpiA.AAACCGTGTCACGA	N10.EpiA.AAACGCACAGCTCA	N10.EpiA.AAACGCACCTACGA	N10.EpiA.AAACGCACCTTGCC	...	N46.LPB.TGTGGTAGTCTGATTG	N46.LPB.TGTGTTTCAAGGCTCC	N46.LPB.TGTGTTTGTCCAGTGC	N46.LPB.TTAGGCAGTCGGGTCT	N46.LPB.TTGCGTCTCACCTCGT	N46.LPB.TTGCGTCTCAGCTCGG	N46.LPB.TTTGGTTTCAATCACG	N46.LPB.TTTGTCAAGTGTCTCA	N46.LPB.TTTGTCACACACTGCG	N46.LPB.TTTGTCATCTGTGCAA
13DAMPPOX_pos	57855.385565	53540.040251	60383.686726	55606.682922	54643.181945	57076.496828	59779.499151	56867.966581	51758.645322	58702.171987	...	57451.641482	51942.850536	50622.822572	58210.616649	55838.119014	52735.172090	56491.623389	50884.852104	57798.525670	59061.428351
24_25VITD2Hm_pos	14.226549	13.705917	14.223131	13.976815	13.465809	13.742612	14.401317	13.858325	13.469690	14.219898	...	13.969429	13.513077	13.395237	13.713766	13.816277	13.601905	14.017772	13.115538	13.924589	14.197245
24_25VITD3Hm_pos	85596.932064	81527.063439	89380.927981	84567.036282	82939.465090	84773.376379	89973.880496	87420.111465	78403.676584	87178.144357	...	85839.111382	81414.916101	79839.028138	87416.351415	85071.183232	80384.005976	82658.531461	78924.417430	88073.461489	88950.191146
25HVITD3c_pos	84003.542293	80591.980324	88479.585549	83548.765195	81932.648697	84185.303284	88232.951360	85736.267470	77029.935910	86166.818782	...	84687.625302	80341.110087	78739.941830	86798.608288	83836.534566	79293.432073	81735.659515	77825.730557	87271.851363	87341.467812
25HVITD3tin_pos	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	...	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
XYLK_pos	14285.584118	12604.589330	15363.725454	13649.736824	13856.550675	14201.644892	15217.604903	14894.645338	12248.068571	14991.047457	...	14526.227651	13366.546337	12034.360348	14454.883894	14855.031146	13012.086231	14054.094722	13085.385614	15728.964226	14265.409850
XYLUR_neg	3528.721199	2122.048918	2824.192987	2667.753557	1692.309736	3029.564452	3537.493243	3137.896702	2625.510477	3231.854120	...	1799.166564	2043.221613	2170.574572	3571.497166	2757.338510	2130.712431	2645.565549	2705.662200	3177.222631	3535.833133
XYLUR_pos	3800.000000	2122.048918	3074.417763	2667.753557	2423.610209	3029.564452	3800.000000	3800.000000	2679.956841	3146.449156	...	1799.166564	2446.203768	2340.207683	3800.000000	3278.669255	2516.051438	2645.565549	3196.419928	3177.222631	3800.000000
XYLt_neg	1329.848580	1319.417808	1415.845544	1378.119179	1305.989797	1345.128020	1386.968880	1291.539796	1270.162781	1386.403095	...	1374.369848	1314.500673	1284.486703	1367.109233	1263.640187	1306.715634	1327.753417	1227.197406	1399.660123	1425.123760
XYLt_pos	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000

2321 rows × 50258 columns

pseudobulk_results = pd.read_csv('/data/yosef2/scratch/users/charleschien101/test_pseudobulk/reactions.tsv', sep='\t', index_col=0)
pseudobulk_results

	N58.LPB1_Inflamed_Epithelial	N58.LPB1_Inflamed_B_cells	N58.LPB1_Inflamed_T_cells	N58.LPB1_Inflamed_Myeloid	N58.LPB1_Inflamed_Fibroblasts	N58.LPB1_Inflamed_Endothelial	N58.LPB1_Inflamed_Glia	N111.LPB1_Inflamed_Epithelial	N111.LPB1_Inflamed_B_cells	N111.LPB1_Inflamed_T_cells	...	N52.EpiA2b_Non-inflamed_Epithelial	N52.EpiA2b_Non-inflamed_T_cells	N52.EpiA2b_Non-inflamed_Myeloid	N52.EpiA2a_Non-inflamed_Epithelial	N52.EpiA2a_Non-inflamed_T_cells	N52.EpiA2a_Non-inflamed_Myeloid	N58.EpiB2_Inflamed_Epithelial	N58.EpiB2_Inflamed_T_cells	N49.EpiA_Non-inflamed_Epithelial	N49.EpiA_Non-inflamed_Myeloid
13DAMPPOX_pos	49954.234254	55387.683662	54572.634704	52767.750703	52490.838421	52060.605878	53784.468299	49437.018820	52405.655557	53793.992131	...	49848.043351	57328.136101	55429.211530	49915.452036	57903.182004	55947.945175	51811.748630	60275.637302	49542.055989	58549.425050
24_25VITD2Hm_pos	13.464853	14.126539	14.023334	13.819175	13.879333	13.828841	13.825165	13.421621	13.791743	14.017318	...	13.458125	14.036519	13.959093	13.462110	14.006164	14.028159	13.484712	14.424876	13.205290	14.387264
24_25VITD3Hm_pos	77120.089436	83816.201783	82833.923463	80874.806621	80537.798690	79389.260164	83472.353529	76339.810402	81568.268602	82374.248585	...	77291.364242	83515.216203	84439.784693	77416.918560	83568.202506	84489.677245	79976.881278	89145.273961	78268.023754	85571.958996
25HVITD3c_pos	75633.989004	82385.420150	81289.795371	79259.289258	79117.093928	77786.589317	82142.843942	74911.266351	79986.965202	80793.604223	...	75961.428109	82379.768476	83743.636238	76100.288426	82624.705927	83733.874969	78545.700273	87993.627853	76665.749651	84130.837613
25HVITD3tin_pos	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	...	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
XYLK_pos	12118.762031	13917.227516	13375.529939	12952.395353	12756.160657	12706.940298	13661.519211	12197.786692	12687.007632	12984.798456	...	12524.019304	13714.611863	13978.750586	12523.019007	13720.606099	13975.015167	12497.851923	15333.983649	12642.711910	13933.554471
XYLUR_neg	2331.836903	2665.335682	2680.906234	2488.652979	2235.660378	2418.453592	2191.699957	2247.085833	2549.530933	2512.771006	...	2205.177494	2852.800709	2385.969865	2217.171451	2860.863213	2382.916742	2327.505126	3800.000000	2348.296845	3583.709793
XYLUR_pos	2481.503033	2678.523386	2830.842746	2518.433610	2784.152103	2808.868955	2668.202897	2491.221834	2623.773706	2764.020500	...	2685.095954	2852.800709	2385.969865	2689.050199	2860.863213	2382.916742	2832.472272	3800.000000	2676.325344	3198.834896
XYLt_neg	1267.150935	1335.123149	1324.270558	1293.681380	1286.364376	1290.507136	1362.843997	1255.876859	1320.498073	1336.465403	...	1251.144638	1366.725298	1366.949157	1255.505012	1366.589844	1366.337552	1292.777263	1436.321062	1189.018510	1384.056095
XYLt_pos	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000

2321 rows × 722 columns

We now group the output of a full COMPASS run by sample name. Note that all single-cell samples are from epithelial cells of healthy patients.

healthy_sample_names = list(meta[meta['Health'] == 'Healthy']['Sample'].value_counts().index)
healthy_sample_names.sort()

healthy_sample_results = []
for sample in healthy_sample_names:
    sample_results = full_compass_results.loc[:, [sample in col_name for col_name in full_compass_results.columns]]
    healthy_sample_results.append(sample_results.mean(axis=1))
healthy_sample_results = pd.concat(healthy_sample_results, axis=1)
healthy_sample_results.columns = healthy_sample_names
healthy_sample_results

	N10.EpiA	N10.EpiB	N10.LPA	N10.LPB	N11.EpiA	N11.EpiB	N11.LPA	N11.LPB	N13.EpiA	N13.EpiB	...	N46.LPA	N46.LPB	N51.EpiA	N51.EpiB	N51.LPA	N51.LPB	N8.EpiA	N8.EpiB	N8.LPA	N8.LPB
13DAMPPOX_pos	55578.651407	55470.906657	55570.125999	54937.676044	56409.997348	55792.982908	57867.482346	56196.925440	57330.371846	57669.291929	...	55563.618092	55969.642611	57521.978818	57387.069151	56167.925784	55667.192442	56302.202412	56113.470813	55867.806369	55173.565896
24_25VITD2Hm_pos	13.794118	13.786143	13.906611	13.767245	13.839933	13.785692	13.905707	13.939658	13.987922	14.019336	...	13.796592	13.902185	13.994058	13.953505	13.952344	13.945846	13.891757	13.801656	14.064391	13.673429
24_25VITD3Hm_pos	84117.775411	84111.531511	83949.984523	83232.923065	85452.174782	84469.698507	86750.746832	85967.878973	86270.784953	86650.814947	...	83848.265999	84483.224151	86674.963094	86555.553001	84287.645195	84185.205660	84952.776311	84918.778293	85545.018142	83921.851513
25HVITD3c_pos	82883.973050	82887.339652	82777.774394	82098.506352	84148.548547	83191.415597	85636.908326	84467.929709	84993.473551	85398.494273	...	82617.503449	83277.654509	85417.810845	85255.642011	83188.298978	82968.844782	83735.737702	83712.568804	84260.273090	82625.072640
25HVITD3tin_pos	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	...	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
XYLK_pos	14010.427866	13986.919476	13785.947249	13636.181052	14474.763705	14222.559889	14943.910622	14417.083814	14726.182271	14851.123014	...	14088.346260	14217.350621	14881.704324	14864.353245	14135.708826	14071.461771	14254.983728	14253.001346	14223.174490	14107.211083
XYLUR_neg	2602.564929	2592.879784	2600.958901	2524.247729	2891.252037	2789.764038	2684.096337	2533.061741	2967.033166	2993.576352	...	2820.973648	2830.751876	3057.997362	3095.029403	2820.646688	2728.318388	2711.289020	2778.516430	2869.199850	2726.186203
XYLUR_pos	2793.201920	2793.894955	2846.047038	2730.287980	3087.597109	2998.724721	3103.006133	3145.686305	3176.637878	3207.594953	...	3037.752325	3022.542954	3269.881138	3308.776530	2966.110408	2952.381184	2956.205529	3023.554944	3034.593677	2926.450454
XYLt_neg	1325.945409	1325.073590	1345.795887	1335.052432	1329.232259	1319.555192	1391.727385	1347.302068	1367.081516	1370.724405	...	1335.126862	1353.444657	1370.810047	1363.184043	1359.446151	1357.661669	1345.801000	1338.053292	1374.938106	1287.444906
XYLt_pos	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000

2321 rows × 48 columns

We now turn to the COMPASS reaction scores generated from pseudobulked data and subset reaction scores from bulk samples that belong to epithelial cells of healthy patients.

healthy_pseudobulk_results = pseudobulk_results.loc[:, [('Healthy' in col_name) and ('Epithelial' in col_name) for col_name in pseudobulk_results.columns]]
healthy_pseudobulk_results = healthy_pseudobulk_results.reindex(sorted(healthy_pseudobulk_results.columns), axis=1)
healthy_pseudobulk_results

	N10.EpiA_Healthy_Epithelial	N10.EpiB_Healthy_Epithelial	N10.LPA_Healthy_Epithelial	N10.LPB_Healthy_Epithelial	N11.EpiA_Healthy_Epithelial	N11.EpiB_Healthy_Epithelial	N11.LPA_Healthy_Epithelial	N11.LPB_Healthy_Epithelial	N13.EpiA_Healthy_Epithelial	N13.EpiB_Healthy_Epithelial	...	N46.LPA_Healthy_Epithelial	N46.LPB_Healthy_Epithelial	N51.EpiA_Healthy_Epithelial	N51.EpiB_Healthy_Epithelial	N51.LPA_Healthy_Epithelial	N51.LPB_Healthy_Epithelial	N8.EpiA_Healthy_Epithelial	N8.EpiB_Healthy_Epithelial	N8.LPA_Healthy_Epithelial	N8.LPB_Healthy_Epithelial
13DAMPPOX_pos	49343.703873	49366.543477	50041.539245	48896.101991	49915.199936	49544.521493	55252.822187	51724.697713	49292.800450	49725.463232	...	49615.464446	49458.331646	49065.168090	48956.253915	50802.412675	50125.492275	49978.430510	50117.124899	52563.082215	50243.282087
24_25VITD2Hm_pos	13.191137	13.202484	13.468908	13.305291	13.190822	13.171305	13.495858	13.373125	13.208837	13.242782	...	13.231858	13.317501	13.110893	13.072062	13.540608	13.556481	13.269303	13.145874	13.691254	13.207760
24_25VITD3Hm_pos	76853.774149	77066.630672	77339.140491	76737.705259	77427.698122	76820.455562	83737.866095	80858.519301	77016.536850	77171.895298	...	77074.676834	77193.646905	76459.056213	76240.398839	78151.207576	77887.207890	77695.783513	77712.041731	81827.291025	78678.487240
25HVITD3c_pos	75378.549906	75639.551367	76022.411801	75443.360618	75893.512459	75347.874830	82949.178039	79438.997588	75541.606542	75916.749105	...	75700.927805	75831.984663	75214.726969	74966.300360	76907.534234	76506.953508	76310.966205	76373.965194	80478.975063	77357.180360
25HVITD3tin_pos	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	...	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000	1900.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
XYLK_pos	12057.939948	12080.690301	12004.401733	12057.076275	12208.845150	12184.657229	14292.701145	12979.042115	12207.686013	12317.468238	...	12320.942860	12163.425350	12205.258151	12176.458619	12555.580720	12176.649491	12277.383367	12195.708797	13120.942342	12656.104737
XYLUR_neg	2000.222456	2029.411271	2087.799752	2069.266719	2111.078437	2070.863760	1893.915551	1891.415234	2167.688474	2140.014923	...	2341.635492	2219.151900	2105.509925	2114.245069	2334.183506	2320.801652	2065.203448	2064.882459	2311.095617	2262.123789
XYLUR_pos	2520.287707	2534.215450	2598.169991	2481.268481	2590.286666	2582.026895	2484.274771	2515.142731	2636.772176	2672.278004	...	2726.333630	2592.140867	2615.184949	2614.880058	2582.766166	2618.800524	2576.986465	2574.977737	2439.604600	2620.344193
XYLt_neg	1169.780792	1176.271699	1266.960914	1232.795134	1168.668379	1169.676019	1354.200067	1231.614154	1182.765693	1187.261424	...	1196.303447	1232.240624	1190.140670	1186.010375	1279.905771	1282.794785	1184.830820	1181.022482	1310.780251	1178.624288
XYLt_pos	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000

2321 rows × 48 columns

We ensure that the column names of the selected single cell samples and pseudobulk samples align correctly so that the comparison is fair.

for col_1, col_2 in zip(healthy_sample_results.columns, healthy_pseudobulk_results.columns):
    assert col_1 in col_2

We define some helper functions that allow us to robustly compute the Spearman correlation.

NONCONSTANT_COLUMN_THRESHOLD = 1e-3


def are_equal(u: np.array, v: np.array) -> bool:
    r"""
    The business logic for determining whether two columns are equal lives here.
    """
    return np.max(np.abs(u - v)) < NONCONSTANT_COLUMN_THRESHOLD


def is_nonconstant(v: np.array) -> bool:
    r"""
    The business logic for determining whether a feature is nonconstant lives here.
    :param v: 1D array
    :return: True if the column is considered nonconstant.
    """
    return np.abs(np.nanmax(v) - np.nanmin(v)) > NONCONSTANT_COLUMN_THRESHOLD


def is_constant(v: np.array) -> bool:
    r"""
    The business logic for determining whether a feature is constant lives here.
    :param v: 1D array
    :return: True if the column is considered constant.
    """
    return not is_nonconstant(v)


def compute_spearman_r2(
        y_true: np.array,
        y_pred: np.array) -> float:
    r"""
    Standard Spearman R2 correlation: a number between -1 and 1.
    Note that:
        - predicted columns that are almost equal to the ground truth are declared Spearman R2 1.0
        - predicted columns that are not almost equal yet ground truth is constant, have Spearman R2 0.0.
        - for all other columns, the usual Spearman R2 is computed
    """
    assert(len(y_true) == len(y_pred))
    # Border case 0: there are no true labels or predictions. Return 0.0 (random)
    if len(y_true) == 0:
        return 0.0
    # Border case 1: predictions are very close to ground truth
    if are_equal(y_true, y_pred):
        return 1.0
    # Border case 2: predictions are not very close to ground truth, yet ground truth is constant
    if not are_equal(y_true, y_pred) and is_constant(y_true):
        return 0.0
    # Usual computation:
    # Border case: all true values or predicted values are constant (but the other is not)
    if np.std(y_true) == 0.0:  # pragma: no cover
        raise ValueError("This should never happen!")
    if np.std(y_pred) == 0.0:
        assert(np.std(y_true) != 0.0)
        return 0.0
    spearman_r2 = scipy.stats.spearmanr(y_true, y_pred).correlation

    if np.isnan(spearman_r2):  # pragma: no cover
        return 0.0
    return spearman_r2


def compute_spearman_r2s(
        X_true: np.array,
        X_pred: np.array)\
        -> np.array:
    r"""
    Computes Spearman R2 accross all columns.
    Note that:
        - predicted columns that are almost equal to the ground truth are declared Spearman R2 1.0
        - predicted columns that are not almost equal yet ground truth is constant, have Spearman R2 0.0.
        - for all other columns, the usual Spearman R2 is computed
    """
    nrows, ncols = X_true.shape
    spearman_r2s = np.zeros(shape=(ncols))
    for c in range(ncols):
        y_true = X_true[:, c].flatten()
        y_pred = X_pred[:, c].flatten()
        spearman_r2 = compute_spearman_r2(y_true, y_pred)
        spearman_r2s[c] = spearman_r2
    return spearman_r2s

As shown below, Compass scores generated from pseudobulked samples achieve a median Spearman correlation of 0.53 when compared to the ground truth Compass scores generated from single-cell samples. This demonstrates that pseudobulking does indeed yield significant speedups (3 hours v.s. 1 week!) while maintaining high accuracy.

spearman_r2s = compute_spearman_r2s(healthy_sample_results.transpose().to_numpy(), healthy_pseudobulk_results.transpose().to_numpy())
print(f'Median Spearman correlation: {np.median(spearman_r2s)}')
plt.violinplot(spearman_r2s, quantiles=[0.5])
plt.show()

Median Spearman correlation: 0.5326747720364742

../_images/701b2b786a003275a35ec6ab475ff878ec2d1d04679c4aa051c61478712e7b06.png

Compass Pseudobulking

Contents

Compass Pseudobulking#

Requirements#

Pseudobulking#

Run COMPASS#

Parse COMPASS results#