In the spirit of “Learning in Public” and “learning exhaust,” I’m going to start adding “Lab Notes” blog posts that chronicle the little discoveries and failures I hit along the way to a larger goal. This is the first!
Background
I’m collaborating with Eugene Bolotin from miraomics.bio to understand the AI landscape for Single Cell and Spatial Transcriptomics. If you don’t understand what that means, we’re going to do a series of blog posts soon that give lots of details to get the AI and Bio people on the same page.
We decided to get started on a pretty straightforward problem that is quite useful commercially: cell-type annotation. Cell-type annotation means using the RNA expression levels in a cell to infer what type of cell it is. This has been very well studied, and there are lots of traditional machine learning algorithms for it, but recent Transformer-based methods have been claiming SoTA results. So we picked two very different techniques to evaluate, with the goal of exploring enough of the landscape to get a feel for what actually works best on large, messy, real-world datasets.
The Incumbent: scGPT
scGPT ( github ) is a decoder-style Transformer model with a specialized attention masking scheme to account for the fact that gene expression data is unordered. It also has a class token for classification and embedding tasks, and the authors included several pre-trained embedding models for cell-type annotation, among other tasks.
For the AI people, the paper is worth reading just to understand the attention mechanism and interesting input embeddings they’ve used for this problem.
We will just use the pre-trained embeddings to get started with our evaluation.
The Challenger: GenePT
GenePT ( github ) takes a very different approach. GenePT leverages the wealth of scientific and genetic data used during the training of the OpenAI embedding models. The main idea is quite straightforward, and captured in the following diagram from the paper:
To create a gene embedding (a in the diagram), the authors fetch a description of the gene from the NCBI database and feed it into the OpenAI text embedding model. To create a cell embedding (b in the diagram), they take a sum of the gene embeddings for all the genes in a cell, weighted by the expression levels, and normalize the result.
There is a c in the original diagram, but it’s not relevant here.
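To make (b) concrete, here is a minimal sketch of the idea in NumPy. The variable names are mine, not GenePT’s:

import numpy as np

def cell_embedding(expr, gene_embs):
    # expr:      (n_genes,) expression levels for one cell
    # gene_embs: (n_genes, d) text embedding for each gene
    v = expr @ gene_embs  # expression-weighted sum over genes -> (d,)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v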
Cell-type classification
Both papers go into a lot of detail about many different useful things the model can do, and over time we will get familiar with these and other models across many of these tasks. For now we are focused on cell-type classification. The GenePT paper has this table in an appendix:
According to this table, GenePT wins or almost wins for many of the data sets, with scGPT clearly winning on the MS dataset, and the Ensemble doing well on everything. This inspired us to try out scGPT and GenePT against a different data set and see how they compare “in the wild”.
The Benchmark: Tabula Sapiens
Tabula Sapiens is a carefully curated set of 1.1M cells from 28 organs of 24 normal human subjects. It has labels for 180 cell types across 40 classes of cell. Here’s a very cool browser of the cells to explore the dataset: https://cellxgene.cziscience.com/e/53d208b0-2cfd-4366-9866-c3c6114081bc.cxg/
Here’s the plan:
embed a subset, the first 100K cells (so we can load and experiment quickly), using each pretrained embedding model
select a few different donors to form a few holdout sets
for each donor in the holdout set, train a classifier on the other donors in the 100K subset and test against the holdout donor
compare precision, recall and f1 across donors and classifiers
I group the cells into holdout sets by donor because cells from the same donor will presumably be pretty similar; letting them leak across the train/test split would inflate results that won’t generalize to real-world scenarios.
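In code, the donor split is just a boolean mask over the cells. A sketch, assuming adata is the loaded AnnData, with embeddings and labels standing in for whichever embedding matrix and cell-type labels we are evaluating:

import numpy as np

donors = adata.obs["donor_id"].to_numpy()
test_mask = donors == "TSP1"  # hold out one donor entirely

X_train, y_train = embeddings[~test_mask], labels[~test_mask]
X_test, y_test = embeddings[test_mask], labels[test_mask]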
Those of you familiar with scGPT have probably realized that there is a problem: the scGPT pretrained embeddings were actually finetuned on Tabula Sapiens 🤦. We figured this out after the analysis, but it turns out the results are still interesting, if a bit more ambiguous because of this. We’ll discuss this at the bottom.
Step 1.1: embed cells using scGPT
This turns out to be hard to set up and easy to do once you have it working. I cheated and used a prebuilt container on https://latch.bio/, but there are good instructions in this blog post too: A Step-by-Step Guide to install scGPT, including a link to a pre-built container. Once you load the data set (see load_subset_anndata and tabula_sapiens_embed_scgpt.ipynb), you can basically just embed your entire dataset with one call (if it fits in memory):
import scgpt as scg

# adata_filtered: the loaded AnnData subset; model_dir: path to the
# pre-trained scGPT checkpoint; gene_col: the adata.var column with gene names
ref_embed_adata = scg.tasks.embed_data(
    adata_filtered,
    model_dir,
    gene_col=gene_col,
    batch_size=64,
)
On an NVIDIA A10G GPU, this will embed about 325 cells / second.
Here is the obligatory UMAP of 10,000 samples from the embeddings, colored by cell type.
It looks like there is pretty good separation between some classes and not others. This may be because of the low number of examples in the sample.
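For reference, a plot like this takes only a few lines with umap-learn and matplotlib. A sketch, assuming the embeddings have been pulled out into an (n_cells, d) array with matching cell-type labels (not necessarily how my figure was produced):

import numpy as np
import umap
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
idx = rng.choice(embeddings.shape[0], size=10_000, replace=False)
coords = umap.UMAP().fit_transform(embeddings[idx])

# One scatter per cell type so the legend is readable
for ct in np.unique(labels[idx]):
    m = labels[idx] == ct
    plt.scatter(coords[m, 0], coords[m, 1], s=2, label=ct)
plt.legend(markerscale=4, fontsize=6)
plt.show()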
Step 1.2: embed cells using GenePT
This is slightly more complicated, but I also provide an example in tabula_sapiens_embed_genept.ipynb. Most of the complication comes from needing to break ties between Ensembl IDs (i.e. consistent identifiers for genes) that map to multiple gene-name embeddings. This is necessary because the Tabula Sapiens dataset identifies genes by Ensembl ID. I break ties by taking the mean of the duplicate embeddings, since presumably they will be similar if they are duplicate names for the same gene.
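Here is a sketch of the tie-breaking. name_to_ensembl is a hypothetical lookup table; the real mapping logic lives in tabula_sapiens_embed_genept.ipynb:

import numpy as np

# merged_embeddings: dict of gene name -> embedding vector
# name_to_ensembl:   hypothetical dict of gene name -> Ensembl ID
by_ensembl = {}
for name, emb in merged_embeddings.items():
    ens = name_to_ensembl.get(name)
    if ens is not None:
        by_ensembl.setdefault(ens, []).append(np.asarray(emb))

# Duplicate names for the same gene should embed similarly, so average them
ensembl_embeddings = {ens: np.mean(embs, axis=0) for ens, embs in by_ensembl.items()}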
Now we can make two function calls:
embedding_matrix, valid_indices = create_embedding_matrix(
    merged_embeddings, major_ensembl_ids
)
This creates the embedding matrix, which maps gene expressions to embeddings. Next we do the embedding.
import time

# Time the embedding creation
start_time = time.time()
cell_embeddings = create_cell_embeddings(
    adata_filtered.X, embedding_matrix, valid_indices
)
end_time = time.time()

# Calculate throughput
total_time = end_time - start_time
cells_per_second = cell_embeddings.shape[0] / total_time
print(f"{cells_per_second:.0f} cells / second")
For 100K cells, this takes about 3m42s, a rate of about 450 cells / second on a MacBook Pro M3 with 16 GB of RAM. So GenePT is dramatically faster!
And a UMAP of 10000 samples from the embedding:
These samples are clearly more spread out, but clustering seems to be pretty good within a class. Again, some classes will be easily confused while others will be pretty easy to distinguish. It’s hard to tell visually whether one embedding will be better than the other, but they are clearly very different embeddings. Interesting!
Step 2: select holdout donors
There is a sharp imbalance in the number of cells in different broad cell classes in our data subset, to say nothing of the rare cell types:
We probably need to be pretty careful with the donors we select, then. Let’s take a look at cell class counts vs donors.
TSP1, TSP2 and TSP14 seem to have pretty good overlap, and pretty good levels of most classes, with a few exceptions. Let’s keep an eye on kidney epithelial cells, for example, since they are only really present in TSP2.
Step 3-4: train and measure some classifiers
I’ve found that Random Forest and Gradient Boosting each win in different circumstances, and KNN is typically used for these kinds of cell classification tasks, so I tried all three types of models. I did not do any hyper-parameter tuning to start with. As described above, I used a variant of cross-validation in which I select a few holdout groups by donor_id (TSP1, TSP2 or TSP14) and judge the performance on each of those, training on the rest of the cells.
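The evaluation loop looks roughly like this. A sketch, assuming X, y and donors are aligned arrays of embeddings, cell-class labels and donor IDs:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_recall_fscore_support
from lightgbm import LGBMClassifier

models = {
    "rf": RandomForestClassifier(n_jobs=-1),
    "lgbm": LGBMClassifier(),
    "knn": KNeighborsClassifier(),
}
for holdout in ["TSP1", "TSP2", "TSP14"]:
    test = donors == holdout  # train on all other donors
    for name, model in models.items():
        model.fit(X[~test], y[~test])
        p, r, f1, _ = precision_recall_fscore_support(
            y[test], model.predict(X[test]), average="macro", zero_division=0
        )
        print(f"{holdout} {name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")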
One more experiment I added was to concatenate the GenePT and scGPT vectors and classify those. In theory, this lets the classifier pick whichever model is most informative for a particular cell type.
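The concatenation itself is a one-liner (variable names assumed):

import numpy as np

# One row per cell: [genept dims | scgpt dims]
combined = np.concatenate([genept_embeddings, scgpt_embeddings], axis=1)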
The results of the experiment are in this spreadsheet.
This is a bit hard to read in my browser, so you may want to click through to the spreadsheet, but I’ll give the overview. On the left we see all the different experiments and metrics: precision, recall and f1 score for GenePT, scGPT and the combined embeddings, for both LightGBM and Random Forests. I skipped showing KNN because it performs significantly worse than the other two models in all cases.
On the top we see the three donors, and each cell class that had at least 600 cells in the data set, with all the other cell types put into an “other” category. On the bottom, we see the training and test set sizes. Since Random Forest will tend to do better with a full training set even if it is imbalanced, whereas LightGBM will do better if the classes are roughly the same size, I also created a “balanced” training set which caps the size of any class at 1000 samples.
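The balancing is a simple per-class cap. A sketch; the sampling details in my notebook may differ:

import numpy as np

def balance_classes(X, y, cap=1000, seed=0):
    # Cap each class at `cap` samples to reduce imbalance for LightGBM
    rng = np.random.default_rng(seed)
    keep = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        if len(idx) > cap:
            idx = rng.choice(idx, size=cap, replace=False)
        keep.append(idx)
    keep = np.concatenate(keep)
    return X[keep], y[keep]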
On the right we see the average performance across all of the classes three ways: unweighted, weighted by the number of test samples in each class, and unweighted but excluding test sets smaller than 200 samples (to avoid noise). This last result is probably the most useful, unless we are primarily interested in the small cell classes, in which case we should probably tailor the algorithms to that.
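Those three summaries are easy to compute from per-class scores. A sketch, assuming f1 and n_test are aligned per-class arrays:

import numpy as np

unweighted = f1.mean()
weighted = np.average(f1, weights=n_test)
trimmed = f1[n_test >= 200].mean()  # drop noisy classes with small test sets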
The colored tiles in the middle heatmap show the values of precision, recall and f1 score for individual cell types and experiments. Many of the red columns are actually for cell types that have few samples in the test set, so let’s hide those.
This looks a lot more green! It looks like there is still pretty significant variation for all models between donors. This may be biologically rooted, or it could be related to batch effects. This is something to study further.
Another thing to notice is that GenePT does basically as well as scGPT, despite being dramatically faster, and scGPT having been trained on the test set!
Conclusion
Lots more to do, including:
follow up on batch effects
explore algorithms for improving accuracy on the heavy tail of rare cell types
explore fine-tuning hyper-parameters for LightGBM
find a better data set to compare, since scGPT was trained on the test set
explore other tasks besides cell-type classification
I also need to describe how we build our own custom GenePT embeddings. That will come in another Lab Note shortly!
In the meantime, I’d love to hear your thoughts and criticism. Comment here, or find me at