Assign cell types to groups of cells based on their gene expression profiles

I have a large, filtered and normalized scRNA-seq dataset from C. elegans. Rows are genes (10 000) and columns are cells (66 000). Say I have obtained 40 groups of cells based on their expression profiles: how can I now assign cell types to these groups? I would expect to take a reference dataset for C. elegans and compare it with my groups, but I honestly have no idea where to get such a dataset or whether any R packages do this. Any advice would be greatly appreciated.

I think the analysis will involve two steps:

1) Compile a reference list of cell types from the literature, with the genes expressed in each of them. C. elegans has only ~1000 cells, so I'm sure it's been well researched. For each cell type, mark which genes are merely expressed there and which are expressed only there; the latter make more useful markers.

Resources to do that:

This will be mostly manual work, but you may be able to find a published list.

For example (using human genes), platelets express VWF, and activated platelets express SELP.
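A minimal sketch of how such a reference list could be stored, and how the uniquely expressed genes could be separated from the merely expressed ones. The cell type and gene names are placeholders, not curated C. elegans data, and `unique_markers` is a hypothetical helper name.

```python
from collections import Counter

# Toy reference list: cell type -> genes reported as expressed there.
# All names are placeholders, not real C. elegans annotations.
reference = {
    "type_A": {"gene1", "gene2", "gene3"},
    "type_B": {"gene2", "gene4"},
    "type_C": {"gene3", "gene5", "gene6"},
}

def unique_markers(reference):
    """Return, per cell type, the genes that appear in no other cell type."""
    counts = Counter(g for genes in reference.values() for g in genes)
    return {
        cell_type: {g for g in genes if counts[g] == 1}
        for cell_type, genes in reference.items()
    }

unique = unique_markers(reference)
for cell_type in sorted(unique):
    print(cell_type, sorted(unique[cell_type]))
```

Here gene2 and gene3 are shared between cell types, so they are dropped from the unique sets and only the remaining genes are kept as strong markers.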

2) Use one of the single-cell RNA-seq packages (e.g. Seurat) to cluster cells and to find the "marker" (DE) genes for each cluster (vs. all other clusters).
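To make the "cluster vs. all other clusters" comparison concrete, here is a rough sketch that ranks genes by log2 fold change of mean expression. This is a deliberately simplified stand-in, not what Seurat does internally (a real analysis would use a proper statistical test); the data, the threshold and the function names are all assumptions for illustration.

```python
import math

# Toy normalized expression: gene -> one value per cell (placeholder data).
expression = {
    "gene1": [5.0, 6.0, 0.0, 0.0, 0.0, 1.0],
    "gene2": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "gene3": [0.0, 0.0, 4.0, 5.0, 0.0, 0.0],
    "gene4": [0.0, 1.0, 0.0, 0.0, 7.0, 6.0],
}
clusters = [0, 0, 1, 1, 2, 2]  # cluster label of each cell, same order as above

def mean(values):
    return sum(values) / len(values) if values else 0.0

def cluster_markers(expression, clusters, min_log2fc=1.0, pseudocount=1.0):
    """For each cluster, list genes enriched versus all other cells."""
    markers = {}
    for c in sorted(set(clusters)):
        scored = []
        for gene, values in expression.items():
            inside = [v for v, lab in zip(values, clusters) if lab == c]
            outside = [v for v, lab in zip(values, clusters) if lab != c]
            # The pseudocount avoids division by zero for unexpressed genes.
            log2fc = math.log2(
                (mean(inside) + pseudocount) / (mean(outside) + pseudocount)
            )
            if log2fc >= min_log2fc:
                scored.append((log2fc, gene))
        # Strongest enrichment first.
        markers[c] = [gene for _, gene in sorted(scored, reverse=True)]
    return markers

markers = cluster_markers(expression, clusters)
print(markers)
```

gene2 is expressed equally everywhere, so it is not a marker for any cluster, which is the behaviour you want from a "vs. all others" comparison.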

For each cluster, look up the marker genes in the reference list and infer the cell type.
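One way to make this lookup less subjective is to score the overlap between each cluster's marker genes and each reference cell type, for instance with a hypergeometric test. The sketch below assumes a background of 10 000 genes (the number of rows in the question); the gene sets and function names are placeholders, and the scoring choice is mine rather than something prescribed by any particular package.

```python
from math import comb

def hypergeom_pvalue(overlap, n_markers, n_reference, n_background):
    """P(X >= overlap) when n_markers genes are drawn from n_background,
    of which n_reference belong to the reference cell type."""
    upper = min(n_markers, n_reference)
    total = comb(n_background, n_markers)
    tail = sum(
        comb(n_reference, k) * comb(n_background - n_reference, n_markers - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

def annotate(markers_by_cluster, reference, n_background):
    """Pick, for each cluster, the reference cell type with the smallest p-value."""
    calls = {}
    for cluster, genes in markers_by_cluster.items():
        genes = set(genes)
        scores = []
        for cell_type, ref_genes in reference.items():
            overlap = len(genes & ref_genes)
            p = hypergeom_pvalue(overlap, len(genes), len(ref_genes), n_background)
            scores.append((p, cell_type, overlap))
        calls[cluster] = min(scores)  # (p-value, cell type, overlap size)
    return calls

# Placeholder inputs, not real annotations.
reference = {
    "type_A": {"gene1", "gene2", "gene3"},
    "type_B": {"gene4", "gene5"},
}
markers_by_cluster = {0: ["gene1", "gene3", "gene9"], 1: ["gene4", "gene5"]}

calls = annotate(markers_by_cluster, reference, n_background=10000)
for cluster, (p, cell_type, overlap) in calls.items():
    print(cluster, cell_type, overlap, p)
```

With 40 clusters and many reference cell types you would also want to correct the p-values for multiple testing, and to inspect clusters whose best match is weak rather than trusting the top hit blindly.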

You may need to infer the cell type from the functions and descriptions of the genes. Also see "Retrieve detailed gene descriptions".

Keep in mind that the dataset will contain many confusing doublets: profiles that actually come from two cells captured together and so express the characteristic genes of both.
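As a naive illustration of how the reference list can help here, the sketch below flags cells in which unique markers of two or more cell types are detected. It is a crude heuristic under made-up data and hypothetical names, not a substitute for a dedicated doublet-detection method.

```python
# Unique markers per cell type (placeholder names).
unique_by_type = {
    "type_A": {"gene1"},
    "type_B": {"gene4"},
    "type_C": {"gene5", "gene6"},
}

# Genes detected in each cell (placeholder data).
cells = {
    "cell_1": {"gene1", "gene2"},
    "cell_2": {"gene1", "gene2", "gene4"},
    "cell_3": {"gene5", "gene6"},
}

def matched_types(detected, unique_by_type, min_hits=1):
    """Cell types with at least min_hits unique markers detected in the cell."""
    return sorted(
        cell_type
        for cell_type, genes in unique_by_type.items()
        if len(detected & genes) >= min_hits
    )

def flag_doublets(cells, unique_by_type, min_hits=1):
    """Return cells that match two or more cell types, with the matched types."""
    flagged = {}
    for cell, detected in cells.items():
        types = matched_types(detected, unique_by_type, min_hits)
        if len(types) >= 2:
            flagged[cell] = types
    return flagged

suspects = flag_doublets(cells, unique_by_type)
print(suspects)
```

Raising `min_hits` makes the flagging stricter, which matters in practice because a single stray read of a marker gene does not necessarily mean a second cell was present.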

You may also need to re-do t-SNE and clustering on a subset of cells, e.g. blood/immune cells, in an iterative, hierarchical way.
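The bookkeeping for such an iterative pass can be sketched as follows: cluster once, pull out the cells of one broad group, cluster those again, and write the finer labels back. The function `split_by_gene` is only a toy stand-in for a real clustering step, and the data and thresholds are invented for illustration.

```python
def subset_cells(expression, labels, keep_label):
    """Keep only the cells carrying keep_label; also return their original indices."""
    idx = [i for i, lab in enumerate(labels) if lab == keep_label]
    sub = {gene: [values[i] for i in idx] for gene, values in expression.items()}
    return sub, idx

def split_by_gene(expression, gene, threshold):
    """Toy stand-in for clustering: label each cell by one gene's expression."""
    return [int(v > threshold) for v in expression[gene]]

# Placeholder expression data: gene -> one value per cell.
expression = {
    "gene1": [5.0, 6.0, 4.0, 0.0, 0.0, 7.0],
    "gene2": [0.0, 5.0, 0.0, 1.0, 0.0, 6.0],
}

# Round 1: coarse grouping of all cells.
round1 = split_by_gene(expression, "gene1", 2.0)

# Round 2: re-cluster only the cells in coarse group 1.
sub, idx = subset_cells(expression, round1, keep_label=1)
round2 = split_by_gene(sub, "gene2", 2.0)

# Write hierarchical labels such as "1.0" and "1.1" back to the full cell list.
final = [str(lab) for lab in round1]
for i, sub_lab in zip(idx, round2):
    final[i] = f"{round1[i]}.{sub_lab}"

print(final)
```

Keeping the original indices alongside the subset is the important part: it lets you map the finer sub-cluster labels back onto the full dataset after each round.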

Do you have information on which cells are from the same worm? Once you have a reference list, you can try and identify cells of just one worm.

This is an old post, but we have recently published a web-based tool called CIPR (cluster identity predictor) that enables users to upload their data along with a custom reference file for annotating unknown clusters. You can probably obtain relevant reference datasets from GEO.

You can read the manuscript in BMC Bioinformatics for more information about how the algorithm works.

