Pipeline overview

Given a cell-by-gene matrix, this Scanpy-based pipeline performs data preprocessing (filtering, linear regression, and normalization), dimensionality reduction (PCA), clustering (Louvain/Leiden/k-means) to group the cells into different cell types, and visualization of those clusters (UMAP/t-SNE). The following block diagram illustrates the pipeline.

<img src="https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/blob/main/images/scrnaseq-analysis.jpg"/>
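
For reference, the analysis stages above map roughly onto standard Scanpy calls as in the sketch below. This is not the pipeline's exact code: the filtering thresholds, the regression covariate, the number of principal components, and the choice of Leiden clustering are illustrative assumptions.

import scanpy as sc

adata = sc.read_h5ad("data/1M_brain_cells_10X.sparse.h5ad")   # cell-by-gene matrix (example dataset downloaded below)

# Preprocessing: filtering, normalization, linear regression, scaling
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.regress_out(adata, ["n_counts"])     # assumes an 'n_counts' column exists in adata.obs
sc.pp.scale(adata)

# Dimensionality reduction
sc.tl.pca(adata, n_comps=50)

# Clustering (Leiden shown; Louvain or k-means on adata.obsm["X_pca"] are alternatives)
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50)
sc.tl.leiden(adata)

# Visualization
sc.tl.umap(adata)
sc.pl.umap(adata, color="leiden")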

Download entire repository

cd ~
wget https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/releases/download/3.0/Source_code_with_submodules.tar.gz
tar -xzf Source_code_with_submodules.tar.gz
cd ~/Open-Omics-Acceleration-Framework/pipelines/single-cell-RNA-seq-analysis

Instructions to Run

This pipeline can be run in three ways: (1) in a Docker container, either (1a) interactively or (1b) non-interactively; (2) from the provided Anaconda environment file; or (3) by creating the Anaconda environment manually.

(Option 1a): Run with Jupyter notebook (interactive)

cd ~/Open-Omics-Acceleration-Framework/pipelines/single-cell-RNA-seq-analysis/
docker build -t scanpy .           # Create a docker image named scanpy

# Download dataset
wget -P ~/Open-Omics-Acceleration-Framework/pipelines/single-cell-RNA-seq-analysis/data https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad

docker run -it -p 8888:8888 -v ~/Open-Omics-Acceleration-Framework/pipelines/single-cell-RNA-seq-analysis/data:/data scanpy   # Run the Docker container with the data folder mounted as a volume
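
The container is expected to launch Jupyter and print a URL containing an access token; open that URL in a browser on the host (the command above maps port 8888) and run the notebook against the mounted /data folder.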

(Option 1b): Run in non-interactive mode

export DATA_DIR=<path-to-data-directory>
export OUTPUT_DIR=<path-to-output-directory>
mkdir -p $OUTPUT_DIR
cd ~/Open-Omics-Acceleration-Framework/pipelines/single-cell-RNA-seq-analysis/

docker build -f Dockerfile.python -t scanpy_python . # Create a docker image named scanpy_python

# Download dataset
wget -P  $DATA_DIR https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad

docker run -v $OUTPUT_DIR:/Open-Omics-Acceleration-Framework/pipelines/single-cell-RNA-seq-analysis/notebooks/figures  -v $DATA_DIR:/data  -it scanpy_python
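
Because the container's notebooks/figures directory is mounted from $OUTPUT_DIR and /data from $DATA_DIR, the non-interactive run should read the downloaded .h5ad from $DATA_DIR and leave any generated figures under $OUTPUT_DIR on the host.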

(Option 2): Create an Anaconda environment from file

conda env create --name=single_cell -f environment.yml
conda activate single_cell

Copy the provided _t_sne.py file into the Anaconda environment's daal4py package

cp _t_sne.py ~/anaconda3/envs/single_cell/lib/python3.8/site-packages/daal4py/sklearn/manifold/

Install umap_extend and umap


pip uninstall umap-learn
cd ~/Open-Omics-Acceleration-Framework/lib/tal/applications/UMAP_fast/umap_extend
python setup.py install                          # Uncomment the AVX-512 lines in setup.py before this step on AVX-512 machines

cd ~/Open-Omics-Acceleration-Framework/lib/tal/applications/UMAP_fast/umap
python setup.py install                          # use python setup.py install if the environment will be relocated with conda-pack
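
A quick way to confirm that the environment picks up the patched daal4py t-SNE and the custom UMAP build (rather than the PyPI umap-learn) is an import check like the one below; treat it as a sanity check run inside the activated environment, not part of the pipeline.

from daal4py.sklearn.manifold import TSNE    # should load the replaced _t_sne.py
import umap                                  # should point into the UMAP_fast build

print(TSNE)
print(umap.__file__)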

Example Dataset

The dataset was made publicly available by 10X Genomics. Use the following command to download the count matrix for this dataset and store it in the data folder:

wget -P ~/Open-Omics-Acceleration-Framework/pipelines/single-cell-RNA-seq-analysis/data https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad
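
After the download finishes, the matrix can be loaded with Scanpy to confirm it is readable; the relative path below is an assumption and should be adjusted to wherever the file was stored.

import scanpy as sc

adata = sc.read_h5ad("data/1M_brain_cells_10X.sparse.h5ad")
print(adata)          # summary: number of cells, number of genes, stored annotations
print(adata.shape)    # (n_cells, n_genes)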

Setup and run

export NUMEXPR_MAX_THREADS=56          # set equal to the number of threads on a single socket
export NUMBA_NUM_THREADS=56            # delete the __pycache__ folders from the local directory and from umap/umap/ if you increase the number of threads

# Also set sc.settings.n_jobs=56 inside 1M_brain_cpu_analysis.py to use the same number of threads

cd ~/Open-Omics-Acceleration-Framework/pipelines/single-cell-RNA-seq-analysis/notebooks/

# Or run the Jupyter notebook, which has the scikit-learn patch in it:
# from sklearnex import patch_sklearn
# patch_sklearn()

jupyter notebook
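
If the analysis is run as a plain Python script rather than through the notebook, the same patch and thread setting can be applied at the top of the script; the sketch below assumes the 56-thread value from the exports above, which should match your machine.

from sklearnex import patch_sklearn
patch_sklearn()              # route supported scikit-learn calls through the Intel extension

import scanpy as sc
sc.settings.n_jobs = 56      # match NUMEXPR_MAX_THREADS / NUMBA_NUM_THREADS set above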

(Option 3): Create the Anaconda environment manually

conda create --name single_cell python=3.8.0
conda activate single_cell

Install the necessary Scanpy dependencies

conda install -y seaborn=0.12.2 scikit-learn=1.0.2 statsmodels=0.13.2 numba=0.53.1 pytables=3.7.0 matplotlib-base=3.6.2 pandas=1.5.2
conda install -y -c conda-forge mkl-service=2.4.0
conda install -y -c conda-forge python-igraph=0.10.3 leidenalg=0.9.1
conda install -y -c conda-forge cython=0.29.33 jinja2=3.1.2 clang-tools=15.0.7
conda install -y -c katanagraph/label/dev -c conda-forge katana-python

Install scanpy

pip install scanpy==1.8.1

Install the Intel Extension for Scikit-learn (pip version)

pip install scikit-learn-intelex==2023.0.1

Install other packages

pip install pybind11
pip install jupyterlab
pip install wget

Copy the provided _t_sne.py file into the Anaconda environment's daal4py package

cp _t_sne.py ~/anaconda3/envs/single_cell/lib/python3.8/site-packages/daal4py/sklearn/manifold/

Install umap_extend and umap


pip uninstall umap-learn
cd ~/Open-Omics-Acceleration-Framework/lib/tal/applications/UMAP_fast/umap_extend
python setup.py install                          # Uncomment the AVX-512 lines in setup.py before this step on AVX-512 machines

cd ~/Open-Omics-Acceleration-Framework/lib/tal/applications/UMAP_fast/umap
python setup.py install                          # use python setup.py install if the environment will be relocated with conda-pack
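
As with Option 2, the import check shown there can be used to confirm that the patched daal4py t-SNE and the custom UMAP build are the ones on the path.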