Arc Virtual Cell Atlas: scRNA-seq

The Arc Virtual Cell Atlas hosts one of the biggest collections of scRNA-seq datasets.

Lamin mirrors the dataset for simplified access here: laminlabs/arc-virtual-cell-atlas.

If you use the data academically, please cite the original publications, Youngblut et al. (2025) and Zhang et al. (2025).

Connect to the source instance.

# pip install 'lamindb[jupyter,bionty,wetlab,gcp]'
!lamin connect laminlabs/arc-virtual-cell-atlas
 connected lamindb: laminlabs/arc-virtual-cell-atlas

Note

If you want to transfer artifacts or metadata into your own instance, call .using("laminlabs/arc-virtual-cell-atlas") when accessing registries and then call .save() on the queried records (see Transfer data).

import lamindb as ln
import bionty as bt
import wetlab as wl
import pyarrow.compute as pc
import anndata as ad
 connected lamindb: laminlabs/arc-virtual-cell-atlas

Tahoe-100M

project_tahoe = ln.Project.get(name="Tahoe-100M")
project_tahoe
Project(uid='H5MwZwyA62rG', name='Tahoe-100M', is_type=False, url='https://arcinstitute.org/tools/virtualcellatlas', space_id=1, created_by_id=1, created_at=2025-02-26 16:03:40 UTC)
# one collection in this project
project_tahoe.collections.df()
uid key description hash reference reference_type space_id meta_artifact_id version is_latest run_id created_at created_by_id _aux _branch_code
id
1 BpavRL4ntRTzWEE50000 tahoe100 None GCLk4ZgQxgWspjmEUk3gIg None None 1 None 2025-02-25 True 3 2025-02-26 13:51:22.787537+00:00 1 None 1

Every individual dataset in the atlas is an .h5ad file that is registered as an artifact in LaminDB.

Artifact-level metadata is registered and can be explored as follows:

# get the collection: https://lamin.ai/laminlabs/arc-virtual-cell-atlas/collection/BpavRL4ntRTzWEE5
collection_tahoe = ln.Collection.get(key="tahoe100")
# 14 artifacts in this collection, each corresponding to one plate
artifacts_tahoe = collection_tahoe.artifacts.distinct()
artifacts_tahoe.df()
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux _branch_code
id
1362 56uA9lPPmJ4zLUcr0000 2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WS... None .h5ad dataset AnnData 26536400717 j1FXsX7hs7u+eBqnWnmNHw None 8044908 md5 False False 1 2 3 None True 1 2025-02-25 23:22:17.849980+00:00 1 None 1
1365 9L9HZ55HqUL0aqaR0000 2025-02-25/h5ad/plate13_filt_Vevo_Tahoe100M_WS... None .h5ad dataset AnnData 28071589885 RKOiaay+CHvv+Ukk/N+28A None 8501658 md5 False False 1 2 3 None True 1 2025-02-25 23:22:18.977981+00:00 1 None 1
1372 aAHQ3zbD7n1asyYr0000 2025-02-25/h5ad/plate6_filt_Vevo_Tahoe100M_WSe... None .h5ad dataset AnnData 28934897078 NYvQEqVClziHm0ozWhOw1w None 7545393 md5 False False 1 2 3 None True 1 2025-02-25 23:22:21.629962+00:00 1 None 1
1367 aJIqo7bNyJAs9z0r0000 2025-02-25/h5ad/plate1_filt_Vevo_Tahoe100M_WSe... None .h5ad dataset AnnData 19070623904 9iCNcouMqfNS3HA/2GUWOA None 5481420 md5 False False 1 2 3 None True 1 2025-02-25 23:22:19.737995+00:00 1 None 1
1375 BDttiuV3Te8VB0dU0000 2025-02-25/h5ad/plate9_filt_Vevo_Tahoe100M_WSe... None .h5ad dataset AnnData 18791302576 4kHbVbmreg6akW6ZgsjxaA None 5866669 md5 False False 1 2 3 None True 1 2025-02-25 23:22:22.759201+00:00 1 None 1
1374 czC19UpUEszVH2bU0000 2025-02-25/h5ad/plate8_filt_Vevo_Tahoe100M_WSe... None .h5ad dataset AnnData 30390935958 ilAzEPIh4FlDeTFaJ1dILw None 8880979 md5 False False 1 2 3 None True 1 2025-02-25 23:22:22.387666+00:00 1 None 1
1373 DC5cacdJr1VoEXnl0000 2025-02-25/h5ad/plate7_filt_Vevo_Tahoe100M_WSe... None .h5ad dataset AnnData 16514746341 NOS4MY6eYYPOnAB8ViyWYg None 5692117 md5 False False 1 2 3 None True 1 2025-02-25 23:22:22.009157+00:00 1 None 1
1371 EZATJLC4jE7pmwo40000 2025-02-25/h5ad/plate5_filt_Vevo_Tahoe100M_WSe... None .h5ad dataset AnnData 19763140865 VMBKFzOI5cj7UC1UDENP4A None 6419498 md5 False False 1 2 3 None True 1 2025-02-25 23:22:21.255154+00:00 1 None 1
1363 omn7JStfJMzy8m6O0000 2025-02-25/h5ad/plate11_filt_Vevo_Tahoe100M_WS... None .h5ad dataset AnnData 23230802756 N2mzoYlMLEl6PdecaYyDvw None 7435869 md5 False False 1 2 3 None True 1 2025-02-25 23:22:18.229629+00:00 1 None 1
1364 S2h2rPLCaUhZAM9u0000 2025-02-25/h5ad/plate12_filt_Vevo_Tahoe100M_WS... None .h5ad dataset AnnData 37495736876 VjAkWVFGVpzAMi9Innusuw None 10487057 md5 False False 1 2 3 None True 1 2025-02-25 23:22:18.600910+00:00 1 None 1
1370 tKTeff0ugWqAm4P70000 2025-02-25/h5ad/plate4_filt_Vevo_Tahoe100M_WSe... None .h5ad dataset AnnData 23292672278 BkBXznbSovNWXtzPFITPcQ None 7004356 md5 False False 1 2 3 None True 1 2025-02-25 23:22:20.879928+00:00 1 None 1
1366 vn5cUJCHbjpPPsZx0000 2025-02-25/h5ad/plate14_filt_Vevo_Tahoe100M_WS... None .h5ad dataset AnnData 22427932564 FrnStRehP16siRGG35ou+g None 6518806 md5 False False 1 2 3 None True 1 2025-02-25 23:22:19.357999+00:00 1 None 1
1369 XVSrkq9pyF1OBLgG0000 2025-02-25/h5ad/plate3_filt_Vevo_Tahoe100M_WSe... None .h5ad dataset AnnData 13173722269 Jnrt7DaSUCGn8D8LS2itaw None 4705402 md5 False False 1 2 3 None True 1 2025-02-25 23:22:20.497965+00:00 1 None 1
1368 ZFeVfd0ugAHeWCxm0000 2025-02-25/h5ad/plate2_filt_Vevo_Tahoe100M_WSe... None .h5ad dataset AnnData 29037152127 usxviuqGbuw0RYnECCVCWw None 8064658 md5 False False 1 2 3 None True 1 2025-02-25 23:22:20.113956+00:00 1 None 1

50 cell lines.

artifacts_tahoe.list("cell_lines__name")[:5]
['A-172', 'A-427', 'A498', 'A549', 'AN3 CA']

380 compounds.

artifacts_tahoe.list("compounds__name")[:5]
['18β-Glycyrrhetinic acid',
 '4EGI-1',
 '5-Azacytidine',
 '5-Fluorouracil',
 '8-Hydroxyquinoline']

1,138 perturbations.

artifacts_tahoe.list("compound_perturbations__name")[:5]
["[('18β-Glycyrrhetinic acid', 0.05, 'uM')]",
 "[('18β-Glycyrrhetinic acid', 0.5, 'uM')]",
 "[('18β-Glycyrrhetinic acid', 5.0, 'uM')]",
 "[('4EGI-1', 0.05, 'uM')]",
 "[('4EGI-1', 0.5, 'uM')]"]
# check the curated metadata of the first artifact
artifact1 = artifacts_tahoe[0]
artifact1.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = '56uA9lPPmJ4zLUcr0000'
│   ├── .key = '2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WServicesFrom_ParseGigalab.h5ad'
│   ├── .size = 26536400717
│   ├── .hash = 'j1FXsX7hs7u+eBqnWnmNHw'
│   ├── .n_observations = 8044908
│   ├── .path = gs://arc-ctc-tahoe100/2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WServicesFrom_ParseGigalab.h5ad
│   ├── .created_by = sunnyosun (Sunny Sun)
│   ├── .created_at = 2025-02-25 23:22:17
│   └── .transform = 'Register Tahoe-100M'
├── Dataset features/.feature_sets
│   ├── var62710                 [bionty.Gene.stable_id]                                             
│   │   TSPAN6                      float                                                               
│   │   TNMD                        float                                                               
│   │   DPM1                        float                                                               
│   │   SCYL3                       float                                                               
│   │   C1orf112                    float                                                               
│   │   FGR                         float                                                               
│   │   CFH                         float                                                               
│   │   FUCA2                       float                                                               
│   │   GCLC                        float                                                               
│   │   NFYA                        float                                                               
│   │   STPG1                       float                                                               
│   │   NIPAL3                      float                                                               
│   │   LAS1L                       float                                                               
│   │   ENPP4                       float                                                               
│   │   SEMA3F                      float                                                               
│   │   CFTR                        float                                                               
│   │   ANKIB1                      float                                                               
│   │   CYP51A1                     float                                                               
│   │   KRIT1                       float                                                               
│   │   RAD52                       float                                                               
│   └── obs16                    [Feature]                                                           
│       cell_line                   cat[bionty.CellLine.desc…  A-172, A-427, A498, A549, AN3 CA, AsPC-1…
│       cell_name                   cat[bionty.CellLine]       A-172, A-427, A498, A549, AN3 CA, AsPC-1…
│       drug                        cat[wetlab.Compound]       5-Azacytidine, 5-Fluorouracil, Abiratero…
│       drugname_drugconc           cat[wetlab.CompoundPertu…  [('5-Azacytidine', 0.05, 'uM')], [('5-Fl…
│       pass_filter                 cat[ULabel[PassFilter]]    full, minimal
│       phase                       cat[ULabel[Phase]]         G1, G2M, S
│       plate                       cat[ULabel[Plate]]         plate10
│       sample                      cat[wetlab.Biosample]      smp_2359, smp_2360, smp_2361, smp_2362, …
│       gene_count                  int
│       tscp_count                  int
│       mread_count                 int
│       pcnt_mito                   float
│       S_score                     float
│       G2M_score                   float
│       sublibrary                  str
│       BARCODE                     str
└── Labels
    └── .references                 Reference                  Tahoe-100M: A Giga-Scale Single-Cell Per…
        .projects                   Project                    Tahoe-100M                               
        .organisms                  bionty.Organism            human                                    
        .cell_lines                 bionty.CellLine            NCI-H1573, NCI-H460, hTERT-HPNE, SW48, H…
        .compounds                  wetlab.Compound            Acetazolamide, Neratinib, Tazarotene, 5-…
        .compound_perturbations     wetlab.CompoundPerturbat…  [('5-Azacytidine', 0.05, 'uM')], [('Iver…
        .biosamples                 wetlab.Biosample           smp_2430, smp_2365, smp_2360, smp_2369, …
        .ulabels                    ULabel                     tahoe-100, plate10, G1, G2M, S, full, mi…

16 obs metadata features.

# this accessor is deprecated: the warning in the output points to slots[slot].members instead
artifact1.features["obs"].df()
/tmp/ipykernel_3630/2428349911.py:1: FutureWarning: Use slots[slot].members instead of __getitem__, __getitem__ will be removed in the future.
  artifact1.features["obs"].df()
uid name dtype is_type unit description array_rank array_size array_shape proxy_dtype synonyms _expect_many _curation space_id type_id run_id created_at created_by_id _aux _branch_code
id
9 bujDkB4Nd1S5 S_score float None None Inferred S phase score 0 0 None None None True None 1 None 3 2025-02-25 22:31:22.144135+00:00 1 {'af': {'0': None, '1': True}} 1
3 PVpyJhciLdCQ pass_filter cat[ULabel[PassFilter]] None None "Full" filters are more stringent on gene_coun... 0 0 None None None True None 1 None 3 2025-02-25 22:25:30.918235+00:00 1 {'af': {'0': None, '1': True}} 1
7 PZDiL36nJSFv mread_count int None None Number of reads per cell 0 0 None None None True None 1 None 3 2025-02-25 22:30:31.810331+00:00 1 {'af': {'0': None, '1': True}} 1
4 vshELphl73qp cell_line cat[bionty.CellLine.description] None None Cell line information (if applicable) 0 0 None None None True None 1 None 3 2025-02-25 22:27:22.393997+00:00 1 {'af': {'0': None, '1': True}} 1
1 YRSYWdIiesqL plate cat[ULabel[Plate]] None None Plate identifier 0 0 None None None True None 1 None 3 2025-02-25 22:03:51.786985+00:00 1 {'af': {'0': None, '1': True}} 1
19 gQE1h3fIBiSf sample cat[wetlab.Biosample] None None Unique treatment identifier, distinguishes rep... 0 0 None None None True None 1 None 3 2025-02-26 10:59:36.743558+00:00 1 {'af': {'0': None, '1': True}} 1
5 IjSP1lCY3Hyw gene_count int None None Number of genes with at least one count 0 0 None None None True None 1 None 3 2025-02-25 22:30:30.668750+00:00 1 {'af': {'0': None, '1': True}} 1
6 LHUmmYKjIGPl tscp_count int None None Number of transcripts, aka UMI count 0 0 None None None True None 1 None 3 2025-02-25 22:30:31.236532+00:00 1 {'af': {'0': None, '1': True}} 1
18 fLwdFKBUhBY9 drugname_drugconc cat[wetlab.CompoundPerturbation] None None Drug name, concentration, and concentration unit 0 0 None None None True None 1 None 3 2025-02-25 23:04:17.541812+00:00 1 {'af': {'0': None, '1': True}} 1
17 Q0cj2JR5Juwn drug cat[wetlab.Compound] None None Drug name, parsed out from the drugname_drugco... 0 0 None None None True None 1 None 3 2025-02-25 23:02:05.717794+00:00 1 {'af': {'0': None, '1': True}} 1
15 3X4d0QEUuprp sublibrary str None None Sublibrary ID (related to library prep and seq... 0 0 None None None True None 1 None 3 2025-02-25 22:35:14.673178+00:00 1 {'af': {'0': None, '1': True}} 1
16 dQELv2sIVnJX BARCODE str None None Barcode ID 0 0 None None None True None 1 None 3 2025-02-25 22:35:15.627971+00:00 1 {'af': {'0': None, '1': True}} 1
8 X640W5tBUPOQ pcnt_mito float None None Percentage of mitochondrial reads 0 0 None None None True None 1 None 3 2025-02-25 22:31:21.581885+00:00 1 {'af': {'0': None, '1': True}} 1
10 CF0O0e0WZxFz G2M_score float None None Inferred G2M score 0 0 None None None True None 1 None 3 2025-02-25 22:31:22.708895+00:00 1 {'af': {'0': None, '1': True}} 1
2 QboQ1Q1Yxsjn phase cat[ULabel[Phase]] None None Inferred cell cycle phase 0 0 None None None True None 1 None 3 2025-02-25 22:21:56.935262+00:00 1 {'af': {'0': None, '1': True}} 1
11 KPT70T8xJLIt cell_name cat[bionty.CellLine] None None Commonly-used cell name (related to the cell_l... 0 0 None None None True None 1 None 3 2025-02-25 22:32:56.082195+00:00 1 {'af': {'0': None, '1': True}} 1

Query artifacts of interest based on metadata

Since all metadata is registered in the SQL database, we can explore the datasets without downloading or opening the underlying files.

Let’s find which datasets contain A549 cells perturbed with Piroxicam.

# lookup objects give you pythonic access to the values
cell_lines = bt.CellLine.lookup("ontology_id")
drugs = wl.Compound.lookup()

artifacts_a549_piroxicam = artifacts_tahoe.filter(
    cell_lines=cell_lines.cvcl_0023, compounds=drugs.piroxicam
)
artifacts_a549_piroxicam.df()
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux _branch_code
id
1362 56uA9lPPmJ4zLUcr0000 2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WS... None .h5ad dataset AnnData 26536400717 j1FXsX7hs7u+eBqnWnmNHw None 8044908 md5 False False 1 2 3 None True 1 2025-02-25 23:22:17.849980+00:00 1 None 1
1363 omn7JStfJMzy8m6O0000 2025-02-25/h5ad/plate11_filt_Vevo_Tahoe100M_WS... None .h5ad dataset AnnData 23230802756 N2mzoYlMLEl6PdecaYyDvw None 7435869 md5 False False 1 2 3 None True 1 2025-02-25 23:22:18.229629+00:00 1 None 1
1364 S2h2rPLCaUhZAM9u0000 2025-02-25/h5ad/plate12_filt_Vevo_Tahoe100M_WS... None .h5ad dataset AnnData 37495736876 VjAkWVFGVpzAMi9Innusuw None 10487057 md5 False False 1 2 3 None True 1 2025-02-25 23:22:18.600910+00:00 1 None 1

You can download an .h5ad into your local cache:

artifact1.cache()

Or stream it:

artifact1.open()

Open the obs metadata parquet file as a PyArrow Dataset

Open the obs metadata file (2.29 GB) as a PyArrow Dataset.

obs_metadata = ln.Artifact.filter(
    key__endswith="obs_metadata.parquet", projects=project_tahoe
).one()
obs_metadata
Artifact(uid='y1TTR9wbrmZEwpOa0000', is_latest=True, key='2025-02-25/metadata/obs_metadata.parquet', suffix='.parquet', kind='dataset', otype='DataFrame', size=2293981573, hash='qEWOpGw9CmQVzaElyMWT1Q', n_observations=100648790, space_id=1, storage_id=2, run_id=1, created_by_id=1, created_at=2025-02-25 19:33:42 UTC)
obs_metadata_ds = obs_metadata.open()
obs_metadata_ds.schema
plate: string
BARCODE_SUB_LIB_ID: string
sample: string
gene_count: int64
tscp_count: int64
mread_count: int64
drugname_drugconc: string
drug: string
cell_line: dictionary<values=string, indices=int32, ordered=0>
sublibrary: string
BARCODE: string
pcnt_mito: float
S_score: double
G2M_score: double
phase: dictionary<values=string, indices=int32, ordered=0>
pass_filter: dictionary<values=string, indices=int32, ordered=0>
cell_name: dictionary<values=string, indices=int32, ordered=0>
__index_level_0__: int64
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 2487

Find the A549 cells that were perturbed with Piroxicam:

filter_expr = (pc.field("cell_name") == cell_lines.cvcl_0023.name) & (
    pc.field("drug") == drugs.piroxicam.name
)
obs_metadata_df = obs_metadata_ds.scanner(filter=filter_expr).to_table().to_pandas()
obs_metadata_df.value_counts("plate")
plate
plate12    2818
plate10    2812
plate11    2279
Name: count, dtype: int64
obs_metadata_df.head()
plate BARCODE_SUB_LIB_ID sample gene_count tscp_count mread_count drugname_drugconc drug cell_line sublibrary BARCODE pcnt_mito S_score G2M_score phase pass_filter cell_name
29314 plate10 50_030_183-lib_1681 smp_2408 644 863 1024 [('Piroxicam', 0.05, 'uM')] Piroxicam CVCL_0023 lib_1681 50_030_183 0.101970 -0.282297 -0.165568 G1 full A549
29337 plate10 50_035_135-lib_1681 smp_2408 1130 1570 1827 [('Piroxicam', 0.05, 'uM')] Piroxicam CVCL_0023 lib_1681 50_035_135 0.077070 -0.335042 -0.280220 G1 full A549
29338 plate10 50_035_171-lib_1681 smp_2408 1058 1534 1809 [('Piroxicam', 0.05, 'uM')] Piroxicam CVCL_0023 lib_1681 50_035_171 0.124511 -0.402028 -0.404579 G1 full A549
29352 plate10 50_038_157-lib_1681 smp_2408 1265 1883 2240 [('Piroxicam', 0.05, 'uM')] Piroxicam CVCL_0023 lib_1681 50_038_157 0.147106 -0.455343 -0.311355 G1 full A549
29355 plate10 50_039_078-lib_1681 smp_2408 1355 1914 2258 [('Piroxicam', 0.05, 'uM')] Piroxicam CVCL_0023 lib_1681 50_039_078 0.070010 -0.349396 0.186264 G2M full A549

Retrieve the corresponding cells from h5ad files.

# map each plate to the barcodes of the selected cells
plate_cells = obs_metadata_df.groupby("plate")["BARCODE_SUB_LIB_ID"].apply(list)

adatas = []
for artifact in artifacts_a549_piroxicam:
    plate = artifact.features.get_values()["plate"]
    idxs = plate_cells.get(plate)
    print(f"Loading {len(idxs)} cells from plate {plate}")
    with artifact.open() as store:
        adata = store[idxs].to_memory()  # can also subset genes here
        adatas.append(adata)
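
To make the per-plate lookup in the loop concrete, here is a small sketch of the groupby step on a made-up DataFrame (the toy values and the toy_ names are assumptions, not atlas data). It shows what the plate-to-barcodes mapping looks like and that .get() returns None for a plate without selected cells, which is worth guarding against before calling len().

```python
import pandas as pd

# Toy stand-in for the filtered obs metadata (values are made up).
toy_obs = pd.DataFrame(
    {
        "plate": ["plate10", "plate10", "plate11"],
        "BARCODE_SUB_LIB_ID": ["bc1-lib1", "bc2-lib1", "bc3-lib2"],
    }
)

# One list of cell identifiers per plate, indexed by plate name.
toy_plate_cells = toy_obs.groupby("plate")["BARCODE_SUB_LIB_ID"].apply(list)

barcodes_plate10 = toy_plate_cells.get("plate10")
missing = toy_plate_cells.get("plate99")  # no cells selected on this plate

print(barcodes_plate10)
print(missing)
```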

scBaseCamp

project_scbasecamp = ln.Project.get(name="scBaseCamp")
project_scbasecamp
Project(uid='vdK00t9DGwHP', name='scBaseCamp', is_type=False, url='https://arcinstitute.org/tools/virtualcellatlas', space_id=1, created_by_id=1, created_at=2025-02-26 16:04:08 UTC)

This project has 105 collections (21 organisms x 5 count features):

project_scbasecamp.collections.df()
uid key description hash reference reference_type space_id meta_artifact_id version is_latest run_id created_at created_by_id _aux _branch_code
id
87 QyeOMM8Qu2Yc637f0000 scBaseCamp/Velocyto/Schistosoma_mansoni None 7XZzjMBlIJQMqrcOhYFQYQ None None 1 None 2025-02-25 True 10 2025-03-03 11:07:36.194395+00:00 1 None 1
71 rForlsvLjM8zEgbO0000 scBaseCamp/GeneFull_ExonOverIntron/Oryza_sativa None SqNuN0qVtQskeDnAZPRLrQ None None 1 None 2025-02-25 True 10 2025-03-03 11:06:15.137130+00:00 1 None 1
68 wXctL2347aWNGnf90000 scBaseCamp/Gene/Oryza_sativa None LTqCz0GuUi1CnbHM_zi9qw None None 1 None 2025-02-25 True 10 2025-03-03 11:06:00.109765+00:00 1 None 1
51 nJV1L9cV1nev1OmF0000 scBaseCamp/GeneFull_ExonOverIntron/Heterocepha... None T6J_WY2k420oM5BE_I0rpA None None 1 None 2025-02-25 True 10 2025-03-03 11:03:47.412575+00:00 1 None 1
80 nBrtxyYP9yzufHe70000 scBaseCamp/GeneFull_Ex50pAS/Pan_troglodytes None JF1_XDO5EFM13xRBxDCSaQ None None 1 None 2025-02-25 True 10 2025-03-03 11:07:01.132150+00:00 1 None 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
55 BLamUQZhqBTnHG4K0000 scBaseCamp/GeneFull_Ex50pAS/Homo_sapiens None SLBug97gNkMCZ3Gd2Bp1Aw None None 1 None 2025-02-25 True 10 2025-03-03 11:04:28.695376+00:00 1 None 1
27 2wPZaiNxigodW7X60000 scBaseCamp/Velocyto/Danio_rerio None ceCKmkcgKyk_bRHhjGodTQ None None 1 None 2025-02-25 True 10 2025-03-03 11:01:45.771604+00:00 1 None 1
23 kXjTL9XbRysx3A8P0000 scBaseCamp/Gene/Danio_rerio None TOhVCAQMVTRO8VD27SF6WQ None None 1 None 2025-02-25 True 10 2025-03-03 11:01:25.162863+00:00 1 None 1
58 TMcFueJifRSFVrSq0000 scBaseCamp/Gene/Macaca_mulatta None OuNCmFSkmfKiLjvGEbBVKw None None 1 None 2025-02-25 True 10 2025-03-03 11:05:04.524140+00:00 1 None 1
8 ttGkPgXxLDO4sSXF0000 scBaseCamp/Gene/Bos_taurus None jn1Nhcdt0lpB1I3hQ4SgFw None None 1 None 2025-02-25 True 10 2025-03-03 11:00:09.130314+00:00 1 None 1

105 rows × 15 columns

Query artifacts of interest based on metadata

Often you might not want to access all the h5ads in a collection, but rather filter them by metadata:

organisms = bt.Organism.lookup()
tissues = bt.Tissue.lookup()
efos = bt.ExperimentalFactor.lookup()
feature_counts = ln.ULabel.filter(type__name="STARsolo count features").lookup()
h5ads_brain = ln.Artifact.filter(
    suffix=".h5ad",
    projects=project_scbasecamp,
    organisms=organisms.human,
    ulabels=feature_counts.genefull_ex50pas,
    tissues=tissues.brain,
    experimental_factors=efos.single_cell,
    experiments__name__contains="CRISPRi",  # `perturbation` column is registered in `wetlab.Experiment`
).distinct()

h5ads_brain.df()
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux _branch_code
id
104180 1AlmBH0wFzUqosGV0000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 3448668 A0k605SWKyxecLUFjNqS8A None 6164 None False True 1 3 55 None True 10 2025-02-28 16:46:25.771217+00:00 1 None 1
104186 24rg7gDQqP0EQRq30000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 35229865 EA3jW7rwaZhIwtZpLLNCQQ None 7463 None False True 1 3 55 None True 10 2025-02-28 16:46:25.771217+00:00 1 None 1
104204 2vZHojPycv8uPoXp0000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 35133716 Ud5Je3ue2dQcG53leo1nhA None 4709 None False True 1 3 55 None True 10 2025-02-28 16:46:25.771217+00:00 1 None 1
104174 3EbJEIJnCGqnEMUI0000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 5727864 nddvJ0NRE3/rTAfQgyubow None 7376 None False True 1 3 55 None True 10 2025-02-28 16:46:25.771217+00:00 1 None 1
104205 3JlzQ4PcN58pOxM50000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 35877513 elUEIdXpHR1xfltqUYPBgw None 4718 None False True 1 3 55 None True 10 2025-02-28 16:46:25.771217+00:00 1 None 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
104197 Wg6YBPWCwfU4Vr960000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 38354054 JJCCXbqWTaIeV5vJvOllzw None 7627 None False True 1 3 55 None True 10 2025-02-28 16:46:25.771217+00:00 1 None 1
104170 YqiNrGCXc1cM9Dg90000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 5494309 kMbDZo5QMSt3WzLKZjsdCg None 7383 None False True 1 3 55 None True 10 2025-02-28 16:46:25.771217+00:00 1 None 1
104219 zAxkTKnxCUEBAibd0000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 37935375 D/xXUsmFZ14802xqd5cWaw None 7616 None False True 1 3 55 None True 10 2025-02-28 16:46:25.771217+00:00 1 None 1
104206 ZgGYpGntv2sF92Wg0000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 36858036 fUND8GyVTUu3KrDEhmYYLg None 9128 None False True 1 3 55 None True 10 2025-02-28 16:46:25.771217+00:00 1 None 1
104166 ZmSJbhRC4WeK1nyA0000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 40518635 gdcEf34j7wAVvxcUby9UDw None 7114 None False True 1 3 55 None True 10 2025-02-28 16:46:25.771217+00:00 1 None 1

64 rows × 23 columns

Load the h5ad files with obs metadata

Load the h5ads as a single AnnData:

adatas = []
for artifact in h5ads_brain[:5]:  # only load the first 5 artifacts to save CI time
    adatas.append(artifact.load())

# additional sample-level obs metadata is stored in the parquet files
adata_concat = ad.concat(adatas)
adata_concat
/opt/hostedtoolcache/Python/3.12.9/x64/lib/python3.12/site-packages/anndata/_core/anndata.py:1756: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")
AnnData object with n_obs × n_vars = 38206 × 36601
    obs: 'gene_count', 'umi_count', 'SRX_accession'

Open the sample metadata:

sample_meta = ln.Artifact.filter(
    key__endswith="sample_metadata.parquet",
    projects=project_scbasecamp,
    organisms=organisms.human,
    ulabels=feature_counts.genefull_ex50pas,
).one()

sample_meta
Artifact(uid='WCHkcyWN8L6pDI4E0000', is_latest=True, key='2025-02-25/metadata/GeneFull_Ex50pAS/Homo_sapiens/sample_metadata.parquet', suffix='.parquet', kind='dataset', otype='DataFrame', size=531878, hash='4QrqW8DQVRl6bKNYiJhq3g', n_observations=16077, space_id=1, storage_id=3, run_id=2, created_by_id=1, created_at=2025-02-25 20:41:32 UTC)
sample_meta_dataset = sample_meta.open()
sample_meta_dataset.schema
entrez_id: int64
srx_accession: string
file_path: string
obs_count: int64
lib_prep: string
tech_10x: string
cell_prep: string
organism: string
tissue: string
disease: string
perturbation: string
cell_line: string
czi_collection_id: string
czi_collection_name: string
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 1755

Fetch corresponding sample metadata:

filter_expr = pc.field("srx_accession").isin(
    adata_concat.obs["SRX_accession"].astype(str)
)
df = sample_meta_dataset.scanner(filter=filter_expr).to_table().to_pandas()

Add the sample metadata to the AnnData:

adata_concat.obs = adata_concat.obs.merge(
    df, left_on="SRX_accession", right_on="srx_accession"
)
adata_concat
AnnData object with n_obs × n_vars = 38206 × 36601
    obs: 'gene_count', 'umi_count', 'SRX_accession', 'entrez_id', 'srx_accession', 'file_path', 'obs_count', 'lib_prep', 'tech_10x', 'cell_prep', 'organism', 'tissue', 'disease', 'perturbation', 'cell_line', 'czi_collection_id', 'czi_collection_name'
adata_concat.obs.head()
gene_count umi_count SRX_accession entrez_id srx_accession file_path obs_count lib_prep tech_10x cell_prep organism tissue disease perturbation cell_line czi_collection_id czi_collection_name
0 2748 5134.0 SRX10606628 14083632 SRX10606628 gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... 7641 10x_Genomics 3_prime_gex single_cell Homo sapiens brain Down syndrome CRISPR/Cas9, CRISPRi, or small-molecule inhibi... DS1 None None
1 2351 4639.0 SRX10606628 14083632 SRX10606628 gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... 7641 10x_Genomics 3_prime_gex single_cell Homo sapiens brain Down syndrome CRISPR/Cas9, CRISPRi, or small-molecule inhibi... DS1 None None
2 2184 4293.0 SRX10606628 14083632 SRX10606628 gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... 7641 10x_Genomics 3_prime_gex single_cell Homo sapiens brain Down syndrome CRISPR/Cas9, CRISPRi, or small-molecule inhibi... DS1 None None
3 2469 5307.0 SRX10606628 14083632 SRX10606628 gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... 7641 10x_Genomics 3_prime_gex single_cell Homo sapiens brain Down syndrome CRISPR/Cas9, CRISPRi, or small-molecule inhibi... DS1 None None
4 4144 9340.0 SRX10606628 14083632 SRX10606628 gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... 7641 10x_Genomics 3_prime_gex single_cell Homo sapiens brain Down syndrome CRISPR/Cas9, CRISPRi, or small-molecule inhibi... DS1 None None
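
Note that the obs index above is 0, 1, 2, …: DataFrame.merge returns a fresh integer index, and its default inner join drops rows without a match. If the original obs names should be kept, a left merge followed by restoring the index is one option. The sketch below uses made-up data and hypothetical toy_ names; validate="many_to_one" is added as a guard against duplicated accessions on the sample side.

```python
import pandas as pd

# Toy per-cell obs table with meaningful index labels (values are made up).
toy_obs = pd.DataFrame(
    {"gene_count": [10, 20, 30], "SRX_accession": ["SRX1", "SRX1", "SRX2"]},
    index=["cellA", "cellB", "cellC"],
)
# Toy sample metadata that lacks an entry for SRX2.
toy_samples = pd.DataFrame({"srx_accession": ["SRX1"], "tissue": ["brain"]})

# Default (inner) merge: the unmatched cell is dropped and the index is reset.
inner = toy_obs.merge(
    toy_samples, left_on="SRX_accession", right_on="srx_accession"
)

# Left merge: every cell is kept in its original order, so the index can be restored.
left = toy_obs.merge(
    toy_samples,
    how="left",
    left_on="SRX_accession",
    right_on="srx_accession",
    validate="many_to_one",
)
left.index = toy_obs.index

print(inner.index.tolist())
print(left.index.tolist())
```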