lamindb.Schema¶
- class lamindb.Schema(features: Iterable[Record] | None = None, components: dict[str, Schema] | None = None, name: str | None = None, description: str | None = None, dtype: str | None = None, itype: str | Registry | FieldAttr | None = None, type: Schema | None = None, is_type: bool = False, otype: str | None = None, minimal_set: bool = True, ordered_set: bool = False, maximal_set: bool = False, slot: str | None = None, coerce_dtype: bool = False)¶
Bases: Record, CanCurate, TracksRun
Schemas.
The simplest schema is a feature set such as the set of columns of a DataFrame.
A composite schema has multiple components, e.g., for an AnnData, one schema for obs and another one for var.
- Parameters:
features – Iterable[Record] | None = None
An iterable of Feature records to hash, e.g., [Feature(...), Feature(...)]. Is turned into a set upon instantiation. If you’d like to pass values, use from_values() or from_df().
components – dict[str, Schema] | None = None
A dictionary mapping component names to their corresponding Schema objects for composite schemas.
name – str | None = None
A name.
description – str | None = None
A description.
dtype – str | None = None
The simple type. Defaults to None for sets of Feature records. Otherwise defaults to "num" (e.g., for sets of Gene).
itype – str | None = None
The feature identifier type (e.g. Feature, Gene, …).
type – Schema | None = None
A type.
is_type – bool = False
Distinguish types from instances of the type.
otype – str | None = None
An object type to define the structure of a composite schema.
minimal_set – bool = True
Whether the schema contains a minimal set of linked features.
ordered_set – bool = False
Whether features are required to be ordered.
maximal_set – bool = False
If True, no additional features are allowed.
slot – str | None = None
The slot name when this schema is used as a component in a composite schema.
coerce_dtype – bool = False
When True, attempts to coerce values to the specified dtype during validation, see coerce_dtype.
Why does LaminDB model schemas, not just features?
Performance: Imagine you measure the same panel of 20k transcripts in 1M samples. By modeling the panel as a feature set, you can link all your artifacts against one feature set and only need to store 1M instead of 1M x 20k = 20B links.
Interpretation: Model protein panels, gene panels, etc.
Data integration: Feature sets provide the information that determines whether two datasets can be meaningfully concatenated.
These reasons do not hold for label sets. Hence, LaminDB does not model label sets.
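The performance argument above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the variable names are illustrative, and counting the one-time links from the feature set to its own features is an assumption that the text above does not spell out.

```python
# Back-of-the-envelope link counts for the panel example above.
n_samples = 1_000_000   # artifacts, one per sample
n_features = 20_000     # transcripts in the shared panel

# Linking every artifact to every feature individually.
links_per_feature = n_samples * n_features

# Linking every artifact to one shared feature set, plus (assumption)
# the one-time links from the feature set to its features.
links_via_feature_set = n_samples + n_features

print(f"{links_per_feature:,}")      # 20,000,000,000
print(f"{links_via_feature_set:,}")  # 1,020,000
```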
Note
A feature set can be identified by the hash of its feature uids. It’s stored in the .hash field.
A slot provides a string key to access feature sets. For instance, for the schema of an AnnData object, it would be 'obs' for adata.obs.
See also
from_values() – Create from values.
from_df() – Create from dataframe columns.
Examples
Create a schema (feature set) from df with types:
>>> import lamindb as ln
>>> import pandas as pd
>>> df = pd.DataFrame({"feat1": [1, 2], "feat2": [3.1, 4.2], "feat3": ["cond1", "cond2"]})
>>> schema = ln.Schema.from_df(df)
Create a schema (feature set) from features:
>>> features = [ln.Feature(name=feat, dtype="float").save() for feat in ["feat1", "feat2"]]
>>> schema = ln.Schema(features)
Create a schema (feature set) from identifier values:
>>> import bionty as bt
>>> schema = ln.Schema.from_values(adata.var["ensembl_id"], bt.Gene.ensembl_gene_id, organism="mouse").save()
Attributes¶
- property coerce_dtype: bool¶
Whether dtypes should be coerced during validation.
For example, an object-dtyped pandas column can be coerced to categorical and would pass validation if this is true.
- property slots: dict[str, Schema]¶
Slots.
Examples:
# define composite schema
anndata_schema = ln.Schema(
    name="small_dataset1_anndata_schema",
    otype="AnnData",
    components={"obs": obs_schema, "var": var_schema},
).save()
# access slots
anndata_schema.slots
# {'obs': <Schema: obs_schema>, 'var': <Schema: var_schema>}
Simple fields¶
- uid: str¶
A universal id (hash of the set of feature values).
- name: str | None¶
A name.
- description: str | None¶
A description.
- n¶
Number of features in the set.
- dtype: str | None¶
Data type, e.g., “num”, “float”, “int”. Is None for Feature.
For Feature, types are expected to be heterogeneous and defined on a per-feature level.
- itype: str | None¶
A registry that stores feature identifiers used in this schema, e.g., 'Feature' or 'bionty.Gene'.
Depending on the registry, .members stores, e.g., Feature or bionty.Gene records.
Changed in version 1.0.0: Was called registry before.
- is_type: bool¶
Distinguish types from instances of the type.
- otype: str | None¶
Default Python object type, e.g., DataFrame, AnnData.
- hash: str | None¶
A hash of the set of feature identifiers.
For a composite schema, the hash of hashes.
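One way to picture such a hash is the sketch below. It is not LaminDB's actual hashing scheme: the choice of SHA-256, the canonical sorting, the helper names, and the uid strings are all assumptions made for illustration. It only shows how a hash over a set can ignore order and duplicates, and how a composite can be hashed from the hashes of its components.

```python
import hashlib


def hash_identifiers(identifiers):
    """Hash a collection of identifiers so that order and duplicates do not matter."""
    canonical = "\n".join(sorted(set(identifiers)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def hash_composite(component_hashes):
    """Hash of hashes: combine the hashes of the component schemas.

    Assumption: slot names are ignored, only the component hashes count.
    """
    return hash_identifiers(component_hashes.values())


# Hypothetical feature uids, for illustration only.
obs_hash = hash_identifiers(["uid_cell_type", "uid_donor"])
var_hash = hash_identifiers(["uid_gene_1", "uid_gene_2"])
composite_hash = hash_composite({"obs": obs_hash, "var": var_hash})
```

Because the identifiers are deduplicated and sorted before hashing, two feature sets with the same members get the same hash regardless of how they were assembled.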
- minimal_set: bool¶
Whether the schema contains a minimal set of linked features (default True).
If False, no features are linked to this schema.
If True, features are linked and considered as a minimally required set in validation.
- ordered_set: bool¶
Whether features are required to be ordered (default False).
- maximal_set: bool¶
If False, additional features are allowed (default False).
If True, the minimal set is a maximal set and no additional features are allowed.
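The interplay of minimal_set, ordered_set, and maximal_set can be sketched in plain Python. The function below is a hypothetical helper, not part of LaminDB, and it reflects one reading of the field descriptions above: it checks column names against the schema's features and reports what would fail.

```python
def check_columns(columns, required, *, minimal_set=True, ordered_set=False, maximal_set=False):
    """Return a list of problems; an empty list means the columns pass."""
    columns = list(columns)
    required = list(required)
    problems = []
    if minimal_set:
        # Every feature of the schema has to be present.
        missing = [f for f in required if f not in columns]
        if missing:
            problems.append(f"missing features: {missing}")
    if maximal_set:
        # Nothing beyond the schema's features is allowed.
        extra = [c for c in columns if c not in required]
        if extra:
            problems.append(f"unexpected features: {extra}")
    if ordered_set:
        # The shared features have to appear in the schema's order.
        present = [c for c in columns if c in required]
        expected = [f for f in required if f in columns]
        if present != expected:
            problems.append("features are out of order")
    return problems


print(check_columns(["a", "b", "c"], ["a", "b"]))                    # []
print(check_columns(["a", "b", "c"], ["a", "b"], maximal_set=True))  # ["unexpected features: ['c']"]
```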
- slot: str | None¶
The slot name when this schema is used as a component in a composite schema.
- created_at: datetime¶
Time of creation of record.
Relational fields¶
- type: Schema | None¶
Type of schema.
Allows grouping schemas by type, e.g., all measurements evaluating gene expression vs. protein expression vs. multi-modal.
You can define types via ln.Schema(name="ProteinPanel", is_type=True).
Here are a few more examples for type names: 'ExpressionPanel', 'ProteinPanel', 'Multimodal', 'Metadata', 'Embedding'.
- params: Param¶
The params contained in the schema.
- composites: Schema¶
The composite schemas that contain this schema as a component.
For example, an AnnData composes multiple schemas: var[DataFrameT], obs[DataFrame], obsm[Array], uns[dict], etc.
Class methods¶
- classmethod df(include=None, features=False, limit=100)¶
Convert to pd.DataFrame.
By default, shows all direct fields, except updated_at.
Use arguments include or features to include other data.
- Parameters:
include (str | list[str] | None, default: None) – Related fields to include as columns. Takes strings of form "ulabels__name", "cell_types__name", etc. or a list of such strings.
features (bool | list[str], default: False) – If True, map all features of the Feature registry onto the resulting DataFrame. Only available for Artifact.
limit (int, default: 100) – Maximum number of rows to display from a Pandas DataFrame. Defaults to 100 to reduce database load.
- Return type:
DataFrame
Examples
Include the name of the creator in the DataFrame:
>>> ln.ULabel.df(include="created_by__name")
Include display of features for Artifact:
>>> df = ln.Artifact.df(features=True)
>>> ln.view(df)  # visualize with type annotations
Only include select features:
>>> df = ln.Artifact.df(features=["cell_type_by_expert", "cell_type_by_model"])
- classmethod filter(*queries, **expressions)¶
Query records.
- Parameters:
queries – One or multiple Q objects.
expressions – Fields and values passed as Django query expressions.
- Return type:
QuerySet
- Returns:
A QuerySet.
See also
Guide: Query & search registries
Django documentation: Queries
Examples
>>> ln.ULabel(name="my label").save()
>>> ln.ULabel.filter(name__startswith="my").df()
- classmethod from_df(df, field=FieldAttr(Feature.name), name=None, mute=False, organism=None, source=None)¶
Create feature set for validated features.
- Return type:
Schema | None
- classmethod from_values(values, field=FieldAttr(Feature.name), type=None, name=None, mute=False, organism=None, source=None, raise_validation_error=True)¶
Create feature set for validated features.
- Parameters:
values (list[str] | Series | array) – A list of values, like feature names or ids.
field (DeferredAttribute, default: FieldAttr(Feature.name)) – The field of a reference registry to map values.
type (str | None, default: None) – The simple type. Defaults to None if reference registry is Feature, defaults to "float" otherwise.
name (str | None, default: None) – A name.
organism (Record | str | None, default: None) – An organism to resolve gene mapping.
source (Record | None, default: None) – A public ontology to resolve feature identifier mapping.
raise_validation_error (bool, default: True) – Whether to raise a validation error if some values are not valid.
- Raises:
ValidationError – If some values are not valid.
- Return type:
Examples
>>> features = [ln.Feature(name=feat, dtype="str").save() for feat in ["feat11", "feat21"]]
>>> schema = ln.Schema.from_values(features)
>>> genes = ["ENSG00000139618", "ENSG00000198786"]
>>> schema = ln.Schema.from_values(genes, bt.Gene.ensembl_gene_id, "float")
- classmethod get(idlike=None, **expressions)¶
Get a single record.
- Parameters:
idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
expressions – Fields and values passed as Django query expressions.
- Return type:
- Returns:
A record.
- Raises:
lamindb.errors.DoesNotExist – In case no matching record is found.
See also
Guide: Query & search registries
Django documentation: Queries
Examples
>>> ulabel = ln.ULabel.get("FvtpPJLJ")
>>> ulabel = ln.ULabel.get(name="my-label")
- classmethod inspect(values, field=None, *, mute=False, organism=None, source=None, strict_source=False)¶
Inspect if values are mappable to a field.
Being mappable means that an exact match exists.
- Parameters:
values (list[str] | Series | array) – Values that will be checked against the field.
field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontologies field names.
mute (bool, default: False) – Whether to mute logging.
organism (str | Record | None, default: None) – An Organism name or record.
source (Record | None, default: None) – A bionty.Source record that specifies the version to inspect against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry. If False, validation will include all records in the registry, ignoring the specified source. If True, validation will only include records in the registry that are linked to the specified source. Note: this parameter won’t affect validation against bionty/public sources.
- Return type:
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> ln.save(bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol"))
>>> gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
>>> result = bt.Gene.inspect(gene_symbols, field=bt.Gene.symbol)
>>> result.validated
['A1CF', 'A1BG']
>>> result.non_validated
['FANCD1', 'FANCD20']
- classmethod lookup(field=None, return_field=None)¶
Return an auto-complete object for a field.
- Parameters:
field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.
return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.
- Return type:
NamedTuple
- Returns:
A NamedTuple of lookup information of the field values with a dictionary converter.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> bt.Gene.from_source(symbol="ADGB-DT").save()
>>> lookup = bt.Gene.lookup()
>>> lookup.adgb_dt
>>> lookup_dict = lookup.dict()
>>> lookup_dict['ADGB-DT']
>>> lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
>>> lookup_by_ensembl_id.ensg00000002745
>>> lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
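How a NamedTuple can serve as an auto-complete object with a dictionary converter is easier to see in a standalone sketch. The helper below is hypothetical and does not reproduce LaminDB's implementation: records are stand-in dictionaries, and the key normalization rule (lowercase, non-word characters replaced by underscores) is an assumption inferred from the adgb_dt example above.

```python
import keyword
import re
from collections import namedtuple


def make_lookup(records, field):
    """Build a namedtuple whose attributes are normalized values of `field`."""
    by_attr, by_value = {}, {}
    for record in records:
        value = record[field]
        # "ADGB-DT" -> "adgb_dt": lowercase, non-word characters become underscores.
        attr = re.sub(r"\W+", "_", value).strip("_").lower()
        if not attr.isidentifier() or keyword.iskeyword(attr):
            attr = f"x_{attr}"  # keep the attribute name a valid identifier
        by_attr.setdefault(attr, record)
        by_value.setdefault(value, record)

    class Lookup(namedtuple("Lookup", list(by_attr))):
        __slots__ = ()

        def dict(self):
            """Dictionary converter keyed by the original, unnormalized values."""
            return by_value.copy()

    return Lookup(**by_attr)


# Stand-in records, for illustration only.
genes = [{"symbol": "ADGB-DT"}, {"symbol": "A1BG"}]
lookup = make_lookup(genes, "symbol")
print(lookup.adgb_dt)            # {'symbol': 'ADGB-DT'}
print(lookup.dict()["ADGB-DT"])  # {'symbol': 'ADGB-DT'}
```

Attribute access is convenient for tab completion in an interactive session, while the dictionary keeps values reachable that are not valid Python identifiers.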
- classmethod search(string, *, field=None, limit=20, case_sensitive=False)¶
Search.
- Parameters:
string (str) – The input string to match against the field ontology values.
field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.
limit (int | None, default: 20) – Maximum number of top results to return.
case_sensitive (bool, default: False) – Whether the match is case sensitive.
- Return type:
- Returns:
A sorted DataFrame of search results with a score in column score. If return_queryset is True, a QuerySet.
Examples
>>> ulabels = ln.ULabel.from_values(["ULabel1", "ULabel2", "ULabel3"], field="name")
>>> ln.save(ulabels)
>>> ln.ULabel.search("ULabel2")
- classmethod standardize(values, field=None, *, return_field=None, return_mapper=False, case_sensitive=False, mute=False, public_aware=True, keep='first', synonyms_field='synonyms', organism=None, source=None, strict_source=False)¶
Maps input synonyms to standardized names.
- Parameters:
values (Iterable) – Identifiers that will be standardized.
field (str | DeferredAttribute | None, default: None) – The field representing the standardized names.
return_field (str | DeferredAttribute | None, default: None) – The field to return. Defaults to field.
return_mapper (bool, default: False) – If True, returns {input_value: standardized_name}.
case_sensitive (bool, default: False) – Whether the mapping is case sensitive.
mute (bool, default: False) – Whether to mute logging.
public_aware (bool, default: True) – Whether to standardize from Bionty reference. Defaults to True for Bionty registries.
keep (Literal['first', 'last', False], default: 'first') – When a synonym maps to multiple names, determines which duplicates to mark as pd.DataFrame.duplicated:
- "first": returns the first mapped standardized name
- "last": returns the last mapped standardized name
- False: returns all mapped standardized names.
When keep is False, the returned list of standardized names will contain nested lists in case of duplicates.
When a field is converted into return_field, keep marks which matches to keep when multiple return_field values map to the same field value.
synonyms_field (str, default: 'synonyms') – A field containing the concatenated synonyms.
organism (str | Record | None, default: None) – An Organism name or record.
source (Record | None, default: None) – A bionty.Source record that specifies the version to validate against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry. If False, validation will include all records in the registry, ignoring the specified source. If True, validation will only include records in the registry that are linked to the specified source. Note: this parameter won’t affect validation against bionty/public sources.
- Return type:
list[str] | dict[str, str]
- Returns:
If return_mapper is False, a list of standardized names. Otherwise, a dictionary of mapped values with mappable synonyms as keys and standardized names as values.
See also
add_synonym() – Add synonyms.
remove_synonym() – Remove synonyms.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> ln.save(bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol"))
>>> gene_synonyms = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
>>> standardized_names = bt.Gene.standardize(gene_synonyms)
>>> standardized_names
['A1CF', 'A1BG', 'BRCA2', 'FANCD20']
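The mapping from synonyms to standardized names can be illustrated without a database. The function below is a simplified, hypothetical stand-in for standardize(): it assumes records are dictionaries with a name and a pipe-delimited synonyms string, it always keeps the first match (as with keep='first'), and it ignores public sources, organisms, and return_field.

```python
def standardize(values, records, *, return_mapper=False, case_sensitive=False):
    """Map synonyms to standardized names; unknown values pass through unchanged."""

    def norm(text):
        return text if case_sensitive else text.lower()

    names = {norm(r["name"]): r["name"] for r in records}
    synonyms = {}
    for r in records:
        for synonym in r["synonyms"].split("|"):
            if synonym:
                # setdefault keeps the first record that claims a synonym
                synonyms.setdefault(norm(synonym), r["name"])

    mapper = {}
    for value in values:
        key = norm(value)
        target = names.get(key) or synonyms.get(key)
        if target is not None and target != value:
            mapper[value] = target
    if return_mapper:
        return mapper
    return [mapper.get(value, value) for value in values]


# Stand-in registry content, for illustration only.
records = [
    {"name": "A1CF", "synonyms": ""},
    {"name": "A1BG", "synonyms": ""},
    {"name": "BRCA2", "synonyms": "FANCD1"},
]
gene_synonyms = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
print(standardize(gene_synonyms, records))
# ['A1CF', 'A1BG', 'BRCA2', 'FANCD20']
print(standardize(gene_synonyms, records, return_mapper=True))
# {'FANCD1': 'BRCA2'}
```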
- classmethod using(instance)¶
Use a non-default LaminDB instance.
- Parameters:
instance (str | None) – An instance identifier of form “account_handle/instance_name”.
- Return type:
Examples
>>> ln.ULabel.using("account_handle/instance_name").search("ULabel7", field="name")
              uid  score
name
ULabel7  g7Hk9b2v  100.0
ULabel5  t4Jm6s0q   75.0
ULabel6  r2Xw8p1z   75.0
- classmethod validate(values, field=None, *, mute=False, organism=None, source=None, strict_source=False)¶
Validate values against existing values of a string field.
Note that this is strict validation: it only asserts exact matches.
- Parameters:
values (list[str] | Series | array) – Values that will be validated against the field.
field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontologies field names.
mute (bool, default: False) – Whether to mute logging.
organism (str | Record | None, default: None) – An Organism name or record.
source (Record | None, default: None) – A bionty.Source record that specifies the version to validate against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry. If False, validation will include all records in the registry, ignoring the specified source. If True, validation will only include records in the registry that are linked to the specified source. Note: this parameter won’t affect validation against bionty/public sources.
- Return type:
ndarray
- Returns:
A vector of booleans indicating if an element is validated.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> ln.save(bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol"))
>>> gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
>>> bt.Gene.validate(gene_symbols, field=bt.Gene.symbol)
array([ True, True, False, False])
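The effect of strict_source on exact-match validation can be sketched as follows. This is a simplified illustration under stated assumptions, not LaminDB's implementation: the registry is a list of (name, source) tuples, the source labels are made up, and the result is a plain list of booleans rather than an ndarray.

```python
def validate(values, records, *, source=None, strict_source=False):
    """Exact-match validation against a stand-in registry of (name, source) tuples."""
    if strict_source and source is not None:
        # Only records linked to the given source count as valid.
        allowed = {name for name, record_source in records if record_source == source}
    else:
        # All records in the registry count, whatever their source.
        allowed = {name for name, _ in records}
    return [value in allowed for value in values]


# Hypothetical registry content and source labels.
registry = [
    ("A1CF", "source-a"),
    ("A1BG", "source-a"),
    ("BRCA2", "source-b"),
]
symbols = ["A1CF", "BRCA2", "FANCD1"]
print(validate(symbols, registry))
# [True, True, False]
print(validate(symbols, registry, source="source-a", strict_source=True))
# [True, False, False]
```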
Methods¶
- add_synonym(synonym, force=False, save=None)¶
Add synonyms to a record.
- Parameters:
synonym (str | list[str] | Series | array) – The synonyms to add to the record.
force (bool, default: False) – Whether to add synonyms even if they are already synonyms of other records.
save (bool | None, default: None) – Whether to save the record to the database.
See also
remove_synonym() – Remove synonyms.
Examples
>>> import bionty as bt
>>> bt.CellType.from_source(name="T cell").save()
>>> lookup = bt.CellType.lookup()
>>> record = lookup.t_cell
>>> record.synonyms
'T-cell|T lymphocyte|T-lymphocyte'
>>> record.add_synonym("T cells")
>>> record.synonyms
'T cells|T-cell|T-lymphocyte|T lymphocyte'
- delete()¶
Delete.
- Return type:
None
- describe(return_str=False)¶
Describe schema.
- Return type:
None | str
- remove_synonym(synonym)¶
Remove synonyms from a record.
- Parameters:
synonym (str | list[str] | Series | array) – The synonym values to remove.
See also
add_synonym() – Add synonyms.
Examples
>>> import bionty as bt
>>> bt.CellType.from_source(name="T cell").save()
>>> lookup = bt.CellType.lookup()
>>> record = lookup.t_cell
>>> record.synonyms
'T-cell|T lymphocyte|T-lymphocyte'
>>> record.remove_synonym("T-cell")
>>> record.synonyms
'T lymphocyte|T-lymphocyte'
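In the examples for add_synonym() and remove_synonym(), the synonyms field is a single pipe-delimited string. The helpers below are a hypothetical sketch of editing such a string in plain Python. They are not the library's implementation, they do not check other records (the force argument), and the real methods may store the synonyms in a different order.

```python
def add_synonym(synonyms, new):
    """Return the pipe-delimited string with the new synonym(s) appended if absent."""
    current = [s for s in (synonyms or "").split("|") if s]
    additions = [new] if isinstance(new, str) else list(new)
    for synonym in additions:
        if "|" in synonym:
            raise ValueError("a synonym must not contain the '|' separator")
        if synonym not in current:
            current.append(synonym)
    return "|".join(current)


def remove_synonym(synonyms, old):
    """Return the pipe-delimited string without the given synonym(s)."""
    removals = {old} if isinstance(old, str) else set(old)
    kept = [s for s in (synonyms or "").split("|") if s and s not in removals]
    return "|".join(kept)


synonyms = "T-cell|T lymphocyte|T-lymphocyte"
print(add_synonym(synonyms, "T cells"))    # T-cell|T lymphocyte|T-lymphocyte|T cells
print(remove_synonym(synonyms, "T-cell"))  # T lymphocyte|T-lymphocyte
```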
- set_abbr(value)¶
Set value for abbr field and add to synonyms.
- Parameters:
value (str) – A value for an abbreviation.
Examples
>>> import bionty as bt
>>> bt.ExperimentalFactor.from_source(name="single-cell RNA sequencing").save()
>>> scrna = bt.ExperimentalFactor.get(name="single-cell RNA sequencing")
>>> scrna.abbr
None
>>> scrna.synonyms
'single-cell RNA-seq|single-cell transcriptome sequencing|scRNA-seq|single cell RNA sequencing'
>>> scrna.set_abbr("scRNA")
>>> scrna.abbr
'scRNA'
>>> scrna.synonyms
'scRNA|single-cell RNA-seq|single cell RNA sequencing|single-cell transcriptome sequencing|scRNA-seq'
>>> scrna.save()