Models
A model consists of code that populates a single table or view, along with metadata properties that define how your users can interact with the model.
Models define a data processing step and frequently represent some bit of business logic. They can be written in SQL or Python. Models can reference each other, and SBT will automatically build your complete data processing pipeline by parsing the model code for dependencies. A bundle of models powers an app on the Insight Cloud platform.
Models all live in the models/
directory of your SBT project. A model consists of two parts:
- A definition - a python or sql file that contains the code for that model. The name of the file will become the unique identifier for the model.
- A configuration - a yaml file that contains metadata for the model. Must have a named config matching the model name from the definition file.
Example
Let's look at a simple example to illustrate how models work.
First we have the directory structure of the project:
sbt-project/
├── sources/
│ └── sources.yml
└── models/
├── my_first_model.sql
└── configs.yml
The model definition is a SQL file. It uses standard SQL plus some special syntax to access the Seek Run Context object. We can access sources, variables, and other model outputs using the seek
object. All of these variables are compiled at runtime based on the user's request. In this example, we are reading from an expected source table and adding a user-defined parameter to the output.
select
src.*,
'{seek.params[TEST_VAR]}' as user_param_value
from {seek.sources[vendor.table]} src
Next let's look at the model config file. It contains a yaml block like the below, which gives us the config for the model my_first_model
. The variables
key is a list of variables that users will be able to modify when running your app. The values_query
key is a query that returns the possible values for the variable - this is useful for powering dropdowns in Insight Cloud.
models:
- name: my_first_model
config:
language: sql
materialized: table
type: output
variables:
- name: TEST_VAR
display_name: Test Variable
type: VARCHAR
default: No test var set
values_query: values_test
Model Configs
The model config is contained in a yaml file. The filename doesn't matter, you can create multiple yaml files in nested directories if that suits your needs, or you can have one centralized yaml that holds all your model configs. SBT will search all yaml files for models:
keys to find a config that matches the model definition.:
Model Object Fields
Field | Description | Available Values |
---|---|---|
name | The name of the model, must be unique. Used as the name of tables created by this model. | string, must be unique |
config.packages | A list of python packages that need to be installed in the app environment to run this model. | list of strings, typical pip version pinning is supported |
config.materialized | The type of db object to create in snowflake | table or view |
config.language | The language of the model. | python or sql |
config.type | The type of model. | output or mapping |
config.variables | A list of model variables that can be set when running the model. | list of objects |
config.py_modules | A list of python wheel files that need to be installed in the snowflake environment to run this model. These should be located in the modules directory to be included with model deployments |
list of strings |
Model Variable Fields
Field | Description | Available Values |
---|---|---|
name | The name of the variable | string |
display_name | The name of the variable as it will be displayed to users | string |
type | The snowflake data type of the variable | string |
default | The default value of the variable | string |
values_query | A query that returns the possible values for the variable | string |
models:
- name: python_test
config:
packages:
- "snowflake-snowpark-python"
- "numpy"
- "scikit-learn"
- "pandas"
materialized: table
language: python
type: output
variables:
- name: TEST_VAR
display_name: Test Variable
type: VARCHAR
default: No test var set
values_query: values_test
py_modules:
- seek_cpg.whl
Seek Run Context
All of the magic of SBT happens via the Seek Run Context object. Models have access to an object called seek
which should be used to interact with the Seek platform. The seek object has properties for accessing sources, params, other models, and other current run context.
class SeekRunCtx(BaseModel):
model_name: str
target_name: str
current_db: str
params: dict
sources: dict
models: dict
Keep reading to see how to use the seek
object in your models.
Model Defintion
SQL Models
SQL models are select statements stored in a .sql
file. The SQL query defined in the file will eventually be wrapped in CREATE TABLE/VIEW AS SELECT statements. Wrap any referenes to the SeekRunContext
in single curly braces - remember this is going to end up getting formatted in python.
Note
For SQL models only - don't wrap the name of the source/model/param you are accessing in quotes. The below example is correctly not quoting atlas_luminate.atlas_itemlist_custom
SELECT * EXCLUDE EPOCH
RENAME "ITEM NBR" AS item_nbr
FROM {seek.sources[atlas_luminate.atlas_itemlist_custom]}
Python Models
Python models are regular python .py
files. They must contain at least one function called model
which takes two arguments, session
and seek
. The session
arg is a snowpark session used for interacting with snowflake. The seek
arg is the SeekRunContext object. The model
function should return a dataframe - could be snowpark or pandas. There can be as many helper functions as you like in the file, but the model
function is required.
Below is an example python model definition. This model reads from a source, does some transformations, and writes to an output table.
from snowflake.snowpark.functions import lit
def model(session, seek):
# Accessing seek sources works
base_table = session.table(seek.sources["atlas.view_items"])
pivot_values = base_table.select("Brand Desc").distinct()
# Accessing defined params works
var_df = pivot_values.with_column("vars", lit(seek.params["TEST_VAR"]))
# Accessing seek object properties works
other_df = var_df.with_column("database", lit(seek.model_name))
# Casting to pandas works
pandas_df = other_df.to_pandas()
# Return table gets written to output table
snowpark_df = session.createDataFrame(pandas_df)
return snowpark_df