Examples
Relational schema inference
This tutorial covers the basic features of extracting relational metadata from a directory of files. The same process would be followed for a data lake / data warehouse.
- Log in to the Kurve demo.
- Look under Sample data sources for `/usr/local/lake/cust_data` and click Create Graph.
- In a few seconds the page should refresh and you can view the graph by clicking View Graph.
- View all of the foreign key relationships between the 6 tables:
  - `cust.csv` connects with `orders.csv` on `id -> customer_id`
  - `cust.csv` connects with `notifications.csv` on `id -> customer_id`
  - `orders.csv` connects with `order_products.csv` on `id -> order_id`
  - `notifications.csv` connects with `notification_interactions.csv` on `id -> notification_id`
  - `notification_interactions.csv` connects with `notification_interaction_types.csv` on `interaction_type_id -> id`
In this example Kurve points at the 6 CSV files, runs an inference algorithm, and extracts all of the relational metadata. Next we'll look at how to run computation on these relational metadata graphs.
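One common way tools infer foreign keys is by checking inclusion dependencies: a column is a candidate foreign key if all of its values appear in a unique, primary-key-like column of another table. The sketch below illustrates that heuristic on toy data; it is an assumption for illustration, not Kurve's actual algorithm, and the table contents are made up.

```python
# Toy tables: column name -> list of values.
tables = {
    "cust.csv": {"id": [1, 2, 3, 4], "name": ["Ann", "Bob", "Cy", "Dee"]},
    "orders.csv": {"order_id": [10, 11, 12], "customer_id": [1, 1, 3]},
}

def is_unique(values):
    return len(values) == len(set(values))

def infer_foreign_keys(tables):
    """Return (parent table, pk column, child table, fk column) candidates."""
    edges = []
    for child, child_cols in tables.items():
        for fk, fk_vals in child_cols.items():
            for parent, parent_cols in tables.items():
                if parent == child:
                    continue
                for pk, pk_vals in parent_cols.items():
                    # Inclusion dependency: fk values are a subset of a
                    # unique (primary-key-like) column's values.
                    if is_unique(pk_vals) and set(fk_vals) <= set(pk_vals):
                        edges.append((parent, pk, child, fk))
    return edges

print(infer_foreign_keys(tables))
# [('cust.csv', 'id', 'orders.csv', 'customer_id')]
```

Real inference also has to handle sampling, type compatibility, and false positives (e.g. two unrelated integer columns with overlapping ranges), which is where a production tool earns its keep.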
Orienting schema graphs around particular nodes
Most analytics and AI problems are oriented around a particular dimension. With a graph structure we can easily manipulate the schema graph around the dimension of interest. For this example we'll orient the `/usr/local/lake/cust_data` sample schema around `cust.csv`. To do this we'll assign `cust.csv` as the parent node:
- Under Actions in the schema graph viewer click Assign Parent Node and select `cust.csv` with `depth=1`.
- You should have a subgraph of 3 tables now: `cust.csv`, `orders.csv`, and `notifications.csv`.
- Now that we've oriented the dataset around the `cust.csv` dimension, let's get to compute / data integration.
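Conceptually, assigning a parent node with a depth limit is a breadth-first traversal of the schema graph starting at the parent. A minimal sketch, using the cust_data edges from above (the `orient` function is illustrative, not Kurve's API):

```python
from collections import deque

# Schema edges from the cust_data sample: parent table -> child tables.
edges = {
    "cust.csv": ["orders.csv", "notifications.csv"],
    "orders.csv": ["order_products.csv"],
    "notifications.csv": ["notification_interactions.csv"],
    "notification_interactions.csv": ["notification_interaction_types.csv"],
}

def orient(edges, parent, depth):
    """Return the subgraph nodes within `depth` hops of `parent` (BFS)."""
    seen = {parent}
    frontier = deque([(parent, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue  # depth limit reached; don't expand further
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                frontier.append((child, d + 1))
    return seen

print(sorted(orient(edges, "cust.csv", depth=1)))
# ['cust.csv', 'notifications.csv', 'orders.csv']
```

With `depth=2` the subgraph would also pull in `order_products.csv` and `notification_interactions.csv`.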
Compute graph basics
- Visit the schema graph for `/usr/local/lake/cust_data`.
- Click Actions, Assign Parent Node, select `cust.csv` with `depth = 1`.
- Click Actions, Compute Graph and plug in the following values:
- Name: kurve cust demo
- Parent Node: cust.csv
- Depth limit: 1
- Compute period in days: 365
- Cut date: 05/01/2023
- Label period in days: 90
- Label node: notifications.csv
- Label field: id
- Label operation: count
- Notice the color coding of the parent node, `cust.csv`, and the label node, `notifications.csv`.
- Let's make sure our parameters are correct: click Actions, Show compute graph details. Here is what it should look like:
- If all of that looks good, let's execute the compute graph:
  - Click Actions.
  - Click Execute compute graph.
  - Navigate home.
  - You should now see a data source under My data sources; click it and find the `kurve_cust_demo_train` table (or whatever name you used) and click on the table.
  - Notice all of the columns, but we should only have 4 rows: this is all 3 tables aggregated and integrated to the `cust.csv` dimension with point-in-time correctness based on the above parameters.
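Point-in-time correctness here means: feature aggregations only see rows from the compute period *before* the cut date, while the label is aggregated over the label period *after* it, so the training data never leaks the future. A minimal sketch of that split, with made-up rows and field names (this is the concept, not Kurve's implementation):

```python
from datetime import date, timedelta

# Parameters mirroring the compute-graph form above.
cut_date = date(2023, 5, 1)
compute_period = timedelta(days=365)
label_period = timedelta(days=90)

orders = [  # (customer_id, order timestamp) -- illustrative data
    (1, date(2022, 11, 3)), (1, date(2023, 2, 14)), (2, date(2021, 1, 9)),
]
notifications = [  # (customer_id, notification timestamp)
    (1, date(2023, 5, 20)), (1, date(2023, 9, 1)), (2, date(2023, 6, 2)),
]

# Features: count orders inside (cut_date - compute_period, cut_date].
features = {}
for cid, ts in orders:
    if cut_date - compute_period < ts <= cut_date:
        features[cid] = features.get(cid, 0) + 1

# Label: count notifications inside (cut_date, cut_date + label_period].
labels = {}
for cid, ts in notifications:
    if cut_date < ts <= cut_date + label_period:
        labels[cid] = labels.get(cid, 0) + 1

print(features)  # {1: 2} -- customer 2's order predates the compute period
print(labels)    # {1: 1, 2: 1} -- the Sept. 1 notification is past the label period
```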
Node-level operations customization (basic)
In the compute graph created in the above example we leaned into Kurve's automation, but let's customize some things and see the effect.
- Visit the compute graph under `/usr/local/lake/cust_data`.
- Click the parent node `cust.csv`.
- Click Edit Node and under Annotations / special selects enter the following: `select *, length(name) as name_length`
- Under Filters enter the following: `where name_length < 4`
- Execute the compute graph.
- Navigate home and click on the compute graph output table under My data sources; you should now see only 3 rows.
In summary, we annotated the `cust.csv` node with a new column called `name_length` and then filtered the dataset to only contain rows where `name_length < 4`, which filtered out one row. This highlights how we can customize compute graphs in a basic way. The next example will do this in a more advanced way.
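The annotation-then-filter steps above boil down to ordinary SQL that Kurve generates for you. A sketch of the equivalent query, run here via Python's `sqlite3` on a toy `cust` table (the row values are illustrative, not the demo's actual data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table cust (id integer, name text)")
con.executemany("insert into cust values (?, ?)",
                [(1, "Ann"), (2, "Bob"), (3, "Cy"), (4, "Diana")])

# Annotation / special select adds the derived name_length column;
# the node-level filter is then applied on that derived column.
rows = con.execute("""
    with annotated as (
        select *, length(name) as name_length from cust
    )
    select * from annotated where name_length < 4
""").fetchall()

print(rows)  # 3 rows: "Diana" (name_length 5) is filtered out
```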
Node-level operations customization (advanced)
We'll continue with the same schema and compute graph from the prior examples.
Post-join annotation
We need to compute the difference between the `notifications.csv` and the `orders.csv` timestamps, which requires that both be joined to `cust.csv` before a `DATEDIFF` operation is applied between the dates. The order of operations is as follows:
- Join `orders.csv` to `cust.csv` without aggregating.
- Join `notifications.csv` to `cust.csv` without aggregating.
- Compute the date difference between the `orders.csv` timestamp and the `notifications.csv` timestamp.
To accomplish this in Kurve we just need to specify a post-join annotation definition on `cust.csv` as follows:
Notice the highlighted portion of the screenshot, which shows the nodes that the post-join annotate operation depends on. This tells Kurve that to apply the operation defined under post-join annotate we need both `orders.csv` and `notifications.csv` merged first. The SQL we're running is:
select *,
DATEDIFF('DAY', ord_ts, not_ts) as order_notification_timediff
Additionally, we need to update both edges to not be aggregated prior to join:
Finally, we can execute the compute graph. Once it has executed, notice the newly added column in the dataset. Also notice that we no longer have just 4 rows, since we did not aggregate the child relationships!
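The unaggregated joins plus the post-join annotation can be sketched in plain SQL; here via `sqlite3`, which has no `DATEDIFF`, so `julianday()` stands in for `DATEDIFF('DAY', ord_ts, not_ts)`. Table contents are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table cust (id integer)")
con.execute("create table orders (customer_id integer, ord_ts text)")
con.execute("create table notifications (customer_id integer, not_ts text)")
con.execute("insert into cust values (1)")
con.execute("insert into orders values (1, '2023-01-10')")
con.executemany("insert into notifications values (?, ?)",
                [(1, "2023-01-15"), (1, "2023-01-05")])

# Both children are joined to cust WITHOUT aggregating, then the
# date difference is computed per joined row (the post-join annotation).
rows = con.execute("""
    select c.id,
           cast(julianday(n.not_ts) - julianday(o.ord_ts) as integer)
               as order_notification_timediff
    from cust c
    join orders o on o.customer_id = c.id
    join notifications n on n.customer_id = c.id
""").fetchall()

print(rows)  # one row per order/notification pair, not one row per customer
```

This is why the output grows past 4 rows: without aggregation, each customer fans out to one row per joined child combination.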
Post-join filters
In post-join annotation we created a new column that depended on 2 foreign relationships. We'll define a filter which depends on the same 2 relationships and applies to the column created by post-join annotate.
Now re-execute the compute graph and view the output. There shouldn't be any negative values for `order_notification_timediff`:
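The post-join filter simply drops the rows where the annotated column is negative, i.e. where the notification happened before the order. A minimal sketch over illustrative `(customer_id, order_notification_timediff)` rows:

```python
rows = [  # (customer_id, order_notification_timediff) -- made-up values
    (1, 5),
    (1, -5),
    (2, 12),
]

# Post-join filter: keep only non-negative time differences.
filtered = [r for r in rows if r[1] >= 0]
print(filtered)  # [(1, 5), (2, 12)]
```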
Snowflake semantic views
We're using the SNOWFLAKE_SAMPLE_DATA.TPCH_SF1
database for this example, which is in all Snowflake accounts. This is also one of the data sources used in the Snowflake semantic views documentation.
Connect Snowflake source
Navigate to the data sources section of Kurve and add the Snowflake data source as shown in the screenshot:
Infer relationships between Snowflake tables
After adding the Snowflake data source we can see it under My data sources in the homepage. Now we'll go ahead and click Create graph to have Kurve infer all of the primary keys, foreign keys, and date keys between tables:
Inspect Kurve metadata graph output
After Kurve finishes running we can open the metadata graph and view what it extracted:
- Notice the icons next to the columns of the `ORDERS` table in the image.
- It looks like Kurve selected `O_ORDERKEY` as the primary key and `O_ORDERDATE` as the date key.
- Notice the relationships / edges on the right hand side.
- These are what we'll use to dynamically generate the Snowflake semantic view.
- These are what we'll use to dynamically generate the Snowflake semantic view
Generate the Snowflake Semantic View
Under Actions you will see Semantic generator if the data source is Snowflake:
After executing, a pop-up will appear with the semantic view Snowflake SQL definition pre-populated with:
- primary keys for all tables in the graph
- relationships for all tables in the graph
- metrics, dimensions, and facts are left for the user to define
Finally, we can copy/paste this into Snowflake, add final modifications, and execute:
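To make the generation step concrete, here is a sketch of how inferred metadata (primary keys and relationships) can be turned into a `CREATE SEMANTIC VIEW` skeleton. The clause layout follows Snowflake's semantic-view documentation, but verify the exact syntax against your Snowflake version before running it; the generator function and aliases are illustrative, not Kurve's internals, and metrics/dimensions/facts are left to the user as noted above:

```python
# Inferred metadata: alias -> (fully qualified name, primary key column).
tables = {
    "orders": ("SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS", "O_ORDERKEY"),
    "customer": ("SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER", "C_CUSTKEY"),
}
relationships = [  # (child alias, fk column, parent alias)
    ("orders", "O_CUSTKEY", "customer"),
]

def semantic_view_ddl(name, tables, relationships):
    """Build a semantic-view DDL skeleton from inferred keys and edges."""
    table_clauses = ",\n    ".join(
        f"{alias} AS {fqn} PRIMARY KEY ({pk})"
        for alias, (fqn, pk) in tables.items())
    rel_clauses = ",\n    ".join(
        f"{child}_to_{parent} AS {child} ({fk}) REFERENCES {parent}"
        for child, fk, parent in relationships)
    return (f"CREATE OR REPLACE SEMANTIC VIEW {name}\n"
            f"  TABLES (\n    {table_clauses}\n  )\n"
            f"  RELATIONSHIPS (\n    {rel_clauses}\n  );")

print(semantic_view_ddl("tpch_demo_sv", tables, relationships))
```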