Quickstart
Full code here
The most simple graph is two nodes with no time component, no aggregations,
and no labels. We are using customer (cust.csv) and order data (orders.csv).
In plain terms what the below code does is as follows:
- Create prefixes for each node so we always know where the column originated from after joinining the two datasets.
- Instantiate two
DynamicNodeinstances: one node forcust.csvand another fororders.csv. - Instantiate the
GraphReduceobject to house the compute graph. We are specifying that thecust.csvnode is theparent_node, which means all data will be joined to and aggregated to thecust.csvnode. In cases where we reduce the data the resulting dataset should be at the ganularity of theparent_nodedimension. - Add the nodes.
- Add the edge between the nodes.
- Execute the computation with
GraphReduce.do_transformations()the primary entrypoint to execution. - Dump out the head of the computed dataframe.
from graphreduce.node import GraphReduceNode, DynamicNode
from graphreduce.graph_reduce import GraphReduce
from graphreduce.enum import ComputeLayerEnum as GraphReduceComputeLayerEnum, PeriodUnit
# Need unique prefixes for all nodes
# so when columns are merged we know
# where they originate from.
prefixes = {
'cust.csv' : {'prefix':'cu'},
'orders.csv':{'prefix':'ord'}
}
# create graph reduce nodes
gr_nodes = {
f.split('/')[-1]: DynamicNode(
fpath=f,
fmt='csv',
pk='id',
prefix=prefixes[f]['prefix'],
date_key=None,
compute_layer=GraphReduceComputeLayerEnum.pandas,
compute_period_val=730,
compute_period_unit=PeriodUnit.day,
)
for f in files.keys()
}
# Instantiate GraphReduce with params.
# We are using 'cust.csv' as parent node
# so the granularity should be at the customer
# dimension.
gr = GraphReduce(
name='starter_graph',
parent_node=gr_nodes['cust.csv'],
fmt='csv',
cut_date=datetime.datetime(2023,9,1),
compute_layer=GraphReduceComputeLayerEnum.pandas,
auto_features=True,
auto_feature_hops_front=1,
auto_feature_hops_back=2,
label_node=gr_nodes['orders.csv'],
label_operation='count',
label_field='id',
label_period_val=60,
label_period_unit=PeriodUnit.day
)
# Add the nodes to the GraphReduce instance.
gr.add_node(gr_nodes['cust.csv'])
gr.add_node(gr_nodes['orders.csv'])
gr.add_entity_edge(
parent_node=gr_nodes['cust.csv'],
relation_node=gr_nodes['orders.csv'],
parent_key='id',
relation_key='customer_id',
reduce=True
)
gr.do_transformations()
gr.parent_node.df.head()
cu_id cu_name ord_customer_id ord_id_count ord_customer_id_count ord_ts_min ord_ts_max ord_amount_count ord_customer_id_dupe ord_id_label
0 1 wes 1 3 3 2023-05-12 2023-09-02 3 1 3
1 2 ana 2 3 3 2022-08-05 2023-10-15 3 2 3
2 3 caleb 3 1 1 2023-06-01 2023-06-01 1 3 1
3 4 luly 4 2 2 2024-01-01 2024-02-01 2 4 2