jacopotagliabue

Open Source

MLSys-NYU-2022

# MLSys-NYU-2022 Slides, scripts and materials for the _Machine Learning in Finance_ course at NYU Tandon, 2022. ## Overview DISCLAIMER: A significant part of our course is class participation (this is why, in the end, we have universities and not *just* books and repos!), and no amount of scripts can provide the same level of educational content, or a comparable experience. Please note that this course changes substantially every year, so the best way to keep up to date with us is... by enrolling in the [Master](https://engineering.nyu.edu/academics/programs/financial-engineering-ms)! TL;DR: This repository contains some of the teaching materials by Prof. [Ethan Rosenthal](https://www.ethanrosenthal.com/) and [myself](https://jacopotagliabue.it/) for the 2022 course in ML at the NYU Tandon School of Engineering. The course is presented as an introduction to Machine Learning with Finance use cases and industry-standard tools. We open source slides, code snippets and assignments after the class is completed, hoping to benefit the broader community of Machine Learning students and practitioners; I had a calculus book that said, "What one fool can do, another can.", and I wish more and more fools could become proficient at building reliable, trust-worthy, well-crafted ML systems. We feel there are now enough books and YouTube videos for people interested purely in the theory of ML; moreover, practitioners produce a much bigger marginal value when bringing into the class their day-to-day experience, which, for the time being, cannot be as easily found on YouTube. Therefore, the course we run is very practical and focuses on the intuitive understanding of ML problems and their solutions _through real-world tools_: we emphasize the importance of good coding habits, and the use of industry standard methodology, over complex modelling and formulas (alas, we do indeed sometimes need to talk about math). The whole course runs in 14 weeks, but we cover arguments that would keep you busy for a lifetime: every lecture, every slide, every code snippet are the result of many explicit and implicit trade-offs - what should we cover, what should we not? While no material can substitute for real-world interactions and our great sense of humour, we leave for the open source community to judge how useful the trade-offs we picked actually are. ## At a glance ### Main themes The course is structured around 14 weeks: 13 weeks of lectures, and 1 final demo day for students (organized in teams) to present an end-to-end machine learning project that showcases what they learned in the course. Main topics, roughly in order of appearance: * Introduction to ML in Finance: use cases, tools, the rise of MLOps. * Python setup for scientific computing: notebooks, environments, dependencies. * ML best practices: dataset split, hyper-parameter tuning. * Modelling: classification, regression. * Use case deep dive I: fraud detection. * MLOps best practices: experiment tracking, DAG-based pipelines, deployment. * Rounded evaluation: slice-based metrics, behavioral testing. * Introduction to embeddings: skip-gram, similarity in a latent space. * Use case deep dive II: recommender systems. ### Repo structure This year's repository is structured by week: each week has its folder, with a self-contained `README`, scripts / notebooks and slides. The choice results in some redundancy (especially in the second part, where the same training loop is used several times), but provides more clarity for students pacing themselves through the course, and highlights the highly modular nature of the syllabus. As part of the course, we emphasize the importance of virtual environments and submitting properly structured projects (notebooks are great, but we leave them for experimentation!). Each week contains a devoted `requirements.txt` file to make sure the scripts are reproducible. ### Changelog Compared to [2021 edition](https://github.com/jacopotagliabue/FREE_7773), most material is indeed new and re-vamped. As a non-exhaustive list: * new intro to Python / coding good practices; * fraud detection as a finance-specific ML use case; * NLP section has been replaced by a RecSys deep dive; * new section on the importance of metrics, and expanded discussion on evaluating models; * new tools: Metaflow sandbox and Streamlit apps. ## Tooling overview ### Intro to Python and Git Python is the main language for Machine Learning, but it is surprisingly hard to set up a [working environment](https://calmcode.io/virtualenv/intro.html). We introduce virtualenv and basics of git to get you started. #### Metaflow [Metaflow](https://metaflow.org/) is an open-source tool designed to simplify building, maintaining and deploying ML pipelines (e.g. [here](https://github.com/jacopotagliabue/you-dont-need-a-bigger-boat)). Starting this year, Outerbounds provided us with free [sandbox accounts](https://outerbounds.com/docs/sandbox/) (thank you!). #### Streamlit [Streamlit](https://streamlit.io/) turns Python scripts into web apps in minutes, helping with prototyping and sharing the results of our pipelines. Streamlit apps can be used to display artifacts from Metaflow, and [make the model interactive for non-technical stakeholders](https://github.com/jacopotagliabue/MLSys-NYU-2022/blob/main/weeks/10/app/app.py). #### Comet [Comet](https://www.comet.com/site/) is a machine learning platform that can help you manage, visualize, and optimize training runs. We use Comet to keep track of our experiments, and document our progress with the rest of our team and technical stakeholders (thank you for the free account!). #### Flask [Flask](https://flask.palletsprojects.com/en/2.2.x/) is a micro web framework written in Python. It allows us to build (in Python) APIs that serve predictions made by our trained model *in real-time*, and display the results in the browser. ## Acknowledgments Thanks to all outstanding people quoted and linked in the slides: this course is possible only because we truly stand on the shoulders of giants. Special thanks also to: * [Hugo](https://www.linkedin.com/in/hugo-bowne-anderson-045939a5/) for being our fantastic guest speaker on Metaflow; * [Chip](https://www.linkedin.com/in/chiphuyen/) for being our fantastic guest speaker on MLOps; * [Ciro](https://www.linkedin.com/in/cirogreco/) for being our fantastic guest speaker on industry applications of ML and ML careers; * [Gideon](https://www.linkedin.com/in/gideon-mendels/) for fantastic support and free Comet accounts for all the students. ## Suggested complementary / additional readings The main topics - Regression, Classification, Time Series, Fraud Detection, MLOps, RecSys etc. - are all huge, and we could obviously just scratch the surface. Aside from all the references to be found in the slides and READMEs, these are few good places to further explore this world. ### Machine Learning * [Deep Learning with Python](https://www.amazon.com/Learning-Python-Second-Fran%C3%A7ois-Chollet/dp/1617296864): great practical intro to ML concepts and the basics of DL using Keras. * [Machine Learning with PyTorch and Scikit-Learn](https://www.amazon.com/Machine-Learning-PyTorch-Scikit-Learn-learning-ebook/dp/B09NW48MR1): another great practical intro to ML / DL, focused on PyTorch. * [RecList](https://github.com/jacopotagliabue/reclist): our own open source project (library, paper, tutorials) for better testing of recommender systems. ### MLOps * [Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications](https://www.amazon.com/Designing-Machine-Learning-Systems-Production-Ready/dp/1098107969): a book by Chip Huyen on how machine learning systems are designed, end to end - excellent introduction to basic MLOps concepts. * [Effective Data Science Infrastructure](https://www.manning.com/books/effective-data-science-infrastructure): a book by Ville Tuulos - Metaflow creator - on how to make data scientist productive. * [You Don't Need a Bigger Boat](https://github.com/jacopotagliabue/you-dont-need-a-bigger-boat): our own fully open source repository showing how state-of-the-art ML systems can be built at scale, component after component. * [Comet for Data Science](https://www.packtpub.com/product/comet-for-data-science/9781801814430): on how to take advantage of existing platform for experiment tracking, collaboration and managing the life-cycle of ML artifacts. ## Contacts For questions, feedback, comments, please drop me a message at: `jacopo dot tagliabue at nyu.edu`.

Education & Learning

556 Github Stars

post-modern-stack

# The Post-Modern Stack Joining the modern data stack with the modern ML stack ## Overview As part our TDS series on MLOps, our [blog post](https://towardsdatascience.com/the-post-modern-stack-993ec3b044c1) shows how a _post-modern stack_ works, by deconstructing (see the pun?) our original [YDNABB](https://github.com/jacopotagliabue/you-dont-need-a-bigger-boat) repo into the few fundamental pieces owning the actual compute: a data warehouse for dataOps, and Metaflow on AWS for MLOps. A quick, high-level walk-through of the stack can be found in our intro video: [![YouTube intro video](/images/youtube.png)](https://www.youtube.com/watch?v=5kHDb-XGHtc) As a use case, we pick a popular RecSys challenge, session-based recommendation: given the interactions between a shopper and some products in a browsing session, can we train a model to predic what the next interaction will be? The flow is powered by our open-source [Coveo Data Challenge dataset](https://github.com/coveooss/SIGIR-ecom-data-challenge) - as model, we train a vanilla LSTM, a model just complex enough to make good use of cloud computing. At a quick glance, this is what we are building: ![The post-modern stack](/images/stack.jpg) As usual, we show a working, end-to-end, real-world flow: while you can run it locally with few thousands sessions to get the basics, we suggest you to use the `MAX_SESSIONS` variable to appreciate how well the stack scales - with no code changes - as millions of events are pushed to the warehouse. For an in-depth explanation of the philosophy behind the approach, please check the companion [blog post](https://towardsdatascience.com/the-post-modern-stack-993ec3b044c1), and the previous episodes / repos in [the series](https://towardsdatascience.com/tagged/mlops-without-much-ops). ## Pre-requisites The code is a self-contained recommender project; however, since we leverage best-in-class tools, some preliminary setup is required. Please make sure the requirements are satisfied, depending on what you wish to run - roughly in order of ascending complexity: _The basics: Metaflow, Snowflake and dbt_ A Snowflake account is needed to host the data, and a working Metaflow + dbt setup is needed to run the flow; we *strongly* suggest to run `Metaflow on AWS` (as it is the intended setup), but with some minor modifications you should be able to run the flow with a local store as well. * _Snowflake account_: [sign-up for a free trial](https://signup.snowflake.com). * _AWS account_: [sign-up for a free AWS account](https://aws.amazon.com/free/). * _Metaflow on AWS_: [follow the setup guide](https://docs.metaflow.org/metaflow-on-aws). * _dbt core setup_: on top of installing the open source package (already included in the `requirements.txt`), you need to point dbt to your Snowflake instance with the proper [dbt_profile](https://docs.getdbt.com/dbt-cli/configure-your-profile), Make sure the SCHEMA there matches with what is specified in the `.env` file (`SF_SCHEMA`). _Adding experiment tracking_ * _Comet ML_: [sign-up for free](https://www.comet.ml/signup) and get an api key. If you don't want experiment tracking, make sure to comment out the Comet specific parts in the `train_model` step. _Adding PaaS deployment_ * _SageMaker setup_: To deploy the model as a PaaS solution using SageMaker, the `IAM_SAGEMAKER_ROLE` parameter in the flow needs to contain a suitable IAM ROLE to deploy an endpoint and access the s3 bucket where Metaflow is storing the model artifact; if you don't wish to deploy your model, run the flow with `SAGEMAKER_DEPLOY=0` in the `.env` file. _Adding dbt cloud_ * _dbt cloud account_: [sign-up for free](https://www.getdbt.com/signup) and get an api key. If you don't wish to use dbt cloud but just the local setup,set `DBT_CLOUD=0` in the `.env` file. ## Setup ### Virtual env Setup a virtual environment with the project dependencies: * `python -m venv venv` * `source venv/bin/activate` * `pip install -r requirements.txt` NOTE: the current version of RecList has some old dependencies which may results in some (harmless) pip conflicts - conflicts will disappear with the new version, coming out soon. Create a local version of the `local.env` file named only `.env` (do _not_ commit it!), and make sure to fill its values properly: | VARIABLE | TYPE | MEANING | | ------------- | ------------- | ------------- | | SF_USER | string | Snowflake user name | | SF_PWD | string | Snowflake password | | SF_ACCOUNT | string | Snowflake account | | SF_DB | string | Snowflake database | | SF_SCHEMA | string (suggested: POST_MODERN_DATA_STACK) | Snowflake schema for raw and transformed data | | SF_TABLE | string (COVEO_DATASET_RAW) | Snowflake table for raw data | | SF_ROLE | string | Snowflake role to run SQL | | APPLICATION_API_KEY | uuid (474d1224-e231-42ed-9fc9-058c2a8347a5) | Organization id to simulate a SaaS company | | MAX_SESSIONS | int (1000) | Number of raw sessions to load into Snowflake (try first running the project locally with a small number) | | EN_BATCH | 0-1 (0) | Enable/disable cloud computing for @batch steps in Metaflow (try first running the project locally) | | COMET_API_KEY | string | Comet ML api key | | DBT_CLOUD| 0-1 (0) | Enable/disable running dbt on the cloud | | SAGEMAKER_DEPLOY| 0-1 (1) | Enable/disable deploying the model artifact to a Sagemaker endpoint | | DBT_ACCOUNT_ID | int | dbt cloud account id (you can find it in the dbt cloud URL) | | DBT_PROJECT_ID | int | dbt cloud project id (you can find it in the dbt cloud URL) | | DBT_JOB_ID | int | dbt cloud job id (you can find it in the dbt cloud URL) | | DBT_API_KEY| string | dbt cloud api key | ### Load data into Snowflake Original datasets are from the Coveo SIGIR Data Challenge. To save you from downloading the original data dump and dealing with large text files, we re-used the abstraction over the data provided by RecList. If you run `upload_to_snowflake.py` in the `upload` folder from your laptop as a one-off script, the program will download the Data Challenge dataset and dump it to a Snowflake table that simulates the [append-only log pattern](https://towardsdatascience.com/the-modern-data-pattern-d34d42216c81). This allows us to use dbt and Metaflow to run a realistic ELT and ML code over real-world data. Once you run the script, check your Snowflake for the new schema/table: ![Raw table in Snowflake](/images/raw_table.png) If you wish to see how a data ingestion pipeline works (i.e. an endpoint streaming into Snowflake real-time, individual events, instead of a bulk upload), we open-sourced a [serverless pipeline](https://github.com/jacopotagliabue/paas-data-ingestion) as well. ### dbt While we will run dbt code as part of Metaflow, it is good practice to try and see if everything works from a stand-alone setup first. To run and test the dbt transformations, just `cd` into the `dbt` folder and run `dbt run --vars '{SF_SCHEMA: POST_MODERN_DATA_STACK, SF_TABLE: COVEO_DATASET_RAW}'`, where the [variables](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/using-variables) reflect the content of your `.env` file (you can also run `dbt test`, if you like). Once you run dbt, check your Snowflake for the views: ![Views in Snowflake](/images/after_dbt.png) The `DBT_CLOUD` variable (see above) controls whether transformations are run from _within the flow folder_, or from a dbt cloud account, by using dbt API to trigger the transformation on the cloud platform. If you want to leverage dbt cloud, make sure to manually [create a job](https://docs.getdbt.com/docs/dbt-cloud/cloud-quickstart#create-a-new-job) on the platform, and then configure the relevant variables in the `.env` file. In our tests, we used the exact same `.sql` and `.yml` files that you find in this repository: <img src="/images/dbt_cloud.png" height="250"> Please note that instead of having a local dbt folder, you could have your dbt code in a Github repo and then either clone it using Github APIs at runtime, or import it in dbt cloud and use the platform to run the code base. ## How to run (a.k.a. the whole enchilada) ### Run the flow Once the above setup steps are completed, you can run the flow: * cd into the `src` folder; * run the flow with `METAFLOW_PROFILE=metaflow AWS_PROFILE=tooso AWS_DEFAULT_REGION=us-west-2 python my_dbt_flow.py --package-suffixes ".py" run --max-workers 4`, where `METAFLOW_PROFILE` is needed to select a specific Metaflow config (you can omit it, if you're using the default), `AWS_PROFILE` is needed to select a specific AWS config that runs the flow and it's related AWS infrastructure (you can omit it, if you're using the default), and `AWS_DEFAULT_REGION` is needed to specify the target AWS region (you can omit it, if you've it already specified in your local AWS PROFILE and you do not wish to change it); * visualize the performance card with `METAFLOW_PROFILE=metaflow AWS_PROFILE=tooso AWS_DEFAULT_REGION=us-west-2 python my_dbt_flow.py card view test_model --id recCard` (see below for an intro to [RecList](https://github.com/jacopotagliabue/reclist)). ### Results If you run the fully-featured flow (i.e. `SAGEMAKER_DEPLOY=1`) with the recommended setup, you will end up with: * an up-to-date view in Snowflake, leveraging dbt to make raw data ready for machine learning; * versioned datasets and model artifacts in your AWS, accessible through the standard [Metaflow client API](https://docs.metaflow.org/metaflow/client); * a Comet dashboard for experiment tracking of the deep learning model, displaying training stats; * a versioned Metaflow card containing (some of) the tests run with RecList (see below); * finally, a DL-based, sequential recommender system serving predictions in real-time using SageMaker for inference. If you log in into your AWS SageMaker interface, you should find the new endpoint for next event prediction available for inference: ![aws sagemaker UI](/images/aws_sagemaker.png) If you run the flow with dbt cloud, you will also find the dbt run in the history section on the cloud platform, easily identifiable through the flow id and user. ![dbt run history](/images/dbt_run_history.png) ### BONUS: RecList and Metaflow cards The project includes a (stub of a) custom [DAG card](https://outerbounds.com/blog/integrating-pythonic-visual-reports-into-ml-pipelines/) showing how the model is performing according to [RecList](https://github.com/jacopotagliabue/reclist), our open-source framework for behavioral testing. We could devote an [article](https://towardsdatascience.com/ndcg-is-not-all-you-need-24eb6d2f1227) / [paper](https://arxiv.org/abs/2111.09963) just to this (as we actually did recently!); you can visualize it with `METAFLOW_PROFILE=metaflow AWS_PROFILE=tooso AWS_DEFAULT_REGION=us-west-2 python my_dbt_flow.py card view test_model --id recCard` at the end of your run. No matter how small, we wanted to include the card/test as a reminder of _how important is to understand model behavior before deployment_. Cards are a natural UI to display some of the RecList information: since readable, shareable (self-)documentation is crucial for production, RecList new major release will include out-of-the-box support for visualization and reporting tools: reach out if you're interested! As a *bonus* bonus feature (thanks Valay for the snippet!), *only when running with the dbt core setup*, the (not-production-ready) function `get_dag_from_manifest` will read the local manifest file and produce a dictionary compatible with Metaflow Card API. If you type `METAFLOW_PROFILE=metaflow AWS_PROFILE=tooso AWS_DEFAULT_REGION=us-west-2 python my_dbt_flow.py card view run_transformation --id dbtCard` at the end of a successful run, you should see a card displaying the dbt card _as a Metaflow card_, as in the image below: ![dbt card on Metaflow](/images/dbt_card.png) We leave to the reader (and / or to future iterations) to explore how to combine dbt, RecList and other info into a custom, well-designed card! ## What's next? Of course, the post-modern stack can be further expanded or improved in many ways. Without presumption of completeness, these are some ideas to start: * on the dataOps side, we could include some data quality checks, either by improving our dbt setup, or by introducing additional tooling: at [reasonable scale](https://towardsdatascience.com/hagakure-for-mlops-the-four-pillars-of-ml-at-reasonable-scale-5a09bd073da) the greater marginal value is typically to be found in better data, as compared to better models; * on the MLOps side, we barely scratched the surface: one side, we kept the modeling simple and avoid any tuning, which is however very easy to do using Metaflow built-in parallelization abilities; on the other, you may decide to complicate the flow with other tools, improve on serving etc. (e.g. the proposal [here](https://github.com/jacopotagliabue/you-dont-need-a-bigger-boat)). Swapping in-and-out different tools with similar functionalities should be easy: in a previous work, we [abstracted away experiment tracking](https://github.com/jacopotagliabue/you-dont-need-a-bigger-boat/blob/main/local_flow/rec/src/utils.py) and allow users to pick [Neptune](https://neptune.ai/) as an alternative SaaS platform. Similar considerations apply to this use case as well; * a proper RecList for this flow is yet to be developed, as the current proposal is nothing more than a stub showing how easy it is to run a devoted test suite when needed: you can augment the simple suite we prepared, improve the visualization on cards or both - since RecList roadmap is quickly progressing, we expect a deeper integration and a whole new set of functionalities to be announced soon. Stay tuned for our next iteration on this! Is this the *only* way to run dbt in Metaflow? Of course not - in particular, you could think of writing a small wrapper around a flow and a dbt-core project that creates individual Metaflow steps corresponding to individual dbt steps, pretty much like suggested [here](https://www.astronomer.io/blog/airflow-dbt-1/) for another orchestrator. But this is surely a story for another repo / time ;-) ## Acknowledgements Special thanks to Sung Won Chung from dbt Labs, Hugo Bowne-Anderson, Gaurav Bhushan, Savin Goyal, Valay Dave from Outerbounds, Luca Bigon, Andrea Polonioli and Ciro Greco from Coveo. If you liked this project and the related article, please take a second to add a star to this and our [RecList](https://github.com/jacopotagliabue/reclist) repository! Contributors: * [Jacopo Tagliabue](https://www.linkedin.com/in/jacopotagliabue/), general design, Metaflow fan boy, prototype. * [Patrick John Chia](https://www.linkedin.com/in/patrick-john-chia/), model, deployment and testing. ## License All the code in this repo is freely available under a MIT License, also included in the project.

ML Frameworks Data Pipelines & ETL

202 Github Stars

paas-data-ingestion

# paas-data-ingestion Ingest and prepare data with AWS lambdas, Snowflake and dbt in a scalable, fully replayable manner. ## Overview This repository contains a fully PaaS infrastructure for data ingestion and transformation at scale. This repository has a companion [blog post](https://towardsdatascience.com/the-modern-data-pattern-d34d42216c81), to which we refer the reader for in-depth analysis of the proposed patterns and the motivations behind the project. While you can enjoy this repository in isolation, please note that our pipeline is part of a bigger architecture, detailed in our [MLOps without Much Ops series](https://towardsdatascience.com/tagged/mlops-without-much-ops). The ingestion pipeline mimics a typical data flow for data-driven applications: clients send events, an endpoint collects them and dumps them into a stream, finally a data warehouse stores them for further processing: <img src="https://github.com/jacopotagliabue/paas-data-ingestion/blob/main/docs/data_pattern_viz.jpg" width="640"> We make use of three main technologies: * [Pulumi](https://www.pulumi.com/), which allows us to manipulate infrastructure-as-code, and to do so in a language we are very familiar with, Python (this is our first project using it coming from Terraform, so any feedback is very welcome!). * [Snowflake](https://signup.snowflake.com/), which allows us to store raw data at scale and manipulate it with powerful SQL queries, abstracting away all the complexity of distributed computing in a simple API. * [dbt](https://www.getdbt.com/), which allows us to define data transformation as versioned, replayable DAGs, and mix-and-match materialization strategies to suite our needs. Finally, it is worth mentioning that the repository is e-commerce related as a results of mainly three factors: * our own experience in building [pipelines for the industry](https://github.com/jacopotagliabue/you-dont-need-a-bigger-boat); * the availability of high-volume, [high-quality data](https://github.com/coveooss/SIGIR-ecom-data-challenge) to simulate a realistic scenario; * finally, the high bar set by e-commerce as far as data quantity and data-driven applications: due to the nature of the business and the competition, even medium-sized digital shops tend to produce an enormous amount of data, and to run pretty sophisticated ML flows on top it - in other words, something that works for e-commerce is going to reasonably work for many other use cases out of the box, as they are likely less data intensive and less sophisticated ML-wise. Our goal was to keep the code realistic enough for the target use cases, but simple enough as to make it easy for everybody to port this stack to a different industry. ## Prerequisites * _Dataset_: we use the [Coveo Data Challenge dataset](https://github.com/coveooss/SIGIR-ecom-data-challenge) as our real-world dataset (if you prefer to use your own data, you can still re-use the `pumper.py` app if you adapt the format to your input files). * _Snowflake account_: [sign-up for a 30-day free trial](https://signup.snowflake.com). * _Pulumi account_: [sign-up for a free Individual account](https://www.pulumi.com). * _AWS account_: [sign-up for a free AWS account](https://aws.amazon.com/free/) (S3, API Gateway & Lambda have a free tier). Please note that by running this project you *may incur* in some cloud costs: make sure to keep costs and infrastructure monitored and to understand how much your experiments are covered by the free tier. ## Structure The project is divided in three main components and a simulation script. ### Infrastructure Following the infrastructure-as-code paradigm, the folder contains the Pulumi code necessary to properly set up the AWS (lambda function, API Gateway, Cloudwatch, Firehose, etc.) and Snowflake components needed for the ingestion platform to work. For instructions on how to run it, see below. [infrastructure/__main__.py](https://github.com/jacopotagliabue/paas-data-ingestion/blob/main/infrastructure/__main__.py) checks which services have to be created/updated/removed. The script defines: - `s3_logs_bucket` is the S3 bucket we use to store the logs sent to Firehose; - `sf_database` is the database on Snowflake; - `sf_warehouse` is the Snowflake warehouse; - `sf_roles` is a custom module to create and manage two roles: - a Read-Write `*_RW` profile (for admins), - a Read-Only `*_RO` profile; - `sf_stream_schema` is the schema for RAW logs; - `sf_stream_logs_snowpipe` is a custom module to configure Snowpipe and the policies required to read the data stored on your bucket; - `self.firehose` is the [Firehose](https://aws.amazon.com/it/kinesis/data-firehose/) stream where RAW logs are sent by the ingestion endpoint (Lambda Function); - `self.table` is the final table where RAW logs are stored by Snowpipe; - `self.storage_integration`, `self.stage` & `self.pipe` define the Snowflake resource required to allow Snowpipe to read your data; - `aws.s3.BucketNotification` notifies Snowpipe every time a new object is stored (by Firehose) on your bucket; - `api` & `api_stage` define [API Gateway](https://aws.amazon.com/it/api-gateway/), a public entry-point for the Lambda function; - `lambda_api_collect` is a custom module to define the Lambda function for the `/collect` endpoint. ### dbt The folder contains a typical dbt project, whose goal is to collect, organize, version all the transformations needed to go from the raw data ingested by the pixel endpoint to the pre-calculated features and aggregations that can be consumed by downstream processes (BI, ML, etc.). For instructions on how to run the dbt project, see below. We used dbt to process RAW logs and to normalize the data into 3 different schemes: - `EVENTS`, stored on [dbt/models/events/marts](https://github.com/jacopotagliabue/paas-data-ingestion/blob/main/dbt/models/events/marts), contains all the materialized tables (`events`, `logs`, `pageviews`, `product_actions`, `purchases` & `user_agents`) with the data computed during the last execution; - `EVENTS_LIVE`, stored on [dbt/models/events/live](https://github.com/jacopotagliabue/paas-data-ingestion/blob/main/dbt/models/events/live), contains the same tables above (`events`, `logs` etc...) updated in real-time; the views are the result of the union of the materialized table and the new events; - `STATS`, stored on [dbt/models/stats](https://github.com/jacopotagliabue/paas-data-ingestion/blob/main/dbt/models/stats), defines materialized tables (`overview_1h`, `overview_1d`) containing all the main pre-calculated stats a dashboard can use (`pageviews`, `add_to_cart`, `transactions`, `transaction_revenue` etc). We define the queries in macros because the tables stored on the `EVENTS` & `EVENTS_LIVE` schemes are the same but contain data from different analysis periods. For example, the `logs` table containing all the sessionized and enriched logs is used as to power the other tables: - [dbt/models/events/marts/logs.sql](https://github.com/jacopotagliabue/paas-data-ingestion/blob/main/dbt/models/events/marts/logs.sql) is the materialized version of the table (`EVENTS.LOGS`); we use an incremental materialization and the SQL query is defined on the [dbt/macros/models/events/select_logs.sql](https://github.com/jacopotagliabue/paas-data-ingestion/blob/main/dbt/macros/models/events/select_logs.sql) table; - [dbt/models/events/live/logs_live.sql](https://github.com/jacopotagliabue/paas-data-ingestion/blob/main/dbt/models/events/live/logs_live.sql) is the live version of the table (`EVENTS_LIVE.LOGS`); the script does a UNION between all the events stored on `EVENTS.LOGS` with all the new events defined on the `EVENTS.LOGS_STAGED` table ([dbt/models/events/live/logs_staged.sql](https://github.com/jacopotagliabue/paas-data-ingestion/blob/main/dbt/models/events/live/logs_staged.sql)) Both the `EVENTS.LOGS` and the `EVENTS_LIVE.LOGS_STAGED` table uses the same macro [dbt/macros/models/events/select_logs.sql](https://github.com/jacopotagliabue/paas-data-ingestion/blob/main/dbt/macros/models/events/select_logs.sql) but with different filters. Finally, for pedagogical purposes, we have integrated the [UAParser.js](https://github.com/faisalman/ua-parser-js/tree/master/dist) library for parsing the user-agents directly on Snowflake. Please see the *Bonus* section below for a more detailed functional explanation of the overall pattern. As a visual overview, this is the DAG generated by dbt when describing the data transformations we prepared: <img src="https://github.com/jacopotagliabue/paas-data-ingestion/blob/main/docs/dbt_graph.png" width="400"> These screenshots should get you a sense of how the shape of the data changes from the original [log table](https://github.com/jacopotagliabue/paas-data-ingestion/blob/main/docs/raw_sf.png) storing the [JSON event](https://github.com/jacopotagliabue/paas-data-ingestion/blob/main/docs/body_sf.png) from the endpoint, to the final [aggregations](https://github.com/jacopotagliabue/paas-data-ingestion/blob/main/docs/counts_sf.png). ### AWS lambda A simple AWS lambda function implementing a [data collection pixel](https://medium.com/tooso/serving-1x1-pixels-from-aws-lambda-endpoints-9eff73fe7631). Once deployed, the resulting `/collect` endpoint will accept `POST` requests from clients sending e-commerce data: the function is kept simple for pedagogical reason - after accepting the body, it prepares a simple but structured event, and uses another AWS PaaS service, Firehose, to dump it in a stream for downstream storage and further processing. ### Data pumper To simulate a constant stream of events reaching the collect endpoint, we provide a script that can be run at will to upload e-commerce events in the [Google Analytics](https://developers.google.com/analytics/devguides/collection/protocol/v1/parameters) format. The events are based on the real-world clickstream dataset open sourced in 2021, the [Coveo Data Challenge](https://github.com/coveooss/SIGIR-ecom-data-challenge) dataset. By using real-world anonymized events we provide practitioners with a realistic, non-toy scenario to get acquainted with the design patterns we propose. Please cite our work, share / star the repo if you find the dataset useful! ## How to run it Running the stack involves running three operations: * setting up the infrastructure; * send data; * run dbt transformations. ### Setting up the infrastructure 1. [Install Pulumi](https://www.pulumi.com/docs/get-started/install/) on your computer and configure your [Pulumi account](https://www.pulumi.com/docs/reference/cli/pulumi_login/): ```sh pulumi login ``` 2. Jump into the project folder & setup the python venv ```sh cd infrastructure make install ``` 3. Create a new Pulumi Stack: ```sh pulumi stack init dev ``` 4. Configure the new stack with all the required credentials: ```sh # AWS pulumi config set aws:region <value> pulumi config set aws:accessKey <value> pulumi config set aws:secretKey <value> --secret # Snowflake pulumi config set snowflake:region <value> pulumi config set snowflake:account <value> pulumi config set snowflake:password <value> --secret pulumi config set snowflake:username <value> ``` All the configurations will be stored on the `Pulumi.dev.yaml` file. 5. Deploy the stack: ```sh make up ``` Notes: - Make sure to use the [AWS credentiales with Administrator permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/getting-started_create-admin-group.html); - Use the same region for AWS and Snowflake (e.g. `us-east-1`); - Follow [this guide](https://docs.snowflake.com/en/user-guide/admin-account-identifier.html) to identify your Snowflake account & region name. ### Send data with the pumper * To pump data into the newly created stack, `cd` into the [utils/pumper](https://github.com/jacopotagliabue/paas-data-ingestion/blob/main/utils/pumper) folder, and run `make install`. * Download and unzip the [Coveo Data Challenge dataset](https://github.com/coveooss/SIGIR-ecom-data-challenge) into `utils/pumper/dataset`. Make sure the target folder directly contains the `.csv` files, without additional nesting. * Create a copy of `.env.local` named `.env`, and use the dataset path as the value for `DATA_FOLDER`, the AWS url for the lambda function for `COLLECT_URL`. * Run `make pump` to start sending Google Analytics events to the endpoint. The script will run until `N_EVENTS` has been sent (change the variable in your `.env` file to what you like). At every run, [pumper.py](https://github.com/jacopotagliabue/paas-data-ingestion/blob/main/utils/pumper/pumper.py) will send events as they are happening in that very moments: so running the code two times will not produce duplicate events, but events with similar categorical features and different id, timestamp etc. Please note that if you want to jump start the log table by bulk-loading the dataset (or a portion of it) to Snowflake, you can avoid some idle time waiting for events to be sent by using the [copy into](https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html) function over the raw csv. ### Run dbt transformations 1. Jump into the project folder & setup the python venv ```sh cd dbt make install ``` 2. Configure your dbt profile following the [official guide](https://docs.getdbt.com/dbt-cli/configure-your-profile): ```yaml paas-data-ingestion: target: dev outputs: dev: type: snowflake account: <value> user: <value> password: <value> role: ACCOUNTADMIN database: "PAAS_DATA_INGESTION_DEV" warehouse: "PAAS_DATA_INGESTION_DEV_WH" schema: RAW threads: 5 client_session_keep_alive: False ``` 5. Launch DBT: ```sh make dbt-run make dbt-rest make dbt-docs ``` ## Bonus: mix and match materialization options A basic computation in almost any analytics platform is the "sessionization" of clickstream data ingested from the app. That is the post hoc grouping (_inside the data warehouse_) of the uninterrupted stream of events into "sessions", which conventionally are defined with a 30 minutes threshold (i.e. given a user, two events 29 minutes apart are in the same session, if 31 minutes have passed, we now have two sessions): <img src="https://github.com/jacopotagliabue/paas-data-ingestion/blob/main/docs/sessionization.jpg" width="640"> Sessionization requires some heavy lifting with window functions, so we would like to run that computation in batch without much latency constraints every X hours; on the other hand, it would be nice if recent events (e.g. session counts) could still be actionable in the stack. The stylized flow below shows how the tables in the repo can be used to solve the problem within a single stack: <img src="https://github.com/jacopotagliabue/paas-data-ingestion/blob/main/docs/union.jpg" width="640"> Please refer to our [blog post](https://towardsdatascience.com/the-modern-data-pattern-d34d42216c81) for more details. ## Contributors This project has been brought to you with love by: * [Luca Bigon](https://www.linkedin.com/in/bigluck/): design, infrastructure, SQL-ing * [Jacopo Tagliabue](https://www.linkedin.com/in/jacopotagliabue/): design, data A lot of people helped with the [blog post](https://towardsdatascience.com/the-modern-data-pattern-d34d42216c81): please refer to our article for more details. ## License The code is provided "as is" and released under an open MIT License.

FaaS & Serverless Data Pipelines & ETL

70 Github Stars

Software by jacopotagliabue

MLSys-NYU-2022

post-modern-stack

paas-data-ingestion