oreilly_dataeng_book

Cost Effective Data Pipelines example code repository

This repo contains code examples for Cost Effective Data Pipelines, available for purchase wherever books are sold.

Please send any comments, concerns, or problems to [email protected]

Environment setup

Install pyenv
Install pyenv-virtualenv
Install Python 3.8.5:
    pyenv install 3.8.5
Create a virtualenv:
    pyenv virtualenv 3.8.5 oreilly-book
Activate the virtual environment:
    pyenv activate oreilly-book
Clone this repo:
    git clone [email protected]:gizm00/oreilly_dataeng_book.git
    cd oreilly_dataeng_book
    pip install wheel
Install dependencies:
    python -m pip install -r requirements.txt
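
To confirm the setup steps above took effect, a quick check along these lines can be run from the activated environment. This is a minimal sketch, not part of the repo: the helper names `version_matches` and `in_virtualenv` are hypothetical, and the virtualenv check assumes pyenv-virtualenv created the environment in a way that sets `sys.base_prefix` (as the standard `venv` module does).

```python
import sys

# Version installed with pyenv in the steps above.
EXPECTED_VERSION = (3, 8, 5)


def version_matches(version_info=None, expected=EXPECTED_VERSION):
    """Return True when major.minor.micro equals the expected version."""
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:3]) == tuple(expected)


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter prefix differs from the base prefix.

    Assumption: the environment sets sys.base_prefix, as venv does.
    """
    prefix = sys.prefix if prefix is None else prefix
    if base_prefix is None:
        base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return prefix != base_prefix


if __name__ == "__main__":
    print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
    print("Matches 3.8.5:", version_matches())
    print("Inside a virtualenv:", in_virtualenv())
```

If either check prints False, re-run `pyenv activate oreilly-book` before installing the dependencies.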

Running Spark locally

(based on these instructions)
Within the virtualenv created above, run the following:

Download Apache Spark. This material was developed using Spark 3.2.1 with Hadoop 3.2.
Move the tgz file to the directory you will refer to it from, e.g. ~/Development/, and extract it there:
    tar -xvf ~/Development/spark-3.2.1-bin-hadoop3.2.tgz -C ~/Development/

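Add the following to your shell startup file, for example ~/.bash_profile, replacing the path with the location where you extracted Spark:

    export SPARK_HOME="/Users/sev/Development/spark-3.2.1-bin-hadoop3.2"
    export PATH="$SPARK_HOME/bin:$PATH"

Reload the startup file and start pyspark:

    source ~/.bash_profile
    pyspark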
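
If `pyspark` is not found after reloading the startup file, it can help to check that `SPARK_HOME` points at a directory containing `bin/pyspark`. The following is a minimal sketch using only the standard library; the helper name `find_pyspark_launcher` is hypothetical and not part of this repo or of Spark.

```python
import os
from pathlib import Path


def find_pyspark_launcher(environ=None):
    """Return the path to $SPARK_HOME/bin/pyspark, or None if it is missing."""
    environ = os.environ if environ is None else environ
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        return None  # SPARK_HOME is unset or empty
    launcher = Path(spark_home) / "bin" / "pyspark"
    return launcher if launcher.is_file() else None


if __name__ == "__main__":
    launcher = find_pyspark_launcher()
    if launcher is None:
        print("SPARK_HOME is unset or does not contain bin/pyspark")
    else:
        print("Found pyspark launcher at", launcher)
```

A missing launcher usually means the exported path does not match the directory where the archive was extracted.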

If you use the VS Code IDE on macOS, you can run pyspark notebooks with these instructions.

When you start the notebook in VS Code, choose the oreilly-book venv as the Python interpreter path.
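
To check that the notebook picked up the right interpreter, a first cell along these lines may help. It is a sketch under one assumption: that the virtualenv directory is named after the environment, as is typical for pyenv-virtualenv. The helper `kernel_uses_venv` is hypothetical.

```python
import sys
from pathlib import Path

# Virtualenv name used in the setup steps.
VENV_NAME = "oreilly-book"


def kernel_uses_venv(name=VENV_NAME, prefix=None):
    """Return True when the interpreter's prefix directory is named `name`."""
    prefix = sys.prefix if prefix is None else prefix
    return Path(prefix).name == name


print("Interpreter:", sys.executable)
print("Using the", VENV_NAME, "venv:", kernel_uses_venv())
```

If the second line prints False, reselect the interpreter in VS Code and restart the kernel.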