Cost Effective Data Pipelines example code repository
This repo contains code examples for Cost Effective Data Pipelines, available for purchase wherever books are sold.
Please send any comments, concerns, or problems to [email protected]
Environment setup
- Install pyenv
- Install pyenv-virtualenv
- Install python 3.8.5
    pyenv install 3.8.5
- Create virtualenv
    pyenv virtualenv 3.8.5 oreilly-book
- Activate the virtual environment
    pyenv activate oreilly-book
- Clone this repo
    git clone [email protected]:gizm00/oreilly_dataeng_book.git
    cd oreilly_dataeng_book
    pip install wheel
- Install dependencies
    python -m pip install -r requirements.txt
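After these steps, a quick check can confirm that the active interpreter is the one the virtualenv was built from. The snippet below is a minimal sketch and is not part of this repo: the helper name `version_matches` is hypothetical, and the expected version 3.8.5 is taken from the steps above.

```python
import sys

EXPECTED = (3, 8, 5)  # version installed with pyenv in the steps above


def version_matches(actual, expected=EXPECTED):
    """Return True when the first three version components equal `expected`."""
    return tuple(actual[:3]) == tuple(expected)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    if version_matches(sys.version_info):
        print(f"OK: running Python {found}")
    else:
        wanted = ".".join(str(part) for part in EXPECTED)
        print(f"Python {found} is active, expected {wanted}; is the virtualenv activated?")
```

Run it with `python` from inside the activated virtualenv. If a different version is reported, the virtualenv is probably not active in the current shell.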
Running Spark locally
(based on these instructions)
Within the virtualenv created above run the following:
- Download apache-spark. This material was developed using spark 3.2.1 with hadoop 3.2
- Move the tgz file to a place you will refer to it from, e.g. ~/Development/, and extract it there
    tar -xvf ~/Development/spark-3.2.1-bin-hadoop3.2.tgz -C ~/Development/
- Add the following to your shell startup file, for example ~/.bash_profile, adjusting the path to wherever you extracted spark:
    export SPARK_HOME="$HOME/Development/spark-3.2.1-bin-hadoop3.2"
    export PATH="$SPARK_HOME/bin:$PATH"
- Reload the startup file and start pyspark to confirm the setup
    source ~/.bash_profile
    pyspark
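If `pyspark` does not start, the environment variables are the first thing to check. The following is a minimal sketch using only the Python standard library; `check_spark_setup` is a hypothetical helper, not something shipped with this repo or with Spark. It assumes the setup above, where `SPARK_HOME` points at the extracted directory and `$SPARK_HOME/bin` is on `PATH`.

```python
import os
import shutil


def check_spark_setup(env=None):
    """Return a list of problems found with the Spark environment variables."""
    env = os.environ if env is None else env
    problems = []

    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
        return problems

    if not os.path.isdir(spark_home):
        problems.append(f"SPARK_HOME does not point to a directory: {spark_home}")

    # shutil.which searches the given PATH string for an executable named pyspark
    if shutil.which("pyspark", path=env.get("PATH", "")) is None:
        problems.append("pyspark was not found on PATH")

    return problems


if __name__ == "__main__":
    issues = check_spark_setup()
    if issues:
        for issue in issues:
            print("Problem:", issue)
    else:
        print("SPARK_HOME and PATH look consistent")
```

Run it in a new terminal so that it sees the values from the reloaded shell startup file.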
If you use the VSCode IDE on OSX, you can run pyspark notebooks with these instructions
- When you start the notebook in VS Code, choose the oreilly-book venv as the python interpreter path
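To confirm that the notebook picked up the intended interpreter, a first cell along these lines may help. This is a sketch, not part of the repo: `uses_virtualenv` is a hypothetical helper, and it assumes pyenv-virtualenv keeps the environment in a directory named after it, so that `oreilly-book` appears as a component of the interpreter path.

```python
import sys
from pathlib import Path


def uses_virtualenv(executable, env_name="oreilly-book"):
    """Return True if any component of the interpreter path equals env_name."""
    return env_name in Path(executable).parts


# Show which interpreter the notebook kernel is running on
print(sys.executable)
print("oreilly-book selected:", uses_virtualenv(sys.executable))
```

If the second line reports `False`, reselect the interpreter in VS Code and restart the kernel.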