How to install and integrate Jupyter Notebook with PySpark
Here I will describe how to set up IPython Notebook to work smoothly with PySpark, allowing a data scientist to document the history of her exploration while taking advantage of the scalability of Spark and Apache Hadoop.
First of all, I downloaded Spark from the Spark downloads page (the link is given further down). The required packages are listed below.
- Java SE Development Kit (already installed, but the version has not been checked yet)
- Scala Build Tool (not installed yet)
- Spark 1.6.1 (installed already)
- Python 2.6 or higher (mine is 2.7)
- Jupyter Notebook (currently only IPython Notebook is installed; the upgrade to Jupyter is described below)
To check the Java version, run 'java -version' in the terminal; it shows:
- java version "1.7.0_55"
- Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
- Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
To build Spark, I need to install the Scala Build Tool (sbt). Run:
- brew install sbt
Head to the Spark downloads page [http://spark.apache.org/downloads.html], then:
- choose the latest version [1.6.1 (Mar 09 2016)] in step 1
- choose 'Pre-built for Hadoop 2.6 and later' in step 2
- keep step 3 at its default setting
- click the download link in step 4.
Once downloaded, the tgz file is unzipped and moved to my home directory.
Since this latest version already comes with 'sbt assembly' done, there is no need to perform this step as required by earlier versions, e.g., spark-1.5.1. Just run the command below (a quick sanity check inside the shell is sketched right after it):
- ./bin/pyspark ### It is done!
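As a quick sanity check (a minimal sketch; sc is the SparkContext that the pyspark shell creates automatically), typing this at the pyspark prompt should print 3:

>>> sc.parallelize([1, 2, 3]).count()   # count the elements of a tiny RDD
3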
My Python version is 2.7, which is fine. However, I have been using IPython Notebook instead of Jupyter Notebook, so an upgrade to Jupyter is required. Here are the steps:
- Download Anaconda from [https://www.continuum.io/downloads]
- Install Anaconda by running in the terminal:
- bash Anaconda2-4.0.0-MacOSX-x86_64.sh
- cd [the project directory of all my python notebooks]
- jupyter notebook ### Jupyter is installed! If I now check the location of the python/ipython/jupyter executables with, e.g., 'which python', a new path is returned [home directory/anaconda2/bin/python]. This seems OK, but it will cause a problem (a quick check from inside the notebook is sketched below).
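To confirm from inside the notebook which interpreter it is actually running on (a small check I added; not one of the original steps), a cell like this can be used:

import sys
print sys.executable   # should now point into .../anaconda2/bin/python
print sys.version      # should report Python 2.7.x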
The problem is that all the previously installed packages are not visible in the newly opened Jupyter notebook. The solution is simple:
- open PyCharm (or whatever development environment you use for Python)
- change Preferences -> Project Interpreter to the new Python path shown above
- re-add the packages in PyCharm. ### Done! The packages are installed under the new path (a quick way to verify this is sketched below).
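To verify that a re-added package is really picked up from the new Anaconda path (numpy is just an example package here; substitute whichever package you re-installed), its file location can be checked in a notebook cell:

import numpy
print numpy.__file__   # should sit under .../anaconda2/lib/python2.7/site-packages/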
One last thing to configure is the 'SPARK_HOME' path. A simple solution is to:
- add the package 'findspark' in PyCharm
- import it in the notebook
- set the SPARK_HOME environment variable in a cell to the path where Spark is installed
- call findspark.init() to locate Spark. ### All settings are done! Enjoy...
One more nice touch is better typography for IPython notebooks. A simple solution is to:
- open a terminal and run: vim ~/.jupyter/custom/custom.css
- or open 'custom.css' with Sublime Text 2 (or another text editor)
- follow this link [https://github.com/nsonnad/base16-ipython-notebook/blob/master/ipython-3/output/] and choose one of your favourite CSS styles
- copy that CSS file into 'custom.css'. ### A beautiful Jupyter notebook is set up! (A quick way to locate custom.css is sketched below.)
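If you are unsure where Jupyter expects custom.css, the config directory can be queried from Python via jupyter_core (installed together with Jupyter); this is a small sketch, not part of the original steps:

import os
from jupyter_core.paths import jupyter_config_dir

# by default this resolves to ~/.jupyter, so the stylesheet goes in ~/.jupyter/custom/custom.css
print os.path.join(jupyter_config_dir(), 'custom', 'custom.css')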
Demo time!
In [2]:
import findspark
import os
In [3]:
os.environ['SPARK_HOME']= '/Users/wangwei/spark-1.6.1-bin-hadoop2.6'
print os.environ.get('SPARK_HOME')
In [4]:
findspark.init()
In [5]:
import pyspark
sc = pyspark.SparkContext()
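To check that the new SparkContext actually works end to end, a tiny job can be run in the next cell (a minimal sketch; the numbers are arbitrary):

In [6]:
rdd = sc.parallelize(range(100))          # distribute a small list as an RDD
print rdd.count()                         # 100
print rdd.map(lambda x: x * x).take(5)    # [0, 1, 4, 9, 16]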