This work builds on Bryant Crocker's "Using Docker and PySpark" at https://levelup.gitconnected.com/using-docker-and-pyspark-134cd4cab867
docker run -it -p 8888:8888 jupyter/pyspark-notebook
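An optional variant, if you would rather share files than copy them in later: mount your working directory into the container. The Jupyter Docker images use /home/jovyan/work as the default work folder, so this makes the docker cp step below unnecessary.

docker run -it -p 8888:8888 -v "$(pwd)":/home/jovyan/work jupyter/pyspark-notebook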
The terminal will print output like:
To access the server, open this file in a browser:
file:///home/jovyan/.local/share/jupyter/runtime/jpserver-6-open.html
Or copy and paste one of these URLs:
http://9c1bc9bbe072:8888/lab?token=27c58547418746f661649a5798a2cf2165e6670cffb49b59
http://127.0.0.1:8888/lab?token=27c58547418746f661649a5798a2cf2165e6670cffb49b59
Copy the token from the terminal.
Open http://localhost:8888.
Paste the token into the text box.
Test things out:
import pyspark
sc = pyspark.SparkContext('local[*]')
# do something to prove it works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
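takeSample(False, 5) should return a list of five random integers drawn from the range. As an extra sanity check, you can confirm the Spark version:

sc.version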
Download the data from https://data.vermont.gov/Finance/Vermont-Vendor-Payments/786x-sbp3/about_data using the Export button.
Copy the file from the host to the Docker container:
docker cp foo.txt container_id:/foo.txt
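For this walkthrough, that might look like the following (find the container id with docker ps; /home/jovyan is the notebook server's home directory in this image, so a file placed there is visible from JupyterLab):

docker ps
docker cp Vermont_Vendor_Payments.csv container_id:/home/jovyan/Vermont_Vendor_Payments.csv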
Reading some data:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # the test cell only created a SparkContext
df = spark.read.csv('Vermont_Vendor_Payments.csv', header=True, inferSchema=True)
df.show()
df = df.withColumn("Amount", df["Amount"].cast("double"))  # ensure Amount is numeric
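With Amount cast to double, a quick aggregation sketch, assuming the export keeps the Vendor and Amount column names:

from pyspark.sql import functions as F

# total paid per vendor, largest totals first
(df.groupBy('Vendor')
   .agg(F.sum('Amount').alias('total_amount'))
   .orderBy(F.desc('total_amount'))
   .show(10))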