
A to Z Spark, Python in Docker

September 27, 2022

Disclaimer:

This work builds on Bryant Crocker's article at https://levelup.gitconnected.com/using-docker-and-pyspark-134cd4cab867

Pull and run the Docker image (docker run will pull it automatically if it is not present):

docker run -it -p 8888:8888 jupyter/pyspark-notebook

The terminal will print something like:

To access the server, open this file in a browser:
        file:///home/jovyan/.local/share/jupyter/runtime/jpserver-6-open.html
    Or copy and paste one of these URLs:
        http://9c1bc9bbe072:8888/lab?token=27c58547418746f661649a5798a2cf2165e6670cffb49b59
        http://127.0.0.1:8888/lab?token=27c58547418746f661649a5798a2cf2165e6670cffb49b59

Copy the token from the terminal output. Open http://localhost:8888 and paste the token into the text box.
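If you prefer not to copy the token by hand, the same value can be pulled out of the printed URL programmatically. The `extract_token` helper below is a hypothetical illustration (not part of Jupyter or Spark), using only the standard library:

```python
from urllib.parse import urlparse, parse_qs

def extract_token(url):
    """Pull the ?token=... query parameter out of a Jupyter server URL."""
    query = parse_qs(urlparse(url).query)
    return query.get("token", [None])[0]

url = "http://127.0.0.1:8888/lab?token=27c58547418746f661649a5798a2cf2165e6670cffb49b59"
print(extract_token(url))  # prints the token portion of the URL
```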

Test things out:

import pyspark
sc = pyspark.SparkContext('local[*]')

# do something to prove it works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
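takeSample(False, 5) draws 5 distinct elements at random; the False argument means sampling without replacement. On a small in-memory collection, the plain-Python equivalent is random.sample — shown here purely as an analogy, not Spark code:

```python
import random

# Sampling without replacement: 5 distinct values from 0..999,
# analogous to rdd.takeSample(False, 5) on sc.parallelize(range(1000)).
sample = random.sample(range(1000), 5)
print(sample)
```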

Download the data from https://data.vermont.gov/Finance/Vermont-Vendor-Payments/786x-sbp3/about_data by using the export button.

Copy the file from the host to the Docker container (use `docker ps` to find the container ID):

docker cp foo.txt container_id:/foo.txt

Reading some data:

from pyspark.sql import SparkSession

# Create a SparkSession; spark.read is not available on the bare SparkContext
spark = SparkSession.builder.getOrCreate()

df = spark.read.csv('Vermont_Vendor_Payments.csv', header=True, inferSchema=True)
df.show()
# Ensure Amount is numeric; values that cannot be parsed become null
df = df.withColumn("Amount", df["Amount"].cast("double"))
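Spark's cast("double") silently turns unparseable values into null rather than raising an error. The `to_double` helper below is a hypothetical plain-Python sketch of that behavior, not Spark code:

```python
def to_double(value):
    """Mimic Spark's cast('double'): return a float, or None if unparseable."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

# Well-formed amounts parse; bad values and missing data become None
amounts = ["1234.56", "78.90", "not a number", None]
print([to_double(a) for a in amounts])  # → [1234.56, 78.9, None, None]
```

This is why it is worth checking for nulls in Amount after the cast before doing any aggregation.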