Computational Science in the Cloud Institute 2018

Analyzing Large CSVs Part 2

In this section, we turn to the 2015 Yellow Cab dataset. The operations we use are similar to those from the previous section. (Special thanks to Matt Rocklin, creator of Dask and many other great open source Python libraries, for permission to reuse the materials below.)

Remember that all CSV files need to be downloaded to all nodes.

Let’s start by reading in the dataset. We can use the read_csv() function and pass it all the files using a glob. We also need to have pandas parse the date fields.

import dask.dataframe as dd

csv = '/root/yellow_tripdata_2015-*'
df = dd.read_csv(csv, parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

Do a similar basic exploration of the data. You could begin with these questions:

The payment_type column is an integer that encodes the following payment types:

1: 'Credit Card'
2: 'Cash'
3: 'No Charge'
4: 'Dispute'
5: 'Unknown'
6: 'Voided trip'

You could investigate:

Finally, let’s explore the value of computing an index by finding an efficient way to get the first 10 cab rides of every month.

To set an index on a column, we use

df.set_index('<column_name>')

However, setting an index triggers an expensive shuffle, so we should then persist the result to the cluster by passing it to client.persist().

One approach to the above: