
Creating Datasets from Public URLs

Charlie Harrington on November 3, 2017


We're trying to make it easier to discover and use interesting datasets on FloydHub. To that end, we just released an improved file-viewer to help you dig in and explore data directly on FloydHub. For example, here's one of my favorite pups from the Kaggle Cats vs. Dogs dataset.

[Image: dog]

Lots of work still to be done here - like CSV viewers and more - but we're hoping this already makes your deep learning life a little bit more fun.

We also spruced up our Explore page last week. We're now featuring collections of the projects and datasets you'll need to survive the Udacity Deep Learning Nanodegree and Fast.ai Part 1 courses. And more collections are coming soon (hello, deeplearning.ai - we're on to you!)

But - if there's one thing you've told us loud and clear through our forum - it's that you'd like a faster way to create FloydHub datasets directly from public URLs.

And guess what? We agree. Yes, yes - one might argue (and we've tried) that it's technically possible to do this right now, but, really, the process is sort-of (most definitely) annoying. You first need to download the dataset to your own machine - which might take forever if it's large enough - and then upload it back to FloydHub - which might also take forever.

There's got to be another way.

Today, we're releasing the first step towards a brighter dataset future - creating datasets directly from your FloydHub job outputs.

What the - how does this help?

Let me explain. Work with me here.

Let's say you've found a great dataset - like a list of the current members of the United States Congress - and you'd like to turn this CSV into a FloydHub dataset.

Now, before we continue, it's probably worth rehashing why we would even want to create a dataset on FloydHub. Seems like a hassle. Well, creating a dataset on FloydHub has many benefits - but, primarily, it boils down to being able to easily use the data in future GPU-powered training jobs on FloydHub. That's why we're here in the first place!

Okay, next up - let's create a new project on FloydHub called washington and fire up a fresh Jupyter Notebook session from the command line using the floyd-cli tool.

$ floyd init washington
$ floyd run --mode jupyter

Our new Jupyter notebook session should open up automatically in our browser. Once we're inside, let's first head over to the Jupyter terminal to grab the CSV data.


Let's take a look at where we are.

$ pwd
/output
$ ls
command.sh

When running in Jupyter mode, FloydHub automatically places us in the /output directory of our Jupyter notebook instance. Let's create a new folder called congress inside it to hold our data, and then use wget to fetch the CSV into that folder.

$ mkdir congress
$ cd congress
$ wget https://theunitedstates.io/congress-legislators/legislators-current.csv
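If you'd rather stay inside a notebook cell than hop over to the terminal, the same mkdir-and-wget step can be sketched with Python's standard library. This is just an illustration, not something the FloydHub workflow requires, and fetch_to_dir is a made-up helper name.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_to_dir(url: str, dest_dir: str) -> Path:
    """Download `url` into `dest_dir`, keeping the file name from the URL.

    `fetch_to_dir` is a hypothetical helper for this post, not a FloydHub API.
    """
    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    target = target_dir / url.rsplit("/", 1)[-1]   # last URL segment as file name
    urlretrieve(url, str(target))                  # fetch the file and write it to disk
    return target


# Example call (needs network access), run with /output as the working directory:
# fetch_to_dir(
#     "https://theunitedstates.io/congress-legislators/legislators-current.csv",
#     "congress",
# )
```

Either way, the file ends up under the job's /output directory, which is what matters for the later steps.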

Great! Now that we've grabbed the CSV, we can either stop the entire job right now and create a dataset right away - or we can open up a notebook to play around with the data, make sure it's what we want, and even clean it up a bit before saving it as a dataset. Let's do the latter.

Open up a new Python notebook from the main Jupyter session window.


Next, let's explore our data a bit:

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
congress = pd.read_csv("congress/legislators-current.csv")

How about we take a quick peek at our data:

congress.head(5)

[Image: output of congress.head(5)]

I don't like the looks of that district column - there are a lot of missing values. We could try replacing or transforming these missing values, but - for now - let's just drop it entirely from our dataset:

congress.drop(["district"], axis=1, inplace=True)
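If you want more than an eyeball check before dropping a column, one option is to measure how much of each column is actually missing. Here's a minimal sketch using a toy DataFrame with made-up values in place of the real congress data; missing_share is a hypothetical helper name.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame standing in for the congress data (values are made up).
sample = pd.DataFrame(
    {
        "last_name": ["A", "B", "C", "D"],
        "district": [None, 3.0, None, None],
    }
)

print(missing_share(sample))
```

On the real data you would call missing_share(congress) before the drop and look at where district lands in the list.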

Before we go, let's try a chart to make sure we're looking at the whole picture.

parties = congress.groupby('party')
parties.size().plot(kind='bar')

[Image: bar chart of members by party]
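If you'd like the numbers behind the bars as well, the same grouping can be printed directly. A small sketch on a toy DataFrame (the party labels here are placeholders, not real data):

```python
import pandas as pd

# Toy frame standing in for the congress data (values are made up).
sample = pd.DataFrame({"party": ["X", "Y", "X", "X"]})

by_group = sample.groupby("party").size()   # the counts the bar chart plots
by_counts = sample["party"].value_counts()  # same counts, sorted largest first

print(by_group)
print(by_counts)
```

Both calls give the same counts per party; value_counts just sorts them for you.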

Ah, 2017. Finally, let's save our file so that we don't lose our changes to our data (remember, we dropped the district column):

congress.to_csv("congress/legislators-current.csv", index=False)
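One thing to watch when saving: by default, pandas writes the row index as an extra first column, and it comes back as "Unnamed: 0" when you reload the file. Passing index=False avoids that. Here's a quick round-trip sketch on a toy DataFrame written to a temporary folder (values and path are made up):

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned congress data (values are made up).
cleaned = pd.DataFrame({"last_name": ["A", "B"], "party": ["X", "Y"]})

path = os.path.join(tempfile.mkdtemp(), "legislators-current.csv")

cleaned.to_csv(path)  # default: the row index is written as an extra column
with_index = pd.read_csv(path)

cleaned.to_csv(path, index=False)  # skip the index column
without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```

Reloading the saved file like this is also a cheap way to confirm that the district column is really gone before you turn the output into a dataset.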

Now, you can Save and Checkpoint your notebook, and then stop the running CPU job from FloydHub by clicking Cancel. Then you can head over to the Output tab of your job, and click the new Create Dataset button.

[Image: the job's Output tab]

A real friendly-looking modal will pop up and ask which dataset you'd like to use - you can either add to one of your existing datasets or create a nice, new, fresh one right here.

[Image: the Create Dataset modal]

Once that's done, you'll be whisked away to your newly created dataset on FloydHub, now populated with that public URL CSV data you wanted so badly.

[Image: the newly created dataset]

We've made it. You've done it. Now you can reference this new dataset the next time you're running a job!

Whole lotta steps

Yes, this is still cumbersome - but this should make things much faster since you no longer need to download datasets locally to your machine. Let us know what you think! There's a feedback button on the output modal with a short survey, or just hit us up on Twitter or in the forum.