My Introduction To Data Science With F#

You may have seen a big boom lately with web technologies, especially with all of the JavaScript frameworks coming out at a pace of around one a week. But there is another boom you may have heard about…data science.

Now I’ll be the first to admit here that I’m not any kind of expert in data science; not even close. I’ve just been playing around with the idea within F# for a little while and have enjoyed doing a few projects with it, and then thought that experience would make for a good post.

While the main languages of data science are R and Python, F# can also be used to great effect in this realm. If there is a need to use R, it even has its own F# type provider, and Python can be used from F# through the Python for .NET library. With both languages able to interoperate with F#, along with the power of F# itself, I don’t see any reason not to use F# for data science.

With that in mind, I set out to start messing with a data science project: something that would feel a bit more like a real-world problem but still be simple enough to get my feet wet. I decided to try Leada and did one of their projects. The first one was a fairly small project to get some insight into delivery data for certain zip codes.

The first task was to clean the given CSV data a bit by removing the zip codes that had a delivery count of less than 20.

First off, of course, we need to get our references and namespaces out of the way. I installed FSharp.Data via NuGet and referenced it:

#I "../packages/FSharp.Data.2.0.14/lib/net40"
#r "FSharp.Data.dll"

open FSharp.Data

Then we load our CSV file. Because the type provider reads the file at compile time, this will actually give a compilation error if the CSV file can’t be found.

let deliveryData =
    CsvProvider<"delivery_data.csv">.GetSample()

Now let’s take all the rows from the file and keep only the zip codes that have a delivery count greater than 20. Note that in Seq.countBy id, id is the built-in F# identity function, so each zip code is counted by its own value.

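let validZipCodes =
    deliveryData.Rows
    // Pull out just the pickup zip code of every row.
    |> Seq.map (fun r -> r.Pickup_zipcode)
    // Count how many deliveries each zip code has.
    |> Seq.countBy id
    // Keep zip codes that are present and have more than 20 deliveries.
    |> Seq.filter (fun (z, c) -> c > 20 && z.HasValue)
    |> Seq.map (fun (z, _) -> z.Value)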
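For anyone who hasn’t met id before, here is a minimal sketch of what Seq.countBy id does on its own. The sampleZips list is invented for illustration and is not taken from the delivery data.

```fsharp
// Hypothetical zip codes, invented for this illustration only.
let sampleZips = [ 94102; 94110; 94102; 94103; 94110; 94102 ]

// id returns its argument unchanged, so each zip code is its own grouping key.
let zipCounts =
    sampleZips
    |> Seq.countBy id
    |> Seq.toList

// Keep only the zip codes that occur more than once.
let frequentZips =
    zipCounts
    |> List.filter (fun (_, count) -> count > 1)
    |> List.map fst

printfn "%A" zipCounts
printfn "%A" frequentZips
```

Each element of zipCounts is a (zip code, count) tuple, which is the same shape the pipeline above filters on.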

Next I created a small function to filter by a zip code that is passed in.

let filterByZip (pickupZip: System.Nullable<int>) inputZip =
    pickupZip.HasValue && pickupZip.Value = inputZip
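To show how the helper treats rows with a missing zip code, here is a small sketch that repeats it and calls it with made-up values. The names present and missing are hypothetical and only exist for this example.

```fsharp
open System

// Same shape as the helper above: true only when the nullable zip code
// holds a value and that value equals the zip code we are looking for.
let filterByZip (pickupZip: Nullable<int>) inputZip =
    pickupZip.HasValue && pickupZip.Value = inputZip

// Hypothetical inputs, invented for illustration.
let present = Nullable 94102
let missing = Nullable<int>()

printfn "%b" (filterByZip present 94102) // true: value present and equal
printfn "%b" (filterByZip present 94110) // false: value present but different
printfn "%b" (filterByZip missing 94102) // false: no value at all
```

Because && short-circuits, Value is never read when HasValue is false, so a missing zip code simply fails the filter instead of throwing.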

The project asks for a count of deliveries per zip code in three purchase price ranges: below 60, between 60 and 120, and above 120. To get these counts I used the function from above and came up with the code below for the first range.

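// For each valid zip code, count its deliveries priced below 60.
[ for zip in validZipCodes do
    yield!
        deliveryData.Rows
        |> Seq.filter (fun i -> filterByZip i.Pickup_zipcode zip && i.Purchase_price < 60m)
        |> Seq.map (fun i -> i.Pickup_zipcode)
        |> Seq.countBy id ]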

When running this on the data, I get results from F# Interactive: a list of two-item tuples, each holding a nullable integer (the zip code) and an integer (that zip code’s count).

(System.Nullable<int> * int) list =
  [(94102, 3647); (94110, 7689); (94109, 5089); (94114, 2849); (94107, 2956);
   (94115, 3970); (94117, 3312); (94111, 1343); (94133, 1649); (94118, 1467);
   (94103, 6214); (94123, 5906); (94105, 1261); (94158, 236); (94132, 930);
   (94121, 369); (94108, 778); (94116, 292); (94112, 226); (94122, 1335);
   (94131, 330); (94539, 39); (94127, 198); (94124, 172); (94903, 13);
   (94104, 247); (94129, 15)]

We can do something similar for the other price ranges just by changing the price condition in the code above and running it against our data.
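As one possible way to cover all three ranges in a single pass, here is a rough sketch that is not what I actually ran. The Delivery record, the sample rows and the priceBand function are all hypothetical stand-ins for the type provider rows, and treating exactly 60 and exactly 120 as part of the middle range is an assumption on my part.

```fsharp
// Hypothetical stand-in for a CSV row; the real rows come from the type provider.
type Delivery = { Zip: int; Price: decimal }

// Invented sample data, not taken from delivery_data.csv.
let deliveries =
    [ { Zip = 94102; Price = 45m }
      { Zip = 94102; Price = 80m }
      { Zip = 94110; Price = 130m }
      { Zip = 94110; Price = 59.99m }
      { Zip = 94102; Price = 20m } ]

// Name the three price ranges. Putting exactly 60 and exactly 120
// in the middle range is an assumption.
let priceBand (price: decimal) =
    if price < 60m then "below 60"
    elif price <= 120m then "60 to 120"
    else "above 120"

// Count deliveries per (zip code, price range) pair in one pass.
let countsByZipAndBand =
    deliveries
    |> Seq.countBy (fun d -> d.Zip, priceBand d.Price)
    |> Seq.toList

printfn "%A" countsByZipAndBand
```

Keying countBy on a tuple means each zip code gets one count per price range, so there is no need to copy the pipeline three times with a different condition.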

Now there’s still a bit of work we can do with this. It can definitely be refactored into more reusable functions, and I can clean the data further to make it easier to read and to actually present our findings from it.

With just this bit done, however, it seems data science can be quite enjoyable and challenging. I’m sure this won’t be my last shot at it.