
Transforming Data with a Functional Approach

Almost without fail, every time folks talk about the benefits of functional programming (FP), someone will mention that it is “easier to reason about.” Speakers and bloggers make it sound obvious that immutable values, pure functions, and laziness lead to code that is easier to understand, maintain, and test. But what does this actually mean when you are in front of a screen, ready to bash something out? What are some actual examples of “easier to reason about?” I recently worked on a project where I found a functional approach (using Clojure) really did bring about some of these benefits. Hopefully this post will provide some real-world examples of how FP delivers these advantages.

The project involved loading and processing about a hundred thousand jobs so they could be indexed by ElasticSearch. The first step is easy enough: just load the data from the filesystem. We’ll use Extensible Data Notation (EDN), a data format that is really easy to use from within Clojure.

(def jobs (edn/read-string (slurp "jobs.edn")))

The `slurp` function takes a filename and returns all its contents as a string. `edn/read-string` parses the EDN in that string and returns a Clojure data structure: in this case, a list where each job is represented as a Clojure map that looks something like this:

{:id 123
 :location "New York, NY"
 :full_description "This job will involve …"
 :industry "Lumber"
 …}
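
One small thing these snippets assume: the `edn/` alias refers to the `clojure.edn` namespace, which has to be required first. A minimal namespace declaration (the namespace name is made up for illustration):

(ns jobs.core
  (:require [clojure.edn :as edn]))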

Pretty straightforward, and nothing really functional-programming-y just yet. However, trying to index these immediately throws a bunch of errors. ElasticSearch complains that some of these jobs have nulls for their locations. So it seems we have some bad data, which raises the question: how many of our jobs are affected? We could manually loop through all those jobs and keep a count of the ones with nulls as locations. Yuck! Instead, let’s try an FP approach. Given a single job, we want to check whether it has `nil` for a location. Then we filter all the jobs we have down to just those `nil` ones and see how many we get.

(count (filter #(nil? (:location %)) jobs))
;; => about a hundred

Note: The semicolon character simply starts a comment and is used here to display the result.

First, the call to `filter` takes two arguments: a function and a collection. The function takes elements of the collection, in this case individual jobs, and returns whether or not the given job’s location (looked up with the `:location` keyword) is `nil` (checked with the `nil?` predicate).
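
If the `#( … )` syntax is unfamiliar: it is Clojure’s shorthand for an anonymous function, with `%` standing for the single argument. The predicate above, written out longhand:

;; #(nil? (:location %)) is equivalent to:
(fn [job]
  (nil? (:location job)))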

Okay, so about a hundred bad jobs out of a hundred thousand. That’s not a whole lot, so we should be okay ignoring those. To do that, we want the opposite of the above and keep only the jobs that *don’t* have nulls for locations.

(def no-nulls (filter #(not (nil? (:location %))) jobs))

Wrapping `nil?` in `not` checks the opposite condition. ElasticSearch can now happily take our jobs and index them just fine.
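
As an aside, Clojure also provides `remove`, which is `filter` with the predicate logically inverted, so the `not` can be dropped entirely. An equivalent definition:

;; remove keeps the elements for which the predicate returns false
(def no-nulls (remove #(nil? (:location %)) jobs))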

We start doing some test searches against our ElasticSearch instance and quickly find that the results contain far more data than we need. Each job comes with a full job description that often contains a huge amount of text, but we don’t need all of that information. We want to create a new field for our jobs that contains a truncated version of this description. That way, the search results can be much smaller, reducing network traffic and the time ElasticSearch spends serializing and marshaling all that data.

Let’s again try the FP approach. When we used `filter` earlier, we were thinking about what had to be done with each individual job (check if a location is `nil`) and then we let `filter` and `count` do the grunt work. Here, what we need to do with each individual job is add a new field. Since jobs are just hashmaps, we can simply add a new key-value pair. As for the truncated description, we can just take the first hundred characters of the full description. This is a bit more involved than the `filter` example, so let’s start by writing a function that handles an individual job.

(defn assoc-truncated-desc
  [job]
  (let [full-desc (:full_description job)]
    (assoc job :truncated_desc
           (subs full-desc 0 (min 100 (count full-desc))))))

This function takes a single `job`, gets the full description and binds it to `full-desc`, then calls `assoc` to create the new key-value mapping. The key is the name of our new field, `:truncated_desc`, and the value is a substring of the first 100 characters (the `min` guards against descriptions shorter than 100 characters, which would otherwise make `subs` throw).

A subtle thing to note is that the original `job` does not change. `assoc` returns a new hashmap with the new key-value pairing. This means `assoc-truncated-desc` is a pure function that takes an existing job and returns a new one, never mutating any state. Let’s say we call it on one of our jobs and find out that 100 characters is not enough. The original job remains untouched, so we can modify the code (say, change 100 to 150) and call it again until it’s just right.
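
To see this purity in action, here is a quick REPL-style sketch; the sample job and its values are made up for illustration:

;; A hypothetical job to experiment with
(def sample-job {:id 123
                 :location "New York, NY"
                 :full_description "This job will involve …"})

(assoc-truncated-desc sample-job)
;; => a new map that also contains :truncated_desc

(:truncated_desc sample-job)
;; => nil, because the original map was never modified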

So we’ve written a function that can process a single job. Hooray! How do we do this with 100,000 jobs? Instead of filtering out some number of jobs as we did with the null locations, we want to apply this transformation to every single job. This is exactly what the `map` function is for. Let’s call it on `no-nulls` from before.

(map assoc-truncated-desc no-nulls)

And that’s it! We get back a new list where every job has the new truncated description field, ready to be indexed and searched. The final code looks like this:

(defn prepare-jobs
  [jobs]
  (map assoc-truncated-desc
       (filter #(not (nil? (:location %))) jobs)))
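
For what it’s worth, the same pipeline can also be spelled with Clojure’s `->>` threading macro, which many find reads more naturally top to bottom. An equivalent sketch:

(defn prepare-jobs
  [jobs]
  (->> jobs
       (filter #(not (nil? (:location %))))  ; drop jobs with nil locations
       (map assoc-truncated-desc)))          ; add the truncated description

Also worth noting: `filter` and `map` both return lazy sequences, so none of this work actually happens until something (here, the indexing step) consumes the jobs. That is the laziness mentioned at the start of the post.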

We started with a bunch of raw jobs from the filesystem and turned them into something that fits our needs. An imperative approach might have you start by writing a for loop and figuring out how to modify the whole list as you go. Functional programming takes a different approach by separating the actual modification (removing a null value, adding a new field) from the way you *apply* a transformation (by filtering or by mapping). We start by thinking at the level of an individual element and what we want to do to that element, which lets us figure out whether we need `filter`, `map`, `partition`, etc. This approach lets us focus on the “meat” of the problem.

Immutability and purity are what make this approach work so well. A job is never changed, so we can keep mucking about with it until we have what we want. A transformation does not rely on any state, so we can make sure it works in isolation before even thinking about collections and loops. Functional programming helps us focus on solving the problem at hand by isolating unrelated concerns so we can, so to speak, “reason” about it more easily.
