"I'm a huge proponent of designing your code around the data, rather than the other way around, and I think it's one of the reasons git has been fairly successful […] I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important."
- Linus Torvalds
"Smart data structures and dumb code works a lot better than the other way around."
- Eric S. Raymond, The Cathedral and The Bazaar
Learning about data structures and algorithms makes you a stonking good programmer.
Data structures and algorithms are patterns for solving problems. The more of them you have in your utility belt, the greater variety of problems you'll be able to solve. You'll also be able to come up with more elegant solutions to new problems than you would otherwise be able to.
You'll understand, in depth, how your computer gets things done. This informs any technical decisions you make, regardless of whether or not you're using a given algorithm directly. Everything from memory allocation in the depths of your operating system, to the inner workings of your RDBMS to how your networking stack manages to send data from one corner of Earth to another. All computers rely on fundamental data structures and algorithms, so understanding them better makes you understand the computer better.
Cultivate a broad and deep knowledge of algorithms and you'll have stock solutions to large classes of problems. Problem spaces that you had difficulty modelling before often slot neatly into well-worn data structures that elegantly handle the known use-cases. Dive deep into the implementation of even the most basic data structures and you'll start seeing applications for them in your day-to-day programming tasks.
You'll also be able to come up with novel solutions to the somewhat fruitier problems you're faced with. Data structures and algorithms have the habit of proving themselves useful in situations that they weren't originally intended for, and the only way you'll discover these on your own is by having a deep and intuitive knowledge of at least the basics.
But enough with the theory, have a look at some examples
Figuring out the fastest way to get somewhere
Let's say we're creating software to figure out the shortest distance from one international airport to another. Assume we're constrained to following routes:
Based on that graph of destinations and the distances between them, how can we find the shortest distance say, from Helsinki to London? Dijkstra's algorithm is the algorithm that will definitely get us the right answer in the shortest time.
In all likelihood, if you ever came across this problem and knew that Dijkstra's algorithm was the solution, you'd probably never have to implement it from scratch. Just knowing about it would point you to a library implementation that solves the problem for you.
If you did dive deep into the implementation, you'd be working through one of the most important graph algorithms we know of. You'd know that in practice it's a little resource intensive so an extension called A* is often used in it's place. It gets used everywhere from robot guidance to routing TCP packets to GPS pathfinding.
Figuring out the order to do things in
Let's say you're trying to model courses on a new Massive Open Online Courses platform (like Udemy or Khan Academy). Some of the courses depend on each other. For example, a user has to have taken Calculus before she's eligible for the course on Newtonian Mechanics. Courses can have multiple dependencies. Here's are some examples of what that might look like written out in YAML:
# Mapping from course name to requirements # # If you're a physcist or a mathematicisn and you're reading this, sincere # apologies for the completely made-up dependency tree :) courses: arithmetic:  algebra: [arithmetic] trigonometry: [algebra] calculus: [algebra, trigonometry] geometry: [algebra] mechanics: [calculus, trigonometry] atomic_physics: [mechanics, calculus] electromagnetism: [calculus, atomic_physics] radioactivity: [algebra, atomic_physics] astrophysics: [radioactivity, calculus] quantumn_mechanics: [atomic_physics, radioactivity, calculus]
Given those dependencies, as a user, I want to be able to pick any course and
have the system give me an ordered list of courses that I would have to take to
be eligible. So if I picked
calculus, I'd want the system to return the list:
arithmetic -> algebra -> trigonometry -> calculus
Two important constraints on this that may not be self-evident:
- At every stage in the course list, the dependencies of the next course must be met.
- We don't want any duplicate courses in the list.
This is an example of resolving dependencies and the algorithm we're looking
for to solve this problem is called topological sort (tsort). Tsort works on a
dependency graph like we've outlined in the YAML above. Here's what that would
look like in a graph (where each arrow means
What topological sort does is take a graph like the one above and find an
ordering in which all the dependencies are met at each stage. So if we took a
sub-graph that only contained
radioactivity and it's dependencies, then ran
tsort on it, we might get the following ordering:
arithmetic algebra trigonometry calculus mechanics atomic_physics radioactivity
This meets the requirements set out by the use case we described above. A user
just has to pick
radioactivity and they'll get an ordered list of all the
courses they have to work through before they're allowed to.
We don't even need to go into the details of how topological sort works before
we put it to good use. In all likelihood, your programming language of choice
probably has an implementation of it in the standard library. In the worst case
scenario, your Unix probably has the
tsort utility installed by default, run
man tsort and have a play with it.
Other places tsort get's used
- Tools like
makeallow you to declare task dependencies. Topological sort is used under the hood to figure out what order the tasks should be executed in.
- Any programming language that has a
requiredirective, indicating that the current file requires the code in a different file to be run first. Here topological sort can be used to figure out what order the files should be loaded in so that each is only loaded once and all dependencies are met.
- Project management tools with Gantt charts. A Gantt chart is a graph that outlines all the dependencies of a given task and gives you an estimate of when it will be complete based on those dependencies. I'm not a fan of Gantt charts, but it's highly likely that tsort will be used to draw them.
Squeezing data with Huffman coding
Huffman coding is an algorithm used for lossless data compression. It works by
analyzing the data you want to compress and creating a binary code for each
character. More frequently occurring characters get smaller codes, so
be encoded as
x might be
10010. The codes are created so that
they can be concatenated without a delimeter and still be decoded accurately.
Huffman coding is used along with LZ77 in the DEFLATE algorithm which is used by
gzip to compress things. gzip is used all over the place, in particular for
compressing files (typically anything with a
.gz extension) and for http
requests/responses in transit.
Knowing how to implement and use Huffman coding has a number of benefits:
- You'll know why a larger compression context results in better compression overall (e.g. the more you compress, the better the compression ratio). This is one of the proposed benefits of SPDY: that you get better compression on multiple HTTP requests/responses.
- If you ever find yourself trying to forcibly decipher encrypted information , you may realize that since repeating data compresses better, the compression ratio of a given bit of ciphertext will help you determine it's block cipher mode of operation.
Picking what to learn next is hard
As long as our dominant model of computation stays the same, data structures and algorithms that we use today will be used in some form or another in the future. You can safely spend time on gaining a deep and thorough knowledge of them and know that they will pay dividends for your entire career as a programmer.