log.nikhil.io

Don’t Use Hadoop - Your Data Isn’t That Big

A still-very-relevant 9-year old article. Pandas has gone from strength to strength since he wrote that.

In terms of expressing your computations, Hadoop is strictly inferior to SQL. There is no computation you can write in Hadoop which you cannot write more easily in either SQL, or with a simple Python script that scans your files.

SQL is a straightforward query language with minimal leakage of abstractions, commonly used by business analysts as well as programmers. Queries in SQL are generally pretty simple. They are also usually very fast - if your database is properly indexed, multi-second queries will be uncommon.

Hadoop does not have any conception of indexing. Hadoop has only full table scans. Hadoop is full of leaky abstractions - at my last job I spent more time fighting with java memory errors, file fragmentation and cluster contention than I spent actually worrying about the mostly straightforward analysis I wanted to perform.

If your data is not structured like a SQL table (e.g., plain text, json blobs, binary blobs), it’s generally speaking straightforward to write a small python or ruby script to process each row of your data. Store it in files, process each file, and move on. Under circumstances where SQL is a poor fit, Hadoop will be less annoying from a programming perspective. But it still provides no advantage over simply writing a Python script to read your data, process it, and dump it to disk.

In addition to being more difficult to code for, Hadoop will also nearly always be slower than the simpler alternatives. SQL queries can be made very fast by the judicious use of indexes - to compute a join, PostgreSQL will simply look at an index (if present) and look up the exact key that is needed. Hadoop requires a full table scan, followed by re-sorting the entire table. The sorting can be made faster by sharding across multiple machines, but on the other hand you are still required to stream data across multiple machines. In the case of processing binary blobs, Hadoop will require repeated trips to the namenode in order to find and process data. A simple python script will require repeated trips to the filesystem.