Dumbo is a project that allows you to easily write and run Hadoop programs in Python (it’s named after Disney’s flying circus elephant, since the logo of Hadoop is an elephant and Python was named after the BBC series “Monty Python’s Flying Circus”). More generally, Dumbo can be considered a convenient Python API for writing MapReduce programs.
def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)
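To make the data flow of the word count above concrete without a Hadoop cluster, the following minimal sketch simulates the map, shuffle, and reduce phases in plain Python. The `run_local` helper and the sample input are assumptions made for illustration; they are not part of Dumbo, and the integer keys merely stand in for whatever key the input format supplies.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in value.split():
        yield word, 1


def reducer(key, values):
    # Sum the counts collected for one word.
    yield key, sum(values)


def run_local(records, mapper, reducer):
    """Hypothetical local stand-in for one MapReduce run (not part of Dumbo)."""
    # Map phase: apply the mapper to every (key, value) input record.
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    # Shuffle phase: sort by key so that equal keys become adjacent.
    mapped.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        # Reduce phase: hand each key an iterator over its values.
        output.extend(reducer(key, (value for _, value in group)))
    return output


# Illustrative input: (key, line) pairs with made-up keys.
lines = [(0, "the quick brown fox"), (20, "jumps over the lazy dog the end")]
counts = dict(run_local(lines, mapper, reducer))
```

Because the reducer only sums its inputs, the same function can also serve as the combiner, which is why the program above passes `combiner=reducer`.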
Defining features
- Easy
- Dumbo strives to be as Pythonic as possible: MapReduce programs that use it are easy on the eyes for people who read them and easy on the fingers for those who write them. Dumbo also provides enough boilerplate functionality and additional features to give direct use of Hadoop Streaming a run for its money. For instance, once you have written a job consisting of multiple MapReduce iterations with Dumbo, you will never again consider writing one with traditional Streaming.
- Efficient
- Dumbo programs communicate with Hadoop very efficiently by relying on typed bytes, a nifty serialisation mechanism that was added to Hadoop specifically with Dumbo in mind. Moreover, Dumbo makes it very easy to write resource-intensive parts of your jobs natively in Java to squeeze out the last few drops of performance.
- Flexible
- Although it tries very hard to be as simple as possible to use, Dumbo never stands in your way. Nothing prevents you from doing the lower-level things required to, e.g., read or write custom input formats (be they binary or text-based), use a specific partitioning scheme, or implement a tricky secondary sort. Effectively, there is nothing you can do with native Hadoop programs in Java that cannot be done in Dumbo programs, since you can always add some Java code when needed thanks to Dumbo's heavily streamlined Java integration.
- Mature
- Dumbo was the first Python API to be built on top of Hadoop and has been used in production by several different people at various companies for years now. It's a proven technology that won't be going away anytime soon and has been made to run in many different environments, including Amazon Elastic MapReduce.
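The "Easy" item above mentions jobs made of multiple MapReduce iterations. As a rough illustration of what such a chain computes, the sketch below feeds the output of a word-count iteration into a second iteration that groups words by their count. It is a plain-Python simulation under stated assumptions: `run_iteration`, `run_chain`, and the sample data are hypothetical helpers for this example, not Dumbo's API. Consult the Dumbo documentation for how iterations are actually registered on a job.

```python
from itertools import groupby
from operator import itemgetter


def count_mapper(key, value):
    # Iteration 1, map: emit (word, 1) for each word in the line.
    for word in value.split():
        yield word, 1


def count_reducer(word, ones):
    # Iteration 1, reduce: total occurrences per word.
    yield word, sum(ones)


def invert_mapper(word, count):
    # Iteration 2, map: re-key each record by its count.
    yield count, word


def collect_reducer(count, words):
    # Iteration 2, reduce: list all words sharing the same count.
    yield count, sorted(words)


def run_iteration(records, mapper, reducer):
    """Hypothetical local stand-in for one MapReduce iteration."""
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    mapped.sort(key=itemgetter(0))  # simulate the shuffle/sort step
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


def run_chain(records, iterations):
    # The output records of each iteration become the input of the next.
    for mapper, reducer in iterations:
        records = run_iteration(records, mapper, reducer)
    return records


# Illustrative input: (key, line) pairs with made-up keys.
lines = [(0, "a b a"), (6, "c b a")]
result = dict(run_chain(lines, [(count_mapper, count_reducer),
                                (invert_mapper, collect_reducer)]))
```

The point of the chain is that the second mapper receives the `(word, count)` pairs produced by the first reducer directly, with no manual parsing of intermediate text in between.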
Documentation
Development