
Dumbo

Dumbo is a project that allows you to easily write and run Hadoop programs in Python (it’s named after Disney’s flying circus elephant, since the logo of Hadoop is an elephant and Python was named after the BBC series “Monty Python’s Flying Circus”). More generally, Dumbo can be considered a convenient Python API for writing MapReduce programs.

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)
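
To see what the program above computes without a Hadoop cluster, the sketch below simulates the map, combine, and reduce phases in plain Python. The helpers `run_phase` and `run_local` are hypothetical names introduced for this illustration and are not part of Dumbo; the line index stands in for the key that Hadoop would supply for each input line. It also suggests why the reducer can double as the combiner here: summing partial sums gives the same totals as summing the raw counts.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    # Emit (word, 1) for every word on the input line.
    for word in value.split():
        yield word, 1


def reducer(key, values):
    # Sum all counts seen for one word.
    yield key, sum(values)


def run_phase(func, pairs):
    """Hypothetical helper: group (key, value) pairs by key and apply func."""
    out = []
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        out.extend(func(key, (value for _, value in group)))
    return out


def run_local(splits, mapper, reducer, combiner=None):
    """Hypothetical local stand-in for a job run; not Dumbo's implementation."""
    shuffled = []
    for split in splits:
        # The line index is an assumed stand-in for the real input key.
        mapped = [pair
                  for index, line in enumerate(split)
                  for pair in mapper(index, line)]
        if combiner is not None:
            # The combiner pre-aggregates each split's map output.
            mapped = run_phase(combiner, mapped)
        shuffled.extend(mapped)
    return run_phase(reducer, shuffled)


splits = [["a b a", "b c"], ["a c c"]]
with_combiner = run_local(splits, mapper, reducer, combiner=reducer)
without_combiner = run_local(splits, mapper, reducer)
print(with_combiner)
```

Both calls produce the same counts; the combiner only reduces the number of pairs that reach the final reduce phase.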

Defining features

Easy
Dumbo strives to be as Pythonic as possible: MapReduce programs that use it are easy on the eyes for people who read them and easy on the fingers for those who write them. Dumbo also provides enough boilerplate functionality and additional features to give direct use of Hadoop Streaming a run for its money. For instance, once you have written a job consisting of multiple MapReduce iterations with Dumbo, you are unlikely to want to do it with traditional Streaming again.
Efficient
Dumbo programs communicate with Hadoop efficiently by relying on typed bytes, a nifty serialisation mechanism that was added to Hadoop specifically with Dumbo in mind. Moreover, Dumbo makes it easy to write the resource-intensive parts of your jobs natively in Java to squeeze out the last few drops of performance.
Flexible
Although it tries very hard to be as simple as possible to use, Dumbo never stands in your way. Nothing prevents you from doing the lower-level things required to, e.g., read or write custom input formats (be they binary or text-based), use a specific partitioning scheme, or implement a tricky secondary sort. Effectively, there is nothing you can do with native Hadoop programs in Java that cannot be done in Dumbo programs, since you can always add some Java code when needed thanks to Dumbo's heavily streamlined Java integration.
Mature
Dumbo was the first Python API to be built on top of Hadoop and has been used in production by several different people at various companies for years now. It's a proven technology that won't be going away anytime soon and has been made to run in many different environments, including Amazon Elastic MapReduce.
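
As an illustration of the multi-iteration jobs mentioned under "Easy", here is a minimal sketch of a two-step job: the first iteration counts words, the second groups words by their count. It assumes the `runner(job)` / `job.additer(...)` / `dumbo.main(runner)` interface described in Dumbo's documentation; the function names are made up for this example, and the Dumbo entry point is left as a comment so the functions can be read and tried without Dumbo installed.

```python
def count_mapper(key, value):
    # Iteration 1: emit (word, 1) for every word on the line.
    for word in value.split():
        yield word, 1


def count_reducer(key, values):
    # Iteration 1: total count per word.
    yield key, sum(values)


def invert_mapper(word, count):
    # Iteration 2: swap key and value so that words are grouped by count.
    yield count, word


def collect_reducer(count, words):
    # Iteration 2: list all words that share the same count.
    yield count, sorted(words)


def runner(job):
    # Assumed Dumbo interface: each additer call appends one MapReduce
    # iteration, and the output of one iteration feeds the next.
    job.additer(count_mapper, count_reducer, combiner=count_reducer)
    job.additer(invert_mapper, collect_reducer)


# With Dumbo installed, the script would end with (assumption, see above):
#
#     if __name__ == "__main__":
#         import dumbo
#         dumbo.main(runner)
```

Each iteration is an ordinary mapper/reducer pair, so the second step needs no glue code beyond one extra `additer` call.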

Documentation

Development