Chapter 10. Programming with Pig

One frequent complaint about MapReduce is that it's difficult to program. When you first think through a data processing task, you may think about it in terms of data flow operations, such as loops and filters. But when you implement the program in MapReduce, you have to think at the level of mapper and reducer functions and job chaining. Certain functions that are treated as first-class operations in higher-level languages become nontrivial to implement in MapReduce, as we saw with joins in chapter 5. Pig is a Hadoop extension that simplifies Hadoop programming by giving you a high-level data processing language while keeping Hadoop's simple scalability and reliability. Yahoo, one of the heaviest users of Hadoop (and a backer of both the Hadoop Core and Pig), runs 40 percent of all its Hadoop jobs with Pig. Twitter is another well-known user of Pig. [1]

Pig has two major components:

  1. A high-level data processing language called Pig Latin.
  2. A compiler that compiles your Pig Latin script and runs it on one of several evaluation mechanisms. The main evaluation mechanism is Hadoop. Pig also supports a local mode for development purposes.
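
To give a first feel for how the two components fit together, here is a minimal sketch of a Pig Latin script. The input file name (access_log.txt), the field names, and the 1000-byte threshold are hypothetical and chosen only for illustration. The script assumes a tab-delimited text file, which is what LOAD expects by default.

```pig
-- Hypothetical input: one tab-delimited record per line (user, bytes).
records = LOAD 'access_log.txt' AS (user:chararray, bytes:long);

-- Keep only the larger transfers; the threshold is an arbitrary example.
large = FILTER records BY bytes > 1000;

-- Group the remaining records by user and total the bytes for each group.
by_user = GROUP large BY user;
totals = FOREACH by_user GENERATE group AS user, SUM(large.bytes) AS total_bytes;

-- Print the result to the console.
DUMP totals;
```

Each statement names a data flow step (load, filter, group, aggregate) rather than a mapper or a reducer. If the script is saved as, say, totals.pig, it could be run in local mode during development with `pig -x local totals.pig`. Without the `-x local` option, Pig runs the script on Hadoop.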