
Introduction to Hadoop

sarva
April 07, 2016


Transcript

  1. What is Hadoop?
     • A framework for large-scale distributed data processing
     • Advantages
       ◦ Can handle petabytes of data
       ◦ Can scale to a cluster containing thousands of computers
       ◦ Allows a user to focus on the data processing logic
       ◦ Takes care of task distribution and fault tolerance
     • Two main components
       ◦ Hadoop Distributed File System
       ◦ MapReduce
  2. A Brief History of Hadoop
     • In 2003-2004, Google researchers published papers on the Google File System and MapReduce
     • Doug Cutting did an initial implementation of Hadoop based on the papers
     • Yahoo hired Cutting in 2006 and provided a team for development
     • In 2008, Yahoo announced that their search engine was using Hadoop running on a 10,000+ core Linux cluster
     • The word Hadoop is the name Cutting's son gave to a toy elephant
  3. Use Cases of Hadoop
     • Filtering
       ◦ Web search
       ◦ Product search
     • Classification
       ◦ Spam detection
       ◦ Fraud/anomaly detection
     • Recommendation engines
       ◦ Online retailers suggesting products
       ◦ Social networks suggesting people you may know
  4. Hadoop vs RDBMS
     RDBMS
     • Structured data
     • Optimized for queries and updates at arbitrary locations
     • Suitable for small data
     • Cannot scale to web-scale data
     Hadoop
     • Structured/unstructured data
     • Optimized for sequential reading and batch processing of data
     • Suitable for web-scale data
     • High latency for small data
  5. Hadoop vs MPI
     MPI
     • Suitable for long-running computations which involve small amounts of data
     • No fault tolerance provided by the framework
     • More flexible program structure
     Hadoop
     • Typically used for short computations on large amounts of data
     • Fault tolerance provided by default
     • Programs restricted to have a "MapReduce" structure
  6. Hadoop Distributed File System
     • A file is split into blocks of size 64 MB (the default)
     • Each block is replicated 3 times (the default)
     • Blocks are distributed randomly on the machines in the cluster
     • NameNode: the machine which holds the file-to-block mapping
     • DataNodes: the machines which store the blocks
     (a minimal sketch of reading a file through the HDFS Java API follows below)
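     Slide 6 is about architecture rather than code, but for orientation here is a minimal,
     hedged sketch of reading an HDFS file from Java with the FileSystem client API. The
     class name HdfsReadExample and the path "/user/example/file1.txt" are illustrative
     assumptions, not part of the deck; FileSystem, Path and FSDataInputStream are real
     Hadoop classes.

         import org.apache.hadoop.conf.Configuration;
         import org.apache.hadoop.fs.FSDataInputStream;
         import org.apache.hadoop.fs.FileSystem;
         import org.apache.hadoop.fs.Path;

         public class HdfsReadExample {
             public static void main(String[] args) throws Exception {
                 // Picks up cluster settings (NameNode address, block size, replication)
                 // from the Hadoop configuration files on the classpath
                 Configuration conf = new Configuration();
                 FileSystem fs = FileSystem.get(conf);

                 // Illustrative path only
                 Path path = new Path("/user/example/file1.txt");
                 FSDataInputStream in = fs.open(path);
                 int b;
                 while ((b = in.read()) != -1) {
                     System.out.write(b);
                 }
                 in.close();
             }
         }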
  7. MapReduce
     • Arbitrary programs cannot be run on Hadoop
     • Programs should conform to the MapReduce programming model
     • MapReduce programs transform lists of input data elements into lists of output data elements
     • The input and output lists are constrained to be lists of key-value pairs
       ◦ A key-value pair is an ordered pair (k, v) where k is the key and v is the value
       ◦ Example: [("Alice", 28), ("Bob", 35), ("Eve", 28)]
     • Two list-processing idioms are used
       ◦ Map
       ◦ Reduce
  8. Examples of Map
     • Square: [3, 6, 5, 9, 10] → [9, 36, 25, 81, 100]
     • isPrime: [3, 6, 5, 9, 10] → [True, False, True, False, False]
     • toUpper: ["This", "is", "a", "test"] → ["THIS", "IS", "A", "TEST"]
  9. Examples of Reduce
     • Summation: [3, 6, 5, 9, 10] → 33
     • Median: [3, 6, 5, 9, 10] → 6
     • Histogram: [4, 6, 1, 6, 4, 4, 1, 1] → [(1, 3), (4, 3), (6, 2)]
     (a plain-Java illustration of both idioms follows below)
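     The map and reduce idioms exist outside Hadoop too. A minimal sketch in plain Java
     (using java.util.stream, not the Hadoop API) showing the Square and Summation
     examples above; the class name MapReduceIdioms is made up for illustration.

         import java.util.Arrays;
         import java.util.List;
         import java.util.stream.Collectors;

         public class MapReduceIdioms {
             public static void main(String[] args) {
                 List<Integer> input = Arrays.asList(3, 6, 5, 9, 10);

                 // Map: apply a function to every element independently
                 List<Integer> squares = input.stream()
                         .map(x -> x * x)
                         .collect(Collectors.toList());
                 System.out.println(squares);   // [9, 36, 25, 81, 100]

                 // Reduce: combine all elements into a single result
                 int sum = input.stream().reduce(0, Integer::sum);
                 System.out.println(sum);       // 33
             }
         }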
  10. An Example Application: Word Count
     • Count how many times different words appear in a set of files
       ◦ Use case: spam filtering
     • Suppose we have two files
       ◦ file1.txt: Hello, this is the first file
       ◦ file2.txt: This is the second file
     • Expected output
         hello   1
         this    2
         is      2
         the     2
         first   1
         file    2
         second  1
  11. Word Count as MapReduce: Mapper
     • Mapper pseudocode
         mapper (file-contents):
             for each word in file-contents:
                 emit (word, 1)
     • file1.txt: Hello, this is the first file
       ◦ Mapper output: (hello, 1) (this, 1) (is, 1) (the, 1) (first, 1) (file, 1)
     • file2.txt: This is the second file
       ◦ Mapper output: (this, 1) (is, 1) (the, 1) (second, 1) (file, 1)
  12. Word Count as MapReduce: Reducer
     • Hadoop groups values with the same key
     • Reducer pseudocode
         reducer (word, values):
             sum = 0
             for each value in values:
                 sum = sum + value
             emit (word, sum)
     • Output of mapper stage: (hello, 1) (this, 1) (is, 1) (the, 1) (first, 1) (file, 1) (this, 1) (is, 1) (the, 1) (second, 1) (file, 1)
     • Input to reducer stage: (hello, [1]) (this, [1, 1]) (is, [1, 1]) (the, [1, 1]) (first, [1]) (file, [1, 1]) (second, [1])
     • Output of reducer stage: (hello, 1) (this, 2) (is, 2) (the, 2) (first, 1) (file, 2) (second, 1)
  13. MapReduce Data Flow
     • Several mapper processes are created, each processing file blocks on a node
     • Intermediate (key, value) pairs are exchanged to send all values with the same key to a single reducer
     • Each reducer generates an output file
     • The reducer outputs can be fed to a second MapReduce job for further processing
  14. Combiner Function
     • Suppose a file has 1000 occurrences of the word "is"
     • 1000 key-value pairs equal to (is, 1) will be created and sent to the reducer for "is"
     • A more efficient method is to just send (is, 1000)
     • This node-local processing is done by the Combiner (registration is sketched below)
     • In this case, the Combiner has the same implementation as the Reducer
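     With the old org.apache.hadoop.mapred API used later in this deck, enabling the
     Combiner for word count is one extra line in the driver. A minimal sketch, reusing
     the Reduce class defined on the reducer slides:

         // Reuse the word-count Reducer as a Combiner: summing counts is associative
         // and commutative, so pre-aggregating per node does not change the result
         conf.setCombinerClass(Reduce.class);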
  15. Partitioner Function
     • The default implementation distributes keys among the reducers by hashing:
         ReducerIndex = Hash(key) % NumReducers
     • In WordCount, suppose we want all keys starting with the same letter to go to the same reducer
     • A custom Partitioner can achieve this (a sketch follows below):
         ReducerIndex = Hash(FirstLetterOfKey) % NumReducers
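     A minimal sketch of such a Partitioner with the old mapred API. The class name
     FirstLetterPartitioner is made up for illustration; it would be registered in the
     driver with conf.setPartitionerClass(FirstLetterPartitioner.class).

         import org.apache.hadoop.io.IntWritable;
         import org.apache.hadoop.io.Text;
         import org.apache.hadoop.mapred.JobConf;
         import org.apache.hadoop.mapred.Partitioner;

         public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

             public void configure(JobConf job) {
                 // no job-specific configuration is needed here
             }

             public int getPartition(Text key, IntWritable value, int numReducers) {
                 // Send every word starting with the same letter to the same reducer
                 String word = key.toString();
                 char firstLetter = word.isEmpty() ? '\0' : Character.toLowerCase(word.charAt(0));
                 return firstLetter % numReducers;   // a char is always non-negative
             }
         }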
  16. WordCount Mapper in Java

         public static class MapClass extends MapReduceBase
                 implements Mapper<LongWritable, Text, Text, IntWritable> {

             private final static IntWritable one = new IntWritable(1);
             private Text word = new Text();

             public void map(LongWritable key, Text value,
                             OutputCollector<Text, IntWritable> output,
                             Reporter reporter) throws IOException {
                 String line = value.toString();
                 StringTokenizer itr = new StringTokenizer(line);
                 while (itr.hasMoreTokens()) {
                     word.set(itr.nextToken());
                     output.collect(word, one);
                 }
             }
         }
  17. WordCount Mapper in Java (continued; same code as slide 16)
     • MapClass is the user-defined class implementing the Mapper interface
     • The user needs to implement the map function
  18. WordCount Mapper in Java (continued)
     • By default, Hadoop assumes that the input is a text file where each line needs to be processed independently
     • The input key to the Mapper is LongWritable: the byte offset of a line in the file
     • The input value is Text: the contents of the line
  19. WordCount Mapper in Java (continued)
     • The output key of the Mapper is of type Text
     • The output value is IntWritable
  20. WordCount Mapper in Java (continued)
     • Create a variable to hold the constant one
     • Split a line into words
     • For each word, emit the key-value pair (word, 1)
  21. WordCount Reducer in Java

         public static class Reduce extends MapReduceBase
                 implements Reducer<Text, IntWritable, Text, IntWritable> {

             public void reduce(Text key, Iterator<IntWritable> values,
                                OutputCollector<Text, IntWritable> output,
                                Reporter reporter) throws IOException {
                 int sum = 0;
                 while (values.hasNext()) {
                     sum += values.next().get();
                 }
                 output.collect(key, new IntWritable(sum));
             }
         }
  22. WordCount Reducer in Java (continued; same code as slide 21)
     • Reduce is the user-defined class implementing the Reducer interface
     • The user needs to implement the reduce function
  23. WordCount Reducer in Java (continued)
     • The input key to the Reducer is Text: the word emitted by the mapper
     • The input value is a list of IntWritable values: the list of 1s
  24. WordCount Reducer in Java (continued)
     • The output key of the Reducer is of type Text
     • The output value is IntWritable
  25. WordCount Reducer in Java (continued)
     • Initialize sum to zero
     • Add all the 1s in the values list
     • Emit the key-value pair (word, sum)
  26. WordCount Driver

         public void run(String inputPath, String outputPath) throws Exception {
             JobConf conf = new JobConf(WordCount.class);
             conf.setJobName("wordcount");

             conf.setOutputKeyClass(Text.class);
             conf.setOutputValueClass(IntWritable.class);

             conf.setMapperClass(MapClass.class);
             conf.setReducerClass(Reduce.class);

             FileInputFormat.addInputPath(conf, new Path(inputPath));
             FileOutputFormat.setOutputPath(conf, new Path(outputPath));

             JobClient.runJob(conf);
         }
  27. WordCount Driver (continued; same code as slide 26)
     • Initialize the JobConf object
     • Give it a name
  28. WordCount Driver (continued)
     • Set the Reducer output key type to Text
     • Set the Reducer output value type to IntWritable
     • The Mapper input key-value types are assumed to be the default: (LongWritable, Text)
  29. WordCount Driver (continued)
     • The classes which implement the map and reduce functions are specified
  30. WordCount Driver (continued)
     • The locations of the input files and output files are specified
     • The job is then executed (a possible main method is sketched below)
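     For completeness, a minimal main method that could invoke this driver. The enclosing
     WordCount class and the argument handling are assumptions added here, not part of the
     original slides.

         public static void main(String[] args) throws Exception {
             if (args.length != 2) {
                 System.err.println("Usage: WordCount <input path> <output path>");
                 System.exit(1);
             }
             // run() is the driver method shown on slide 26
             new WordCount().run(args[0], args[1]);
         }

     The compiled classes would then typically be packaged into a jar and submitted to the
     cluster with the hadoop jar command.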
  31. Fault Tolerance
     • In large clusters, individual nodes or network components may fail
     • Hadoop achieves fault tolerance by restarting tasks
     • A MapReduce job is monitored by a JobTracker
     • Each map task and reduce task is assigned to a TaskTracker
     • If a TaskTracker fails to ping the JobTracker for one minute, it is assumed to have crashed
     • Other TaskTrackers will re-execute the tasks assigned to the failed TaskTracker
  32. MapReduce Design Patterns
     • Summarization
       ◦ Numerical summarization
       ◦ Inverted index
     • Filtering
       ◦ Top 10 lists
     • Data organization
       ◦ Sorting
  33. Numerical Summarizations
     • Minimum, maximum, average, median, standard deviation
     • Suppose we have bank transaction data in the following format
     • We want to find the maximum, minimum and average transaction amount for each of the PAN card numbers

         Date         PAN card number   Amount
         21/11/2015   ABCDE1234F          80,000
         30/11/2015   PQRST4567U        1,20,000
         01/12/2015   ABCDE1234F          25,000
         23/01/2016   GHIJK2345L        1,00,000
  34. MapReduce Structure
     • Input to the mapper will be a list of (LongWritable, Text) pairs
     • For each input pair, the mapper outputs a (PAN card, Amount) pair
     • All mapper outputs with the same PAN card will arrive at a single reducer
     • Input to the reducer will be (PAN card, [Amount1, Amount2, ..., AmountN])
     • Maximum and minimum are the largest and smallest amounts in the list
     • For the average, divide the sum of amounts by the number of amounts (a reducer sketch follows below)
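     A minimal reducer sketch for this summarization with the old mapred API. It assumes
     the mapper emits (Text PAN card, DoubleWritable amount) pairs, and packing min, max
     and average into one tab-separated Text value is an illustrative choice, not something
     the deck prescribes.

         import java.io.IOException;
         import java.util.Iterator;
         import org.apache.hadoop.io.DoubleWritable;
         import org.apache.hadoop.io.Text;
         import org.apache.hadoop.mapred.MapReduceBase;
         import org.apache.hadoop.mapred.OutputCollector;
         import org.apache.hadoop.mapred.Reducer;
         import org.apache.hadoop.mapred.Reporter;

         public class SummaryReducer extends MapReduceBase
                 implements Reducer<Text, DoubleWritable, Text, Text> {

             public void reduce(Text panCard, Iterator<DoubleWritable> amounts,
                                OutputCollector<Text, Text> output,
                                Reporter reporter) throws IOException {
                 double min = Double.POSITIVE_INFINITY;
                 double max = Double.NEGATIVE_INFINITY;
                 double sum = 0;
                 long count = 0;

                 // One pass over all amounts for this PAN card
                 while (amounts.hasNext()) {
                     double amount = amounts.next().get();
                     min = Math.min(min, amount);
                     max = Math.max(max, amount);
                     sum += amount;
                     count++;
                 }

                 // Emit "min <TAB> max <TAB> average" as a single Text value
                 output.collect(panCard, new Text(min + "\t" + max + "\t" + (sum / count)));
             }
         }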
  35. Inverted Index
     • Faster web search
       ◦ Each web page contains a list of words
       ◦ An inverted index is a list of web pages which contain a particular word
     • Citations
       ◦ Every patent may cite some other past patents
       ◦ An inverted index is a list of patents which cite a particular patent
  36. Vehicle Tracking
     • Suppose streets in Delhi are equipped with CCTVs which can perform number plate recognition
     • Each camera sends a list of vehicles it has seen
     • Suppose we want to know which areas a vehicle visited on a particular day

         Date         CCTV ID   Vehicle Number
         21/11/2015   123       DL9C 1234
         30/11/2015   101       DL5A 7890
         01/12/2015   123       DL8B 5555
         23/01/2016   155       DL9C 1234
  37. MapReduce Structure
     • Input to the mapper will be a list of (LongWritable, Text) pairs
     • For each input pair, the mapper outputs a (Vehicle Number, CCTV ID) pair if the vehicle was seen on the day of interest
     • All mapper outputs with the same Vehicle Number will arrive at a single reducer
     • Input to the reducer will be (Vehicle Number, [CCTV ID1, CCTV ID2, ..., CCTV IDn])
     • The reducer removes any duplicates in the CCTV ID list (a reducer sketch follows below)
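     A hedged reducer sketch for the duplicate-removal step. Treating both the vehicle
     number and the CCTV ID as Text, and emitting the distinct IDs as a comma-separated
     string, are illustrative assumptions.

         import java.io.IOException;
         import java.util.Iterator;
         import java.util.Set;
         import java.util.TreeSet;
         import org.apache.hadoop.io.Text;
         import org.apache.hadoop.mapred.MapReduceBase;
         import org.apache.hadoop.mapred.OutputCollector;
         import org.apache.hadoop.mapred.Reducer;
         import org.apache.hadoop.mapred.Reporter;

         public class VehicleTrackReducer extends MapReduceBase
                 implements Reducer<Text, Text, Text, Text> {

             public void reduce(Text vehicleNumber, Iterator<Text> cctvIds,
                                OutputCollector<Text, Text> output,
                                Reporter reporter) throws IOException {
                 // A set keeps only the distinct CCTV IDs that saw this vehicle
                 Set<String> distinctIds = new TreeSet<String>();
                 while (cctvIds.hasNext()) {
                     distinctIds.add(cctvIds.next().toString());
                 }
                 output.collect(vehicleNumber, new Text(String.join(",", distinctIds)));
             }
         }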
  38. Top 10 List
     • Given bank transaction data, suppose we want to find the 10 largest transactions

         Date         PAN card number   Amount
         21/11/2015   ABCDE1234F          80,000
         30/11/2015   PQRST4567U        1,20,000
         01/12/2015   ABCDE1234F          25,000
         23/01/2016   GHIJK2345L        1,00,000
  39. Top 10 List: Data Flow
     [Diagram: each input split feeds a Top Ten Mapper, which emits a local top 10; a single Top Ten Reducer combines the local lists into the final top 10 output.]
  40. MapReduce Structure
     • Set the number of reducers to one
     • Input to the mapper will be a list of (LongWritable, Text) pairs
     • Each mapper outputs ten (NULL, (Amount, PAN Card, Date)) pairs corresponding to the ten largest amounts it has seen (a mapper sketch follows below)
     • All mapper outputs will arrive at the single reducer
     • The input to the reducer will be (NULL, [(A1, PC1, D1), (A2, PC2, D2), …, (A10M, PC10M, D10M)]), where M is the number of mappers
     • The reducer computes the top ten transactions
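     A hedged sketch of the Top Ten Mapper with the old mapred API. NullWritable stands in
     for the NULL key mentioned on the slide; the tab-separated record layout and the trick
     of saving the OutputCollector so the local top ten can be emitted in close() are
     assumptions added here for illustration.

         import java.io.IOException;
         import java.util.TreeMap;
         import org.apache.hadoop.io.LongWritable;
         import org.apache.hadoop.io.NullWritable;
         import org.apache.hadoop.io.Text;
         import org.apache.hadoop.mapred.MapReduceBase;
         import org.apache.hadoop.mapred.Mapper;
         import org.apache.hadoop.mapred.OutputCollector;
         import org.apache.hadoop.mapred.Reporter;

         public class TopTenMapper extends MapReduceBase
                 implements Mapper<LongWritable, Text, NullWritable, Text> {

             // The ten records with the largest amounts seen by this mapper so far
             // (records with equal amounts overwrite each other in this simple sketch)
             private TreeMap<Long, Text> topTen = new TreeMap<Long, Text>();
             private OutputCollector<NullWritable, Text> collector;

             public void map(LongWritable key, Text value,
                             OutputCollector<NullWritable, Text> output,
                             Reporter reporter) throws IOException {
                 collector = output;   // saved so that close() can emit the local top ten
                 // Assumed record layout: date <TAB> PAN card <TAB> amount
                 String[] fields = value.toString().split("\t");
                 long amount = Long.parseLong(fields[2].replace(",", ""));
                 topTen.put(amount, new Text(value));
                 if (topTen.size() > 10) {
                     topTen.remove(topTen.firstKey());   // drop the smallest of the eleven
                 }
             }

             public void close() throws IOException {
                 // Emit the local top ten so that all of them reach the single reducer
                 for (Text record : topTen.values()) {
                     collector.collect(NullWritable.get(), record);
                 }
             }
         }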
  41. Sorting
     • Given bank transaction data, suppose we want to sort all transactions in ascending order of amounts

         Date         PAN card number   Amount
         21/11/2015   ABCDE1234F          80,000
         30/11/2015   PQRST4567U        1,20,000
         01/12/2015   ABCDE1234F          25,000
         23/01/2016   GHIJK2345L        1,00,000
  42. MapReduce Structure
     • Input to the mapper will be a list of (LongWritable, Text) pairs
     • Suppose we set the mapper output to an (Amount, (PAN Card, Date)) pair
     • All mapper outputs corresponding to the same amount will arrive at a single reducer, and Hadoop presents keys to each reducer in sorted order
     • But the default partitioner in Hadoop does not guarantee that amounts which are near each other will arrive at the same reducer
     • We need a custom partitioner
  43. Custom Partitioner for Sorting
     [Diagram: each input split feeds a Mapper; a Custom Partitioner routes mapper output by amount range (0 to 1L, 1L to 5L, 5L and above) to three Reducers, each of which writes sorted output for its range.]
     (a partitioner sketch follows below)
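     A hedged sketch of such a range partitioner. Using LongWritable for the Amount key,
     exactly three reducers, and the cut-offs 1,00,000 and 5,00,000 from the diagram are
     assumptions; with real data the ranges would be chosen so the reducers get roughly
     equal load. It would be registered in the driver with
     conf.setPartitionerClass(AmountRangePartitioner.class) and conf.setNumReduceTasks(3).

         import org.apache.hadoop.io.LongWritable;
         import org.apache.hadoop.io.Text;
         import org.apache.hadoop.mapred.JobConf;
         import org.apache.hadoop.mapred.Partitioner;

         public class AmountRangePartitioner implements Partitioner<LongWritable, Text> {

             public void configure(JobConf job) {
                 // no job-specific configuration is needed here
             }

             public int getPartition(LongWritable amount, Text panCardAndDate, int numReducers) {
                 // Assumes numReducers == 3, matching the three ranges in the diagram
                 long a = amount.get();
                 if (a < 100000L) return 0;   // 0 to 1L
                 if (a < 500000L) return 1;   // 1L to 5L
                 return 2;                    // 5L and above
             }
         }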
  44. MapReduce Structure
     • Input to the mapper will be a list of (LongWritable, Text) pairs
     • Mapper outputs are (Amount, (PAN Card, Date)) pairs
     • Mapper outputs with amounts in the same range will arrive at the same reducer
     • The input to each reducer will be sorted by amount:
         (Amount1, [(PAN Card1, Date1), (PAN Card2, Date2), …])
         (Amount2, [(PAN Card3, Date3), (PAN Card4, Date4), …])
         …
         (AmountN, [(PAN Card5, Date5), (PAN Card6, Date6), …])
     • Each reducer will output all the transactions it receives
     • The outputs of all reducers can be concatenated to get the sorted data
  45. Reasons for Hadoop's Popularity
     • Ease of use
       ◦ Hadoop takes care of the challenges of distributed computing
       ◦ The user can focus on the data processing logic
       ◦ The same program can be executed on a 10-machine cluster or a 1000-machine cluster
     • MapReduce is flexible
       ◦ A large class of problems can be expressed as MapReduce computations
     • Scale-out
       ◦ Can scale to clusters having thousands of machines
       ◦ Can handle web-scale data
  46. Learning Resources
     • Introduction to Hadoop and MapReduce, MOOC from Udacity, https://www.udacity.com/courses/ud617
     • Hadoop Tutorial from Yahoo!, https://developer.yahoo.com/hadoop/tutorial/
     • Hadoop: The Definitive Guide, Tom White, O'Reilly Media, 2012
     • Hadoop in Action, Chuck Lam, Manning Publications, 2010
     • MapReduce Design Patterns, Donald Miner and Adam Shook, O'Reilly Media, 2012
  47. Attribution
     Some figures were taken from the "Hadoop Tutorial from Yahoo!" by Yahoo! Inc., which is licensed under a Creative Commons Attribution 3.0 Unported License. No changes were made to the figures. https://creativecommons.org/licenses/by/3.0/