Friday, June 27, 2008

Hadoop and Table Joins !!

In continuation of my post yesterday, I want to talk about doing table joins on Hadoop. Hadoop's current model makes it awkward to join different tables/datasets.

The key problem is that both the mapper and the reducer take in only a single key/value pair at a time. A simple workflow would look like this.



The mapper input is text files. For Hadoop to do an intelligent join, the mapper needs to distinguish between two different files just by looking at the data. I have been using a hack: I tell my data types apart by the number of columns in my CSV data files. It breaks when I have to join two tables with the same number of columns and types.
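To make the hack concrete, here is a minimal sketch in plain Python (not Hadoop code). The table names, column counts and the `guess_table` helper are all made up for illustration. It guesses the source table from the column count and gives up as soon as two tables share the same count.

```python
import csv
import io

# Hypothetical layouts for illustration: table name -> number of CSV columns.
LAYOUTS = {"users": 3, "orders": 4}

def guess_table(line, layouts=LAYOUTS):
    """Guess which table a CSV line came from using only its column count."""
    fields = next(csv.reader(io.StringIO(line)))
    matches = [name for name, ncols in layouts.items() if ncols == len(fields)]
    if len(matches) != 1:
        # Zero matches: unknown layout. Several matches: the hack is ambiguous.
        raise ValueError(
            f"cannot tell source table from {len(fields)} columns: {matches}"
        )
    return matches[0], fields

print(guess_table("42,alice,NYC"))      # three columns, only "users" fits
print(guess_table("7,42,book,12.50"))   # four columns, only "orders" fits
```

Calling `guess_table("42,alice,NYC", {"users": 3, "cities": 3})` raises `ValueError`, which is exactly the case where the trick stops working.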

Can Hadoop support multiple Mapper classes?

If we can make Hadoop support multiple mappers, I can define one class to handle one type of data and another class to handle the other kind. As long as they churn out the same 'key' type for the intermediate output, which is the input to the sorter, we should be good.
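Here is a rough sketch of that idea, simulated in plain Python rather than written against the Hadoop API. Everything in it (`map_users`, `map_orders`, `shuffle`, `reduce_join`, `run_join` and the sample data) is hypothetical. Each input gets its own mapper, both mappers emit the same key type (the user id) and tag the value with its source, and the reducer pairs up the tagged values. Note that both inputs have two columns, so the column-count trick would not help here.

```python
from collections import defaultdict

def map_users(line):
    """Mapper for the (hypothetical) users file: user_id,name."""
    user_id, name = line.split(",")
    yield user_id, ("users", name)

def map_orders(line):
    """Mapper for the (hypothetical) orders file: user_id,item."""
    user_id, item = line.split(",")
    yield user_id, ("orders", item)

def shuffle(pairs):
    """Stand-in for the sort/shuffle step: group values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_join(key, values):
    """Inner join: pair every users value with every orders value for a key."""
    names = [v for tag, v in values if tag == "users"]
    items = [v for tag, v in values if tag == "orders"]
    for name in names:
        for item in items:
            yield key, name, item

def run_join(inputs):
    """Run each input through its own mapper, then shuffle and reduce."""
    pairs = []
    for mapper, lines in inputs:
        for line in lines:
            pairs.extend(mapper(line))
    joined = []
    for key, values in shuffle(pairs):
        joined.extend(reduce_join(key, values))
    return joined

users = ["1,alice", "2,bob"]
orders = ["1,book", "1,pen", "3,ink"]
result = run_join([(map_users, users), (map_orders, orders)])
print(result)
```

Only user 1 appears in both inputs, so the result holds the two rows for alice. Bob has no orders and the order for user 3 has no matching user, so both drop out of the inner join.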

I will try to raise this in the Hadoop forums! What do folks think about this?
