I’ve just finished reading a review copy of Talend for Big Data, courtesy of Packt Publishing. I’ve been using Talend for ETL and automation tasks for some years, and I wanted to start using it to feed data into a small Hadoop cluster we have, so I can easily put myself in this book’s readers’ shoes.
Book structure: a journey in Big Data
I enjoyed that the book follows a real use case of sentiment analysis on Twitter data: I was getting tired of the word-counting / term-extraction examples found in other Hadoop texts.
Although the book doesn’t describe in depth how to get the data from the Twitter API using a Talend component (there are many available for this task), I think the information given is enough to follow the steps in the book. Keep in mind the use case is an excuse to work with Talend and big data.
The structure is very straightforward, and it closely resembles a real-world Big Data integration job:
- The basics: what’s Talend, what’s Hadoop, and how to get started (terminology and setup)
- How to get data into a Hadoop cluster (there’s a component for that: tHDFSOutput)
- Working with tables in Talend using Hive.
- Working with data using Pig.
- Loading results back into an SQL database using Apache Sqoop
- And finally, how to industrialize this process.
In the real world you’ll most likely choose between Hive and Pig to keep your project simple. Having a chapter for Hive and another for Pig lets you see and compare both technologies, and helps you choose the one you feel more comfortable working with.
I also found it very interesting to use Apache Sqoop to get the data out of Hadoop and back into the SQL world.
I didn’t know about Sqoop before reading the book, and I was tempted to extract the data from Hadoop using a Talend job as a bridge. Don’t do it! Using Sqoop is much better because it can parallelize the load job. It reminds me of making backups with a disk array versus a server agent (you tell the array to do the backup on its own, instead of copying all the data to one point and moving it around).
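To sketch why Sqoop wins here: a single sqoop export command fans the transfer out across parallel map tasks running on the cluster itself. The connection string, table, and paths below are hypothetical, just to show the shape of the command:

```shell
# Hypothetical Sqoop export: push results sitting in HDFS back to an SQL database.
# --num-mappers splits the export across parallel map tasks, so the load is done
# by the cluster itself rather than funneled through a single bridge job.
sqoop export \
  --connect jdbc:mysql://dbserver/analytics \
  --username talend \
  --table tweet_sentiment \
  --export-dir /user/talend/sentiment_results \
  --num-mappers 4
```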
- Contexts! I’ve always thought the best part of Talend is contexts, and it’s great to see all the examples in the book using contexts from the beginning.
- In chapter 4 we learn how to use UDFs (user-defined functions) with Hive inside Talend. In the book, the problem they solve is Hive’s lack of regular-expression support; but it gives us a clue that may allow us to do something interesting with other kinds of data, like images or audio files.
- The way Talend works with Pig is easier than I expected. Why? Because you don’t need to know any Pig Latin to get results. I expected something more complicated. In fact, I think I’m going to use the tPig* components more often than the Hive ones.
- The chapter about using Sqoop with Talend: for me, this chapter alone justifies buying the book, because it saves you a lot of time.
- I discovered in the book that Talend doesn’t include all the JARs needed to work with Hadoop. This is not a technical problem per se, but a legal one: Talend cannot distribute the Hadoop files under its own license. Fortunately, the folks at Talend have made a one-click fix available.
- At first glance I found the book short. Maybe I’m used to technical books with a lot of literature, and this book has a very practical, how-to-make-things-happen approach. I hope to see a second edition soon with a chapter dedicated to Google BigQuery (which, by the way, is supported by Talend in the latest release with its own set of components).
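To make the Hive UDF idea above concrete, here is a minimal sketch (the class and method names are mine, not the book’s) of the kind of logic such a function wraps: a regular-expression extraction written in plain Java. In a real project this class would extend org.apache.hadoop.hive.ql.exec.UDF, be packaged as a JAR, and be registered in Hive with ADD JAR plus CREATE TEMPORARY FUNCTION before a Talend Hive job calls it:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the core logic of a Hive UDF for regex extraction.
// A real UDF would extend org.apache.hadoop.hive.ql.exec.UDF and
// Hive would call its evaluate() method once per row.
public class ExtractHashtag {
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    // Returns the first hashtag found in the tweet text, or null if none.
    public static String evaluate(String tweet) {
        if (tweet == null) {
            return null;
        }
        Matcher m = HASHTAG.matcher(tweet);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("Loving #Talend for #BigData"));
    }
}
```

Once registered, the function could be used directly in a Hive query, e.g. `SELECT ExtractHashtag(tweet_text) FROM tweets;`.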
Conclusion: a concise, hands-on book about data integration with Talend and Hadoop. Highly recommended, even if you just want to extract data from an existing Hadoop cluster.