Recently I’ve been enjoying solving some Project Euler problems (in C again!).
My approach to solving these problems is (more or less) this:
- Figure out how ‘big’ the data is:
  - Does the result/partial data fit a particular datatype?
  - Will it overflow during calculations?
  - Might calculations accumulate errors after a lot of repetitions? (hint: floating point).
- Figure out how to do the task:
  - Brute force? (usually a bad idea)
  - Is there any algorithm that already fits that purpose? (answer: always yes)
  - Is it viable / worthwhile / interesting to code it? (what’s better: a Miller-Rabin primality test or a sieve of Eratosthenes?)
  - How can I get some test vectors? Wolfram Alpha and Maxima are great for this.
Some things I didn’t expect:
I didn’t remember how fast C is.
I find C fast even if I compile and run the executable with a one-liner like this (the equivalent of running a Ruby/PHP/Perl program):
gcc -Wall input.c ; ./a.exe
My C workflow is very different from my Ruby workflow
- When I code in Ruby, I usually have an IRB console next to the editor. I punch code into the REPL first, then I move it to the source file.
- In C, I think about the procedure and type the loops; then I adjust the variable values for boundary conditions.
In both cases I like having pencil and paper at hand.
I don’t miss IDEs
In the past I used Komodo IDE for PHP and I really enjoyed the experience, but for Ruby and C I find it more comfortable to use Sublime Text or vim.
I’ve tried Visual Studio 2014, but I don’t like creating a console application project for a program with a single source file.
I don’t miss the debugger
Back in 1991, I used the debugger in Turbo C for almost everything. That’s how I learnt to code in C, and I expected to do the same now.
To my surprise, I’ve started using a kind of test-driven development.
I thought writing tests before your program required some discipline… but if you throw away your debugger and your only tool is the humble printf(3) call, that discipline comes to you naturally.
The joy of programming
I love puzzles and I love programming.
Project Euler makes you think a lot about machine limits, and how to work with limited space and resources. I think this is great as a warm-up for serious Big Data or Cloud programming.
We’re not used to thinking of computing power and resources as _finite_, and it’s both humbling and refreshing to see it.
I didn’t expect T-Mobile to show a wifi-calling ad during the Super Bowl.
I learnt about wifi calling in November, talking with the guys at the Ericsson R&D center in Madrid.
The idea is very simple: instead of using a regular LTE connection, your device (iPhones and high-end Android devices) connects to the wifi you already have in your house and opens an encrypted tunnel to your mobile carrier’s infrastructure.
Wifi calling is great for telcos: with a software upgrade in their infrastructure they can offer good indoor voice coverage in places where the cell signal is poor and improving the network is not cost-effective, but which broadband already reaches (e.g. big residential areas).
The catch for telcos? During a wifi call you’re not roaming: the tunnel ends in your local operator’s infrastructure. This means it’s a local call.
I find this great for airports that give you some free wifi minutes (like Zurich and Madrid).
I’ve been playing a bit with the NetBSD rumpkernel, which I discovered some time ago on Hacker News.
I find it fascinating that an application can run with only the minimum OS code it needs.
In a rumpkernel the machine starts your application in kernel space. This has some interesting implications, as well as issues that you must address in the design phase.
- Better performance: less code to run means faster execution of the same application code.
- Lower costs (think about the cloud).
- Fast application startup times
- Security: less code = lower attack surface.
There are also some issues:
What’s my IP? OK, the kernel can get one for me at boot time with DHCP. Some hypervisors (like VMware) allow you to assign fixed IP addresses to virtual machines.
Storage: Where’s the data?
The rumpkernel *usually* boots from a CD/ISO image. There’s no writable data disk, except for a ramdisk built during the compilation.
To save some persistent data, you can install your rumpkernel into a virtual disk image (for instance .vmdk). This way you can package your application as a full virtual machine.
Manipulation: How do you configure a rumpkernel application?
We have three options:
- You can’t configure anything. Everything is static (great for demos, bad for production).
- Use an embedded snmp / rest / webservices layer inside the application. This looks a bit complex…
- Configure only on startup. Reboot if you need a configuration change.
The last option looks good: you can mount an NFS partition with the current configuration at boot time. Which one? The one you get from DHCP.
Inventory and cooperation
In case you need to run several instances of your rumpkernel, you will need to have some inventory.
If your application might run for a long time (minutes), a dynamic DNS update might be enough.
If you’re launching and destroying virtual machines like crazy, you’ll need something more sophisticated. A queue system might help to build the coordination mechanism.
You need a DHCP server, an NFS server, and a dynamic DNS or a message queue to deploy a multi-process rumpkernel application. Some orchestration software will help, too.
I can’t remember how old I was when I started coding. Maybe I was 9 or 10 years old.
Back then, we coded in BASIC directly on a Spectrum 48K computer via a horrible command-line interface that we found (now surprisingly) pleasant.
Nobody did anything to test their programs. You just ran it, typed something on the keyboard, and it worked (or not).
Nowadays things are not that simple. You’ve got to test your software a lot because you don’t have control over where and how your app will be used.
How to make software testable?
If you started coding back in the 80s, it’s almost impossible to resist the urge to build a quick’n dirty prototype just to see results (yep, there was no REPL available in those days).
So, after getting *some* results (without formal testing), it’s easy to convert your mini app into something testable.
This is the “trick”:
- Refactor your quick’n dirty prototype into two pieces:
  - A shell with minimal functionality that looks like pseudocode. If you’re coding in C, your app’s main() function will be there.
  - A bunch of simple code that you can test (and hopefully reuse in the future).
Strip everything from your codebase until it looks like pseudocode. This pile of code is almost untestable. Just give it a formal technical review and let it live (or die if it’s not okay).
All code that doesn’t read like English or pseudocode should be “promoted” to a function and stored in a separate file.
Compile and run the app. Everything works as before, right?
When you have everything modeled as a function, testing is easy: just make a program that feeds that function with all the data you can imagine and compare it to the expected result. You don’t need a testing framework to do this: a simple program might do it.
BTW – if your code has global variables or uses singletons, you should refactor your logic to remove them before making anything “testable”.
I’ve just finished reading a review copy of Talend for Big Data, courtesy of Packt Publishing. I’ve been using Talend for ETL and automation tasks for some years and I wanted to start using it to feed data into a small Hadoop cluster we have, so I think I can easily put myself in the shoes of this book’s readers.
Book structure: a journey in Big Data
I enjoyed that the book follows a real use case of sentiment analysis using Twitter data: I was getting tired of the word counting / term extraction examples found in other Hadoop texts.
Although the book doesn’t describe in depth how to get the data from the Twitter API using a Talend component (there are many available for this task), I think the information is enough to follow the steps in the book: keep in mind the use case is an excuse to work with Talend and big data.
The structure is very straightforward and it closely resembles a real-world Big Data integration job:
- The basics: what’s Talend, what’s Hadoop, and how to get started (terminology and setup)
- How to get data into a Hadoop cluster (there’s a component for that: tHDFSOutput)
- Working with tables in Talend using Hive.
- Working with data using Pig.
- Loading results back to an SQL database using Apache Sqoop
- And finally, how to industrialize this process.
In the real world you’ll surely choose between Hive and Pig to make your project simpler. Having one chapter for Hive and another for Pig lets you see and compare both technologies and helps you choose the one you feel more comfortable working with.
I also found it very interesting to use Apache Sqoop to get the data out of Hadoop and back into the SQL world.
I didn’t know about Sqoop before reading the book and I was tempted to extract the data from Hadoop using a Talend job as a bridge. Don’t do it! Using Sqoop is much better because it can parallelize the load job. It reminds me of making backups using a disk array vs. using a server agent (just tell the array to do the backup on its own vs. copying all the data to one point and moving it around).
- Contexts! I’ve always thought the best part of Talend is contexts, and it’s great to see all the examples in the book using contexts from the beginning.
- In chapter 4 we learn how to use UDFs (user-defined functions) with Hive inside Talend. In the book, the problem it solves is that Hive does not support regular expressions, but it gives us a clue that may allow us to do something interesting with other kinds of data, like images or audio files.
- The way Talend works with Pig is easier than I expected. Why? Because you don’t need to know anything about Pig Latin code to get results. I expected something more complicated. In fact, I think I’m going to use tPig* components more frequently than the Hive ones.
- The chapter about using Sqoop with Talend. For me, this chapter alone justifies buying the book because it saves you a lot of time.
- I discovered in the book that Talend doesn’t include all the JARs needed to work with Hadoop. This is not a technical problem per se, but a legal one: Talend cannot distribute the Hadoop files under their own license. Fortunately the guys from Talend have made a one-click fix available.
- At first glance I found the book short. Maybe I’m used to technical books with a lot of literature, and this book has a very practical how-to-make-things-happen approach. I hope to see a second edition soon with content dedicated to Google BigQuery (which, by the way, is supported by Talend in the latest release with its own set of components).
Conclusion: a concise, hands-on book about data integration with Talend and Hadoop. Highly recommended even if you just want to extract data from an existing Hadoop cluster.