Sunday, 21 August 2016

Simple Website in Node.js for your Raspberry Pi 3

Now that I have a cluster of Raspberry Pis, the possibilities are endless. In this article I will show you how to create a simple website using Node.js with Express, Stylus and Pug.

Node.js is a platform built on Chrome's JavaScript runtime for easily building fast and scalable network applications. Express is a fast web framework for Node.js, Stylus is an innovative style-sheet language that compiles down to CSS, and Pug is a succinct language for writing HTML templates.

To ease the pain of working with a Raspberry Pi, I will first show you how to set up remote desktop so you can code your website on the Pi itself without having to use the console.

The following operations have to be applied to every member of the cluster so that remote desktop is available on every node. The idea is to access each node via remote desktop from a laptop running Windows 10.

Configure Remote desktop on the Raspberry Pi

The first thing to do is to update Raspbian and make sure that all the packages are upgraded. Start by refreshing the package lists with the following command:

sudo apt-get update

Next, run the following command to upgrade any packages installed on your system that need upgrading:

sudo apt-get dist-upgrade

You'll have to perform this operation on all your cluster nodes. Now we are ready to install the packages we need: xrdp and samba. xrdp is an open source Remote Desktop Protocol (RDP) server and Samba is the standard Windows interoperability suite of programs for Linux and Unix.

Run the following command so we can connect to the Raspberry Pis via remote desktop:

sudo apt-get -y install xrdp

The '-y' option will automatically answer yes to the default continue [Y/n] question.

The next step is to install the samba package so that we can access each Raspberry Pi from Windows by its host name rather than by its IP address, which changes because the node receives its IP address via DHCP:

sudo apt-get -y install samba

After the installation is successful, you should be able to ping the Raspberry Pi from your Windows machine and connect to it via remote desktop:

If everything is configured correctly you should see the following screens:

To copy files from my Win10 machine to one of the nodes I use WinSCP. You can also create a shared folder on the Pi that's visible on your Win10 machine using Samba.

Now it's time to configure the rest.

Install Node.js

By default there is a pre-installed version of Node.js on the Raspberry Pi. If you type node -v you'll see that the version of Node.js is v0.10.29. We need to upgrade to a more recent version (v6.3.1 at the time I published this article).

Now our Raspberry Pi is ready for action. Let's look at the next steps to create our website.

To make things easier for you, I've created a project on GitHub that contains a sample website that you can use as a starting point. Installing the required dependencies through npm is a bit cumbersome, so starting from a sample project saves some effort.

Here is a screenshot of the site once it is up and running:

I'm using the site as a Raspberry Pi status monitor: a bit of JavaScript pings each node on the grid, and I use Knockout.js to bind the results to the page.

Once you've downloaded the repository, you only need to run the following commands to install the dependencies and run the website:

You can follow the instructions on my GitHub project:

The site in action:

Here is the list of installed packages for your reference:


Sunday, 14 August 2016

Raspberry Pi 3 Cluster Test

The following article describes a simple test that was executed on a 4-node (1 controller + 3 workers) Raspberry Pi cluster. The purpose is to obtain reproducible measurements of MPI performance that can be useful to MPI developers. If you haven't read my article about building a Raspberry Pi 3 cluster for parallel programming, you can find it here. The test is a matrix multiplication in which each node performs the calculations for a slice of the matrix and sends the results back to the main node. The test plays with 2 main variables: a) the size of the matrix and b) the number of nodes used to perform the calculations. This should give us the time for each calculation and the speedup.

If you haven't seen my cluster yet, here is an image:

Test Description

The test consists of the following:
The application generates two square (NxN) matrices, A and B, whose size is defined via arguments. Matrix B is visible to every node by default, so we save the time of sending it to each node. Matrix A is generated on the master node, sliced into several chunks (the slicing is calculated on the master node), and each chunk is sent to an individual node of the cluster. Once each node has finished its calculations, it sends its results back to the master node, which combines them and presents the resulting matrix.

The slicing mechanism works as follows:

For the example above, imagine that we have a square matrix of size 6x6. We have 4 nodes in our cluster but only three of them are available for calculations. Node 0, the master, is only there to arrange the initial calculations, send the values to each node, gather the results from each individual node and display them.

The architecture is quite simple but very common in these scenarios. The beauty of it is that we can increase the number of nodes in the cluster without having to change a single line of code in the application.

As we have a 6x6 matrix, we need to split it by the number of nodes available for calculations. Notice that the size of the matrix needs to be divisible by that number of nodes. In this case we have 6 rows and 3 nodes, so there will be 2 rows of data for each node.
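
The slicing described above can be sketched in a few lines. This is a single-process illustration written in Python for brevity, with no MPI involved; the function names (slice_rows, multiply_rows, simulated_cluster_multiply) are hypothetical and are not taken from the project, whose code may be organised differently.

```python
# Single-process sketch of the row-slicing scheme (no MPI involved).
# All function names are hypothetical and only serve the illustration.

def slice_rows(n, workers):
    """Return the (start, end) row range for each worker; end is exclusive."""
    if n % workers != 0:
        raise ValueError("matrix size must be divisible by the number of workers")
    rows_per_worker = n // workers
    return [(w * rows_per_worker, (w + 1) * rows_per_worker)
            for w in range(workers)]


def multiply_rows(a_rows, b):
    """Multiply a block of rows of A by the full matrix B."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a_rows]


def simulated_cluster_multiply(a, b, workers):
    """Slice A by rows, hand each slice to a 'worker', then gather the results."""
    result = []
    for start, end in slice_rows(len(a), workers):
        # On the real cluster this block of rows would be computed on a worker node.
        result.extend(multiply_rows(a[start:end], b))
    return result


print(slice_rows(6, 3))  # [(0, 2), (2, 4), (4, 6)]
```

For the 6x6 example with 3 workers, each worker receives 2 consecutive rows of A. Because every slice is multiplied by the whole of B, the master only has to stack the partial results in order.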


You can find all the source code and results on my GitHub project:

In there you will find the source code, the shell scripts that I used to run the tests, the logs and the Excel files that I used to gather all the details from each node.

The first step is to calculate the matrix multiplication using just 1 node and then measure the speedup gained by using additional nodes.

The sizes of the matrices for this test are defined below:

  • 12x12
  • 60x60
  • 144x144
  • 216x216

Each matrix will be run against 3 nodes, using from 1 to 4 CPUs on each node. Every cycle of the application is run 10 times and we use the average value for our results.

Here are the results for the calculations above against 1 node:

Time is in seconds and we can see that the bigger the matrix, the longer it takes to be multiplied. Remember that the complexity of a matrix multiplication is O(n^3). We can easily see how the graph approximates a cubic function. Increasing the size of the matrix by just 50% increased the calculation time by 300%.
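
As a rough cross-check of the cubic growth, the small Python sketch below computes how much the amount of work grows between consecutive matrix sizes under an idealised n^3 model. The helper name predicted_time_ratio is hypothetical. This is a model of the operation count only, so measured times, which also include memory and system overheads, need not match it exactly.

```python
# Idealised n**3 cost model: ratio of work between two square matrix sizes.
# predicted_time_ratio is a hypothetical helper, not part of the project.

def predicted_time_ratio(n_old, n_new):
    """Return how many times more multiply-add operations n_new needs than n_old."""
    return (n_new / n_old) ** 3


sizes = [12, 60, 144, 216]  # matrix sizes used in the test
for small, large in zip(sizes, sizes[1:]):
    ratio = predicted_time_ratio(small, large)
    print(f"{small}x{small} -> {large}x{large}: about {ratio:.3f}x the work")
```

For the step from 144 to 216 (a 50% increase in size) the model gives 1.5^3 = 3.375 times the work.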

Here you can see the calculation that the application performs:
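
The listing itself is not reproduced in this text. As a stand-in, here is a minimal sketch of the standard triple-loop multiplication of two square matrices, written in Python for brevity. It is an assumption about the general shape of the calculation, not the project's own code, and the name matmul is hypothetical.

```python
# Textbook triple-loop multiplication of two square matrices (lists of lists).
# Hypothetical stand-in for the project's kernel, which may differ in detail.

def matmul(a, b):
    """Return the product of the square matrices a and b."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):          # row of A
        for j in range(n):      # column of B
            total = 0.0
            for k in range(n):  # dot product of row i and column j
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```

The three nested loops over n are where the O(n^3) cost comes from.
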
Let's see what happens when we run the same matrices against our cluster:

Matrix multiplication against 3 nodes (1 CPU each):

As expected, we've reduced the execution time of our calculations to roughly one third.

Let's see what happens when we introduce more CPUs:

Matrix multiplication against 3 nodes (2 CPUs each):

Matrix multiplication against 3 nodes (3 CPUs each):

Matrix multiplication against 3 nodes (4 CPUs each):

Notice that the Raspberry Pi 3 has 4 CPUs and we can control the number of CPUs used through the machinefile and MPI. Each CPU is defined as a node in my machinefile, and I monitored the nodes to make sure that every CPU was working. Below is a graph showing all four CPUs working on one of my nodes while running the 216x216 experiment on 12 CPUs:

Here you can see an example running 3 CPUs on each Pi. Notice how the CPUs reach 100% on each Pi.

Here is a sample script to grab the CPU usage on Linux:
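
The script itself is not reproduced in this text. The following is a minimal sketch of one common approach, written in Python: read the per-CPU counters from /proc/stat twice and compare the idle time with the total time between the two samples. All function names are hypothetical, and this is not necessarily how the original script works.

```python
# Sketch: per-CPU usage on Linux from two samples of /proc/stat.
# Function names are hypothetical; the original script may work differently.
import time


def parse_cpu_times(stat_line):
    """Parse one 'cpu...' line of /proc/stat into (idle, total) jiffies."""
    # Fields after the label: user nice system idle iowait irq softirq steal ...
    values = [int(v) for v in stat_line.split()[1:9]]
    idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
    return idle, sum(values)


def cpu_usage_percent(before, after):
    """Return the busy percentage between two (idle, total) samples."""
    idle_delta = after[0] - before[0]
    total_delta = after[1] - before[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * (1.0 - idle_delta / total_delta)


def read_cpu_times(path="/proc/stat"):
    """Return {label: (idle, total)} for the 'cpu', 'cpu0', 'cpu1', ... lines."""
    with open(path) as stat_file:
        return {line.split()[0]: parse_cpu_times(line)
                for line in stat_file if line.startswith("cpu")}


def sample_cpu_usage(interval=1.0):
    """Sample /proc/stat twice, interval seconds apart, and return usage per CPU."""
    before = read_cpu_times()
    time.sleep(interval)
    after = read_cpu_times()
    return {label: cpu_usage_percent(before[label], after[label])
            for label in before if label in after}
```

Calling sample_cpu_usage() on a node returns a dictionary with one entry for the aggregate "cpu" line and one for each core ("cpu0" to "cpu3" on a Raspberry Pi 3), which can be logged while the test runs.
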
If we group the graphs together we have:

We can see that the highest throughput is achieved by splitting the matrix across as many nodes as possible and returning the results. Notice that the time does not scale linearly in this case: we would expect the calculation to go down to 4s, but it goes down to 8s instead (216x216 against 12 CPUs). Running MPI introduces an overhead, and this needs to be taken into consideration. In any case, the throughput can be seen in the following figure representing the speedup:

Using 4 CPUs per node gives the highest throughput, with a speedup of 6.34. Speedup is calculated as SeqTime / ParaTime. With this configuration we achieve a time reduction of about 85% for our calculations, allowing us to perform large calculations in a matter of seconds.
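
The relationship between speedup, time reduction and parallel efficiency can be made explicit with a short Python sketch. The helper names are hypothetical; the only inputs taken from the article are the speedup of 6.34 and the 12 CPUs (3 nodes with 4 CPUs each).

```python
# Speedup, time reduction and parallel efficiency.
# Helper names are hypothetical; 6.34 and 12 CPUs come from the results above.

def speedup(seq_time, para_time):
    """Speedup = SeqTime / ParaTime."""
    return seq_time / para_time


def time_reduction(speedup_value):
    """Fraction of the sequential time that is saved: 1 - ParaTime / SeqTime."""
    return 1.0 - 1.0 / speedup_value


def efficiency(speedup_value, cpus):
    """Speedup per CPU; 1.0 would mean perfectly linear scaling."""
    return speedup_value / cpus


measured = 6.34
print(f"time reduction: {time_reduction(measured):.1%}")          # 84.2%
print(f"efficiency on 12 CPUs: {efficiency(measured, 12):.1%}")   # 52.8%
```

An efficiency of roughly one half on 12 CPUs is consistent with the observation that the run time drops to about twice the ideal value, the gap being the overhead discussed above.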

There are loads of tests still to perform on the cluster, and this is just a simple example of how to turn a calculation into a parallel one. Please see my project on GitHub for more info and reference.