This page last updated: 3 March 2018.

Back in 2017, at a Cambridge Wireless meeting, some colleagues of mine happened to be talking to Rob Morland, who is involved in something called the A1 Steam Locomotive Trust. The trust has built a brand new £3m steam locomotive, the Tornado, which runs on the national rail network. Rob was wondering whether it was possible to stream the live sound of the steam engine to those waiting at the trackside for it to come past. For some reason my colleagues pointed him at me, and hence a project was born.
The design is dictated by these requirements:
The architecture of the system looks something like this:
I'm a software engineer and so I wanted to, as far as possible, avoid any potential issues with hardware design. By far the simplest approach, especially since there was an intention to show Internet Of Things behaviours, is to use an I2S microphone such as the InvenSense ICS43434. Not much bigger than a grain of rice, this microphone can be powered from 1.8 to 3.6 Volts and provides a completely standard Philips-format I2S digital output that can be read by any microcontroller with an I2S interface. Audio is 24 bit and capture rates of 44 kHz or more are supported.
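Before writing any software, the microphone can be sanity-checked from the Linux command line. This is illustrative only and assumes the Raspberry Pi's I2S interface has been enabled and the microphone appears as ALSA card 1 (check with arecord -l); plughw lets ALSA convert the mic's native frame format down to 16-bit mono:

```
# Record five seconds of 16-bit, 16 kHz mono audio from the I2S
# microphone (the card number here is an assumption)
arecord -D plughw:1 -c 1 -r 16000 -f S16_LE -d 5 test.wav
```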
I experimented with various bit depths, capture frequencies and coding schemes. From a capture point of view, 24 bit is somewhat high so I compromised at 16 bit. In terms of capture frequencies, while 44 kHz is CD audio quality, that is again going to be too high for a cellular network and so I compromised at 16 kHz. With a raw PCM transport, ignoring overheads, this would require a constant 256 kbits/s (16 bits x 16,000 samples per second) uplink on the cellular interface. This is definitely on the large side: cellular networks may offer links of this bandwidth on the downlink but the uplink is an entirely different matter. However, I didn't want to go any lower than this in quality terms, and so the next variable is the audio coding scheme.
While it would be theoretically possible to MP3 encode at source, that is a processor intensive operation and MP3 is neither stream oriented nor loss tolerant; it is coded in blocks of 1152 samples and the audio content is interleaved across many blocks, so losing a single block has a disproportionate effect on the decoded audio.
Jonathan Perkins at work suggested I adopt a NICAM-like coding scheme. NICAM was the first scheme used by the BBC for broadcasting digital multi-channel audio at a controlled quality, allowing stereo audio to be broadcast for the first time. It also happens to be very well suited to embedded systems. Basically, a chunk of samples is taken and the peak value is worked out. Then all the samples are shifted down so that every sample in the block fits into the desired NICAM bit-width. The amount of shifting that was performed is included with the coded block. At the far end the block is reconstructed; any loss will always be in the lower bits of the block. With a relatively short block the "gain window" moves such that the loss is not noticeable. I chose an 8 bit NICAM width and a block duration of 1 ms (16 samples). For a 16 kHz sampling rate this results in an uplink rate of 132 kbits/s, which (by experiment) is bearable.
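To make this concrete, here is a minimal sketch in C of the encode and decode steps just described, assuming 16-bit input samples; the names and block layout are mine, not necessarily those of the actual code:

```c
#include <stdint.h>

#define SAMPLES_PER_BLOCK 16 /* 1 ms at 16 kHz */

/* One coded block: the shift amount plus the companded samples
 * (illustrative layout only) */
typedef struct {
    uint8_t shift;
    int8_t sample[SAMPLES_PER_BLOCK];
} NicamBlock;

/* Encode: find the peak, then shift every sample down just far
 * enough that the peak fits in 8 signed bits */
void nicamEncode(const int16_t *in, NicamBlock *out)
{
    int peak = 0;
    unsigned int shift = 0;

    for (int i = 0; i < SAMPLES_PER_BLOCK; i++) {
        /* Fold the negative range in (-32768 maps to 32767) */
        int mag = (in[i] < 0) ? -(in[i] + 1) : in[i];
        if (mag > peak) {
            peak = mag;
        }
    }
    while ((peak >> shift) > INT8_MAX) {
        shift++;
    }
    out->shift = (uint8_t) shift;
    for (int i = 0; i < SAMPLES_PER_BLOCK; i++) {
        out->sample[i] = (int8_t) (in[i] >> shift);
    }
}

/* Decode: scale back up; any loss is confined to the low bits,
 * which the encoder discarded */
void nicamDecode(const NicamBlock *in, int16_t *out)
{
    for (int i = 0; i < SAMPLES_PER_BLOCK; i++) {
        out[i] = (int16_t) (in->sample[i] * (1 << in->shift));
    }
}
```

If the shift value is carried in, say, a 4-bit field, each 1 ms block costs 16 x 8 + 4 = 132 bits, which is where the 132 kbits/s uplink figure comes from.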
In addition to the audio stream itself I borrowed from the
likes of RTP and included a sequence number, microsecond
timestamp and coding scheme indicator in the block header; I
called this URTP (u-blox Real Time Protocol).
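Based on the fields just listed, a URTP block header might look something like this in C; the field widths here are my guesses, the definitive wire format being in the source:

```c
#include <stdint.h>

/* Illustrative URTP block header; field sizes are assumptions */
typedef struct {
    uint8_t  codingScheme;   /* e.g. raw PCM versus NICAM-like */
    uint16_t sequenceNumber; /* increments per block, exposes loss */
    uint64_t timestamp;      /* microseconds, for reconstructing timing */
} UrtpHeader;                /* the coded audio block follows */
```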
I initially thought about using RTP or something similar but I really did NOT want to have to write a mobile application for this; it had to work out of the box with existing mobile devices. The answer to this turns out to be HTTP Live Streaming (HLS). This protocol, originally developed by Apple, chops up an audio stream into segment files, each a few seconds long, which are MP3 encoded but with a very specific header added so that the browser can reconstruct them. There is then an index file which lists the segments to the browser. No client application is required, just a browser; the browsers of all Apple and Android phones include HLS support.
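For illustration, an HLS index file for a stream of 1-second MP3 segments looks something like this (the segment names and sequence numbers are made up):

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:1
#EXT-X-MEDIA-SEQUENCE:120
#EXTINF:1.0,
segment120.mp3
#EXTINF:1.0,
segment121.mp3
#EXTINF:1.0,
segment122.mp3
```

The browser polls the index file and fetches each new segment as it appears; for a live stream the index is rewritten continuously, with old segments dropping off the top as new ones are added at the bottom.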
In the original Internet Of Things plan I had assumed that DTLS was going to be the security scheme of choice. However, I experimented with sending the uplink audio stream over UDP and found that there were relatively significant losses, several percent. Hence I decided that TCP was a better bet for the audio stream. Then there's also the issue of cellular networks: they sometimes perform deep packet inspection and deny service to things they decide don't meet their tariff model; they have quite active and unpredictable firewalls; and they don't allow incoming TCP connections (which will be needed for control operations).
Jonathan came to my rescue again here with the answer to all of these problems: SSH. SSH comes built into all Linux platforms and allows the setting up of secure tunnels between servers, even multi-hop, provided that you have an account on each of the machines, which can be certificate-based. You generate an SSH key on the Raspberry Pi and then push it to the server. The Raspberry Pi can then use SSH to set up tunnels from its port X to port Y on the server and, also, set up tunnels in the reverse direction, from port A on the server to port B on the Raspberry Pi. The tunnels are secure, can be configured to include keep-alives and restarts, and, should the private key on the Raspberry Pi ever be exposed, the server can simply remove the public key from its list of authorised keys.
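The two kinds of tunnel look something like this when run on the Raspberry Pi (the port numbers here are placeholders):

```
# Forward tunnel: traffic sent to port 5000 on the Pi emerges at
# port 5001 on the server (e.g. for the uplink audio stream)
ssh -N -o ServerAliveInterval=30 -L 5000:localhost:5001 user@server

# Reverse tunnel: traffic sent to port 6000 on the server emerges
# at port 6001 on the Pi (e.g. for incoming control operations)
ssh -N -o ServerAliveInterval=30 -R 6000:localhost:6001 user@server
```

In practice something like autossh, or a systemd service with automatic restart, would be used to re-establish the tunnels should they ever drop.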
At the simplest level the server can include a certificate so that an HTTPS connection is made, but that doesn't answer the problem of how permissions may be encoded for a paid-for service. This needs some thinking.
There are a few sources of latency:
Hence, in the case where there are no cellular outages, the delay is largely dependent upon the duration S of an MP3 segment file plus some browser/HTTP behaviour uncertainty. By experiment, with S set at 1 second (a 3.3 kbyte MP3 segment file, about the same length as the HLS index file in fact), a best-case end-to-end latency of around 2 seconds can be achieved (tested using Chrome on a PC as the receiving browser), though such short durations are likely to be unusual and warrant further testing.
As soon as there is a cellular outage, the effect is to
increase the latency. Buffers exist in the Raspberry Pi
(5 seconds) and in the HLS MP3 files (4 seconds). The
problem is that there is no easy way to reduce the latency
once there has been a build-up, at least not without slipping
the audio stream to catch up, which would be perceptible to
users. This needs more thought.
For initial testing, the hardware consists of a Raspberry Pi
B+ (which I happened to have in my cupboard), a microphone on a
flexible strip evaluation board connected via a break-out board,
and a u-blox 2G/3G modem board from Hologram called the
Nova. A 2G/3G modem draws more current than the Pi can
provide (close to 3 Amps peak) and so I used a Y cable that
allows me to provide separate power to the modem while
testing. Then I moved all of this to a Pi Zero W since
that should have sufficient processing power but is smaller and
more robust. Here I used a USB hub with an Ethernet
connector built in, as I wanted the flexibility of being able to switch on/off an auxiliary network connection to take over from cellular (and there's no physical switch to disable WiFi on the Pi Zero W).
I used a giffgaff (Telefonica network) SIM, since they offer an all-you-can-eat, pay-as-you-go data package for £20 per month (which works out at about 15p per hour of audio streaming and about 140 hours of streaming available per month).
The software comes in three parts, available on github.
There is, of course, quite a lot of configuration required on
the Raspberry Pi side (setting up SSH tunnels etc.), all of
which is covered in the README.md.
In order to meet the requirement that the only control of the
recording device is power on/off, the Raspberry Pi is also
configured to run from a read-only file system, preventing
potential SD card corruption from a disorganised shut-down.
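The README covers the exact steps; purely as an illustration, one common approach is to mark the root partition read-only in /etc/fstab and move anything that must remain writable, such as logs, onto a RAM disk:

```
# /etc/fstab: root filesystem mounted read-only, volatile
# directories moved to tmpfs (sketch only; device names vary)
/dev/mmcblk0p2  /         ext4   defaults,ro,noatime  0  1
tmpfs           /tmp      tmpfs  nodev,nosuid         0  0
tmpfs           /var/log  tmpfs  nodev,nosuid         0  0
```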
To be completed.