Raw Binary Data in Julia
Keith Rutkowski
Binary data in Julia
Building on what was introduced in a previous article, we will now detail how to easily create simple and efficient binary IO code in Julia. Binary data formats are often encountered in file and network IO. Low-level C libraries usually exist for common data formats, but licensing differences, incompatible versions, or portability issues can make them unreliable to depend on. Working with binary data in C is rather easy because a user-defined type can be declared so that its memory layout matches the data layout.
Unfortunately, Julia offers no built-in support for directly working with packed binary data in an efficient way. In contrast, the Python Standard Library provides the struct module with facilities to pack and unpack binary data. We will demonstrate below how CBinding.jl, which was created to add proper support for C constructs to Julia, is an essential tool when working with binary data in Julia.
Features covered in this article include:
- creating an object from an IO stream,
- performing both in-memory and direct on-disk data manipulation,
- efficient zero-copy IO, and
- byte alignment and bit packing capabilities.
A simple example
The WAV audio file format will be used here since it is a simple and ubiquitous file format.
The format is binary of course, but the header data doesn’t require any bit packing, byte alignment, or changes to byte order.
By using CBinding.jl and the @cstruct macro it provides, we define a Julia type that is analogous to the C type shown in the comment to the right.
julia> using CBinding
julia> @cstruct WAV_header {      # struct WAV_header {
    riff::UInt8[4]                #     uint8_t riff[4];
    fileSize::UInt32              #     uint32_t fileSize;
    fileHeader::UInt8[4]          #     uint8_t fileHeader[4];
    fmtMarker::UInt8[4]           #     uint8_t fmtMarker[4];
    fmtLength::UInt32             #     uint32_t fmtLength;
    fmtType::UInt16               #     uint16_t fmtType;
    dataChannels::UInt16          #     uint16_t dataChannels;
    dataSampleRate::UInt32        #     uint32_t dataSampleRate;
    dataBytesPerSecond::UInt32    #     uint32_t dataBytesPerSecond;
    dataBytesPerSample::UInt16    #     uint16_t dataBytesPerSample;
    dataBitsPerSample::UInt16     #     uint16_t dataBitsPerSample;
    dataHeader::UInt8[4]          #     uint8_t dataHeader[4];
    dataSize::UInt32              #     uint32_t dataSize;
}                                 # };
Next, we open a sample WAV file and read the header exactly as it was defined.
julia> header = open("sample.wav") do io
    read(io, WAV_header)
end;
julia> header.fileHeader[] |> String
"WAVE"
julia> header.dataBitsPerSample |> signed
16
julia> header.dataChannels |> signed
1
julia> header.dataSampleRate |> signed
22050
shell> file sample.wav
sample.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 22050 Hz
Comparing the values of the header’s fields with what is reported by the file command indicates a successful parsing of the binary header data.
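For comparison, the following is a minimal sketch of reading the same header fields one at a time using only Base Julia, which is the kind of code a struct definition saves us from writing. The helper name read_wav_fields is hypothetical, and the header is built in an in-memory IOBuffer with values chosen to resemble the sample file, so that the snippet does not depend on sample.wav being present.

```julia
# Build a 44-byte WAV header in memory (illustrative values, little-endian).
io = IOBuffer()
write(io, "RIFF"); write(io, htol(UInt32(440670))); write(io, "WAVE")
write(io, "fmt "); write(io, htol(UInt32(16)))
write(io, htol(UInt16(1)))        # fmtType: PCM
write(io, htol(UInt16(1)))        # dataChannels
write(io, htol(UInt32(22050)))    # dataSampleRate
write(io, htol(UInt32(44100)))    # dataBytesPerSecond
write(io, htol(UInt16(2)))        # dataBytesPerSample
write(io, htol(UInt16(16)))       # dataBitsPerSample
write(io, "data"); write(io, htol(UInt32(440634)))
seekstart(io)

# Hypothetical helper: read selected fields in file order, one at a time.
function read_wav_fields(io::IO)
    riff = String(read(io, 4))
    fileSize = ltoh(read(io, UInt32))
    fileHeader = String(read(io, 4))
    skip(io, 8)                   # fmtMarker (4 bytes) and fmtLength (4 bytes)
    fmtType = ltoh(read(io, UInt16))
    channels = ltoh(read(io, UInt16))
    sampleRate = ltoh(read(io, UInt32))
    skip(io, 6)                   # dataBytesPerSecond (4) and dataBytesPerSample (2)
    bitsPerSample = ltoh(read(io, UInt16))
    return (riff = riff, fileSize = fileSize, fileHeader = fileHeader,
            fmtType = fmtType, channels = channels,
            sampleRate = sampleRate, bitsPerSample = bitsPerSample)
end

fields = read_wav_fields(io)
```

Every field needs its own read call and the offsets have to be tracked by hand, which becomes error prone as formats grow.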
Changing byte order
In our example the file format byte order (little-endian) happens to be the same as the host system’s byte order, but that is not always the case. Ensuring the correct byte order results in safer, more portable code, and it is easy to do using the following Julia functions:
- ltoh(x): Convert x from little-endian byte order to host’s byte order.
- ntoh(x): Convert x from big-endian byte order to host’s byte order.
- htol(x): Convert x from host’s byte order to little-endian byte order.
- hton(x): Convert x from host’s byte order to big-endian byte order.
So, to more correctly read the header’s fields, the code would look like this:
julia> header.dataBitsPerSample |> ltoh |> signed
16
julia> header.dataChannels |> ltoh |> signed
1
julia> header.dataSampleRate |> ltoh |> signed
22050
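As a small illustration of what these conversions do, the sketch below inspects the in-memory bytes of a value after converting it to each byte order. It uses only Base functions; the variable names are made up for the example.

```julia
x = 0x12345678                          # a UInt32 literal

# View the in-memory bytes of the converted values.
be = collect(reinterpret(UInt8, [hton(x)]))   # big-endian (network) layout
le = collect(reinterpret(UInt8, [htol(x)]))   # little-endian layout

# Base.ENDIAN_BOM is 0x04030201 on little-endian hosts.
host_is_little = Base.ENDIAN_BOM == 0x04030201

# Converting to a byte order and back always returns the original value.
roundtrip_ok = ntoh(hton(x)) == x && ltoh(htol(x)) == x
```

Whatever the host's byte order, be holds the most significant byte first and le holds the least significant byte first, which is why applying the conversions unconditionally keeps the code portable.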
Wrapping byte arrays
Occasionally the layout of binary data is not static and depends on values found within the data itself.
In such cases, the user would read some block of data and then (re-)interpret that byte array as it is inspected.
CBinding.jl also provides an unsafe_wrap method to create a user-defined view of a byte array.
It does not take ownership of the data, so a reference to the original array must be kept alive for as long as the wrapped object is in use.
julia> data = Vector{UInt8}(undef, sizeof(WAV_header));
julia> open("sample.wav") do io
    readbytes!(io, data)
end;
julia> header = unsafe_wrap(WAV_header, pointer(data));
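The lifetime concern can be illustrated with Base Julia alone. The sketch below is not CBinding.jl's mechanism; it assumes a hypothetical two-field ChunkHeader type and loads it from a raw pointer, using GC.@preserve to keep the byte array alive while the pointer is in use. It also shows reinterpret as a safer alternative for simple layouts without padding.

```julia
# Hypothetical 8-byte chunk header; two UInt32 fields, so no padding is involved.
struct ChunkHeader
    id::UInt32
    size::UInt32
end

# Bytes of a "data" chunk header followed by a little-endian size of 440634.
data = UInt8[0x64, 0x61, 0x74, 0x61, 0x3a, 0xb9, 0x06, 0x00]

# unsafe_load copies the struct out of the buffer; GC.@preserve guarantees that
# `data` is not garbage collected while its raw pointer is being used.
hdr = GC.@preserve data unsafe_load(Ptr{ChunkHeader}(pointer(data)))

# reinterpret gives a bounds-checked view of the same bytes without raw pointers.
hdr_view = reinterpret(ChunkHeader, data)[1]

chunk_id = String(data[1:4])          # copy of the first four bytes as text
chunk_size = Int(ltoh(hdr.size))      # size field converted to host byte order
```

Once the chunk identifier and size are known, the bytes that follow can be interpreted according to the chunk type, which is the dynamic-layout case described above.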
Efficient zero-copy IO
By combining memory mapping with the facilities presented so far, we can achieve optimal IO performance. Memory mapping a file essentially makes the contents of the file accessible as a byte array in memory. The operating system handles the mapping and transparently performs the reads and writes, so you don’t actually need to read the whole file into memory to get high performance in random access use cases.
Julia has the standard library package Mmap that provides the mmap function.
We use it below to create a byte array mapped to an on-disk file and then use unsafe_wrap to interpret the byte array as a WAV_header object.
julia> using Mmap
julia> data = open("sample.wav", "r+") do io
    Mmap.mmap(io, Vector{UInt8}, 256)
end;
julia> header = unsafe_wrap(WAV_header, pointer(data));
julia> header.fileSize |> ltoh |> signed
440670
We can even update the file on-disk simply by changing the header’s fields.
shell> hexdump --canonical --length=48 sample.wav
00000000 52 49 46 46 5e b9 06 00 57 41 56 45 66 6d 74 20 |RIFF^...WAVEfmt |
00000010 10 00 00 00 01 00 01 00 22 56 00 00 44 ac 00 00 |........"V..D...|
00000020 02 00 10 00 64 61 74 61 3a b9 06 00 00 00 00 00 |....data:.......|
julia> header.riff[1] = UInt8('r');  # a single byte needs no byte order conversion
julia> header.fileSize = UInt32(1000) |> htol;  # convert at the field's width
shell> hexdump --canonical --length=48 sample.wav
00000000 72 49 46 46 e8 03 00 00 57 41 56 45 66 6d 74 20 |rIFF....WAVEfmt |
00000010 10 00 00 00 01 00 01 00 22 56 00 00 44 ac 00 00 |........"V..D...|
00000020 02 00 10 00 64 61 74 61 3a b9 06 00 00 00 00 00 |....data:.......|
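The same write-through behavior can be tried without CBinding.jl or a real audio file. The following sketch assumes a temporary file holding only the first 12 bytes of a WAV header, maps it with Mmap.mmap, edits the mapped bytes directly, and then reads the file back to confirm that the change reached the disk.

```julia
using Mmap

# Create a temporary file containing "RIFF", a little-endian size, and "WAVE".
path, tmp = mktemp()
write(tmp, "RIFF")
write(tmp, htol(UInt32(440670)))
write(tmp, "WAVE")
close(tmp)

# Map the 12 bytes read-write; the mapping outlives the closed file handle.
mapped = open(path, "r+") do io
    Mmap.mmap(io, Vector{UInt8}, 12)
end

mapped[1] = UInt8('r')                                     # "RIFF" becomes "rIFF"
mapped[5:8] .= reinterpret(UInt8, [htol(UInt32(1000))])    # size field set to 1000
Mmap.sync!(mapped)                                         # flush changes to disk

on_disk = read(path)                                       # re-read the file contents
```

Calling Mmap.sync! forces the modified pages to be written out immediately; without it the operating system decides when the flush happens.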
Advanced usage
The basic facilities demonstrated above should already simplify your IO code. More advanced features provided by CBinding.jl include bit fields, field byte alignment, and packing strategies, all of which tend to be used more frequently in networking protocols. The definition of an IP header below, though it is rather contrived, illustrates some of these features.
julia> @cstruct IP_header {                 # struct IP_header {
    (vers:4, hdrLen:4, svc:8)::UInt32       #     uint32_t vers:4, hdrLen:4, svc:8;
    (len:16)::UInt32                        #     uint32_t len:16;
    (ident:16)::UInt32                      #     uint32_t ident:16;
    (ctrlFlags:3, fragOff:13)::UInt32       #     uint32_t ctrlFlags:3, fragOff:13;
    (ttl:8, proto:8)::UInt32                #     uint32_t ttl:8, proto:8;
    (hdrChksum:16)::UInt32                  #     uint32_t hdrChksum:16;
    srcAddr::UInt32                         #     uint32_t srcAddr;
    dstAddr::UInt32                         #     uint32_t dstAddr;
}  __packed__                               # } __attribute__((packed));
julia> data = zeros(UInt8, sizeof(IP_header));
julia> header = unsafe_wrap(IP_header, pointer(data));
julia> header.ctrlFlags = 0x7;
julia> header.len = 0x1234 |> hton;
julia> header.srcAddr = 0x7f000001 |> hton;
julia> header.dstAddr = 0x7f000001 |> hton;
julia> data'
1×20 LinearAlgebra.Adjoint{UInt8,Array{UInt8,1}}:
0x00 0x00 0x12 0x34 0x00 0x00 0x07 0x00 0x00 0x00 0x00 0x00 0x7f 0x00 0x00 0x01 0x7f 0x00 0x00 0x01
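To make clear what a bit field does for us, here is a rough manual equivalent of the ctrlFlags and fragOff pair written with shifts and masks in Base Julia. The helper names pack_flags and unpack_flags are hypothetical, and the bit placement (ctrlFlags in the low 3 bits of the 16-bit group) is an assumption taken from the byte dump above rather than from any documented layout rule.

```julia
# Pack a 3-bit value into the low bits and a 13-bit value into the high bits.
pack_flags(ctrlFlags, fragOff) =
    UInt16(ctrlFlags & 0x7) | (UInt16(fragOff & 0x1fff) << 3)

# Recover both values from the packed 16-bit word.
unpack_flags(w::UInt16) = (ctrlFlags = w & 0x0007, fragOff = w >> 3)

word = pack_flags(0x7, 0x0000)

# Little-endian in-memory bytes of the packed word.
bytes = collect(reinterpret(UInt8, [htol(word)]))

roundtrip = unpack_flags(pack_flags(0x5, 0x0123))
```

Here bytes reproduces the 0x07 0x00 pair seen in the dump. Writing such masks by hand for every field is tedious, which is the work the bit field syntax takes over.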
If you are considering the transition to Julia, but have several C libraries or binary file formats you depend on, we can help! Analytech Solutions offers your team many years of experience working with both Julia and C, and we can streamline your transition process. Please contact us for more information!
Keith Rutkowski is a seasoned visionary, inventor, and computer scientist with a passion to provide companies with innovative research and development, physics-based modeling and simulation, data analysis, and scientific or technical software/computing services. He has over a decade of industry experience in scientific and technical computing, high-performance parallelized computing, and hard real-time computing.