Tuesday, September 13, 2011

Offloading data from the JVM heap (a little experiment)

Last time, I wrote about the possibility of using Linux shared memory to offload cacheable/reference data from the JVM. To that end I wrote a small Java program to see if it was practical. The results were better (and in some cases stranger) than I had expected.

Here's what the test program does:

  • Create a bunch of java.nio.ByteBuffers that add up to 96MB of storage
  • Write ints starting from the first buffer, all the way to the last one - that's writing a total of 96MB of some contrived data
  • For each test, the buffer creation, writing and deletion are repeated 24 times to let the JIT warm up
  • For each such test iteration, measure the memory (roughly) used in the JVM heap, the time taken to create those buffers and the time taken to write 96MB of data
  • Obviously, there are things here that sound fishy to you - like why use ByteBuffers instead of just writing to an OutputStream or why write to the buffers in sequence. Well, my intentions were just to get a ballpark figure as to the performance and the viability of moving data off the JVM heap
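The loop described above might look roughly like the sketch below. This is not the original JavaBufferTest; the class name BufferWriteSketch, its method names and the 8MB total (instead of 96MB, to keep the run short) are assumptions made for illustration. It uses direct buffers as a stand-in; swapping in ByteBuffer.allocate() would give the heap baseline. It needs Java 8 or later for Integer.BYTES.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of the measurement loop; not the original test program. */
class BufferWriteSketch {

    /** Allocates totalBytes of storage as direct buffers of blockSize bytes each. */
    static ByteBuffer[] createBlocks(int totalBytes, int blockSize) {
        ByteBuffer[] blocks = new ByteBuffer[totalBytes / blockSize];
        for (int i = 0; i < blocks.length; i++) {
            blocks[i] = ByteBuffer.allocateDirect(blockSize);
        }
        return blocks;
    }

    /** Fills every block with ints, first block to last; returns the bytes written. */
    static long fill(ByteBuffer[] blocks) {
        long written = 0;
        int value = 0;
        for (ByteBuffer block : blocks) {
            block.clear();
            while (block.remaining() >= Integer.BYTES) {
                block.putInt(value++);
                written += Integer.BYTES;
            }
        }
        return written;
    }

    public static void main(String[] args) {
        int totalBytes = 8 * 1024 * 1024; // assumption: smaller than the 96MB used in the post
        int blockSize = 4096;
        int iterations = 24;              // as in the post, to let the JIT warm up

        Runtime rt = Runtime.getRuntime();
        for (int i = 0; i < iterations; i++) {
            long heapBefore = rt.totalMemory() - rt.freeMemory();
            long start = System.nanoTime();

            ByteBuffer[] blocks = createBlocks(totalBytes, blockSize);
            long written = fill(blocks);

            long millis = (System.nanoTime() - start) / 1_000_000;
            long heapAfter = rt.totalMemory() - rt.freeMemory();
            // The heap delta is only a rough figure: a GC during the iteration skews it.
            System.out.printf("iteration=%d bytes=%d millis=%d heap_delta_bytes=%d%n",
                    i, written, millis, heapAfter - heapBefore);
        }
    }
}
```

Dropping the references to the blocks at the end of each iteration stands in for the "deletion" step; the off-heap memory is only released once the buffer objects are garbage collected.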
About the test:
  • There are really 5 different ways to create the buffers. Then there are 2 variations of these tests in which the buffer sizes vary (blocks), but the total bytes written are the same
  • The buffers (blocks) for each variation are created as:
    • Ordinary HeapByteBuffers inside the JVM heap itself - as a baseline for performance
    • DirectByteBuffers
    • A file created on an ext4 file system using RandomAccessFile, with parts of the file memory mapped through its FileChannel. The file is opened in "rw" mode; the other options are "rwd" and "rws"
    • The same as above, but the file resides in /dev/shm, the in-memory, shared memory virtual file system (Tmpfs)
    • Buffers created using Apache's Tomcat Native libraries, which in turn use the Apache Portable Runtime libraries. The shared memory (Shm) feature was used to create the buffers. This is similar to DirectByteBuffers, but the buffers reside in a common area in OS memory that is not owned by any one process and can be shared between processes (similar to /dev/shm but without the filesystem wrapper overhead)
  • The test machine was my moderately powered Windows 7 home laptop with 8GB RAM and a 2.3GHz i5, running a Cloudera Ubuntu Linux VM in VMware Player. There were a few other processes running, but nothing that was using the CPU extensively. 500MB+ of memory was free and available
  • The VM had 1GB RAM and the JVM heap was 256MB
  • The test program was run once for each configuration, but each test itself ran 24 times to allow the JIT to warm up and the file system caches to stay warm where needed
  • The test prints out the timings with headers which were then compiled into a single text file and then analyzed in RStudio
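For readers who have not used these APIs, here is a minimal sketch of how the first four kinds of buffers could be created. The class BufferFactorySketch and its method names are hypothetical and are not taken from the test sources; the Apache Shm variant is left out because it needs the Tomcat Native bindings. The memory-mapped variant is the same code for ext4 and /dev/shm, only the file path differs.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical sketch of the buffer-creation variants; not the original test code. */
class BufferFactorySketch {

    /** Baseline: ordinary buffers backed by byte[] arrays inside the JVM heap. */
    static ByteBuffer[] heapBlocks(int count, int blockSize) {
        ByteBuffer[] blocks = new ByteBuffer[count];
        for (int i = 0; i < count; i++) {
            blocks[i] = ByteBuffer.allocate(blockSize);
        }
        return blocks;
    }

    /** Direct buffers: the storage lives outside the JVM heap. */
    static ByteBuffer[] directBlocks(int count, int blockSize) {
        ByteBuffer[] blocks = new ByteBuffer[count];
        for (int i = 0; i < count; i++) {
            blocks[i] = ByteBuffer.allocateDirect(blockSize);
        }
        return blocks;
    }

    /** Maps consecutive blockSize-byte regions of one file opened in "rw" mode. */
    static ByteBuffer[] mappedBlocks(String path, int count, int blockSize) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw");
             FileChannel channel = file.getChannel()) {
            ByteBuffer[] blocks = new ByteBuffer[count];
            for (int i = 0; i < count; i++) {
                // A READ_WRITE mapping grows the file when the region lies past its end.
                blocks[i] = channel.map(FileChannel.MapMode.READ_WRITE,
                        (long) i * blockSize, blockSize);
            }
            // A mapping stays valid after the channel that created it is closed.
            return blocks;
        }
    }

    public static void main(String[] args) throws IOException {
        int count = 4;
        int blockSize = 4096;

        System.out.println("heap   isDirect=" + heapBlocks(count, blockSize)[0].isDirect());
        System.out.println("direct isDirect=" + directBlocks(count, blockSize)[0].isDirect());

        File onDisk = File.createTempFile("mmap_test", ".dat"); // lands in java.io.tmpdir
        onDisk.deleteOnExit();
        System.out.println("mapped blocks=" + mappedBlocks(onDisk.getPath(), count, blockSize).length);

        File shmDir = new File("/dev/shm"); // only present on Linux
        if (shmDir.isDirectory()) {
            File inShm = File.createTempFile("mmap_test", ".dat", shmDir);
            inShm.deleteOnExit();
            System.out.println("tmpfs  blocks=" + mappedBlocks(inShm.getPath(), count, blockSize).length);
        }
    }
}
```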


  block_size                      test_type perctile95_buffer_create_and_work_time_millis perctile95_mem_bytes
1       4096                         direct                                       1555.65              3047456
2       4096 file_/dev/shm/mmap_test.dat_rw                                        661.70              3047632
3       4096          file_mmap_test.dat_rw                                       2055.75              3047632
4       4096                           heap                                       1071.15            102334496
5    4194304                         direct                                        653.85                 3008
6    4194304 file_/dev/shm/mmap_test.dat_rw                                        561.40                 3184
7    4194304          file_mmap_test.dat_rw                                       3878.25                 3184
8    4194304                           heap                                       1064.80            100664960
9    4194304                            shm                                        678.40                 2496
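As a rough sanity check on the memory column, the heap cost per buffer can be backed out of the table with a little arithmetic. The sketch below assumes that exactly 96MB (100,663,296 bytes) of payload was written, that the payload sits in the JVM heap only for the heap test, and that the 95th percentile figure is representative; the class and method names are made up for illustration.

```java
/** Hypothetical helper: backs out the per-buffer heap overhead from the table above. */
class PerBlockOverheadSketch {

    static final long TOTAL_BYTES = 96L * 1024 * 1024; // 100,663,296 bytes of payload

    /**
     * @param measuredBytes      the perctile95_mem_bytes value from the table
     * @param payloadOnHeapBytes payload stored in the heap (0 for the off-heap tests)
     * @param blockSize          the block_size value from the table
     * @return heap bytes per buffer that are not payload, rounded down
     */
    static long overheadPerBlock(long measuredBytes, long payloadOnHeapBytes, long blockSize) {
        long blocks = TOTAL_BYTES / blockSize;
        return (measuredBytes - payloadOnHeapBytes) / blocks;
    }

    public static void main(String[] args) {
        // 4KB blocks: 24,576 buffers
        System.out.println(overheadPerBlock(3047456L, 0L, 4096L));            // prints 124
        System.out.println(overheadPerBlock(102334496L, TOTAL_BYTES, 4096L)); // prints 68
        // 4MB blocks: 24 buffers
        System.out.println(overheadPerBlock(3008L, 0L, 4194304L));            // prints 125
        System.out.println(overheadPerBlock(100664960L, TOTAL_BYTES, 4194304L)); // prints 69
    }
}
```

Under those assumptions the per-buffer cost is about the same at both block sizes (roughly 124 bytes for a direct buffer and 68 bytes for a heap buffer). The 4KB runs show about 3MB of heap use simply because they create 24,576 buffers instead of 24.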

Interpretation of the results:
  • The test where the block size was 4KB had quite a lot of memory overhead for the non-Java-heap ByteBuffers. Memory mapping was also slow for these small sizes, as the Javadoc for FileChannel.map() itself warns
  • The JVM heap test was slower than I had expected (for larger ByteBuffers). I was expecting it to be the fastest. Perhaps it was the small-memory (1GB) virtualized OS it was running in. For smaller block sizes approaching 1K or less, the JVM heap performance is unbeatable. But in these tests, the focus was on larger block sizes
  • The Apache shared memory test would not even start for the 4KB tests as it would complain about "not enough space"
  • Almost everything fared well in the larger 4MB test. The per-block overhead was less for the off-heap tests and also the performance was nearly identical for /dev/shm, Apache Shm and DirectByteBuffer
  • The sources for this test are available here.
  • To run all the tests except Apache Shm you only need to compile JavaBufferTest and run it with the correct parameters
  • To run all tests, you can use the sub-class AprBufferTest which can test Apache Shm and also the remaining tests. To compile this you'll need tomcat-coyote.jar from apache-tomcat-7.0+. To run this you'll need the Jar file and the Tomcat Native bindings - tcnative.dll or libtcnative for Linux
There are advantages to using ByteBuffer outside the Java heap:
  • The mapped file or the shared memory segments can outlive the JVM's life span. Another process can come in and attach to it and read the data
  • Reduces the GC pressure
  • Zero-copy transfers are possible to another file, network or device using FileChannel.transferTo()
  • Several projects and products have used this approach to host large volumes of data in the JVM
But there are drawbacks too:
  • The data has to be stored, read and written using primitives to the ByteBuffer - putInt(), putFloat(), putChar() etc
  • Java objects cannot be read/written like in a simple Java program. Everything has to be serialized and deserialized back and forth from the buffers. This adds to latency and also makes it less user friendly
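To make the points about primitives and about data outliving its writer a little more concrete, here is a minimal sketch that stores one record field by field in a memory-mapped file and then reads it back through a second, read-only mapping, much as another process attaching to the file might. The record layout (an int id, a float price and a char grade) and the class OffHeapRecordSketch are invented for illustration, and how promptly changes become visible across mappings is platform dependent.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical sketch: a hand-serialized record in a memory-mapped file. */
class OffHeapRecordSketch {

    // Assumed layout: int id (4 bytes) + float price (4 bytes) + char grade (2 bytes)
    static final int RECORD_BYTES = Integer.BYTES + Float.BYTES + Character.BYTES;

    /** "Serializes" the record by writing each field at a fixed offset. */
    static void write(ByteBuffer buf, int offset, int id, float price, char grade) {
        buf.putInt(offset, id);
        buf.putFloat(offset + 4, price);
        buf.putChar(offset + 8, grade);
    }

    /** "Deserializes" the record by reading the fields back in the same order. */
    static String read(ByteBuffer buf, int offset) {
        return buf.getInt(offset) + "/" + buf.getFloat(offset + 4) + "/" + buf.getChar(offset + 8);
    }

    /** Maps the first size bytes of the file, either read-write or read-only. */
    static MappedByteBuffer map(File file, boolean writable, int size) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, writable ? "rw" : "r");
             FileChannel channel = raf.getChannel()) {
            FileChannel.MapMode mode = writable
                    ? FileChannel.MapMode.READ_WRITE
                    : FileChannel.MapMode.READ_ONLY;
            return channel.map(mode, 0, size);
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("record", ".dat");
        file.deleteOnExit();

        // Writer side: every field goes in as a primitive.
        write(map(file, true, RECORD_BYTES), 0, 7, 9.5f, 'A');

        // Reader side: a fresh read-only mapping of the same file.
        System.out.println(read(map(file, false, RECORD_BYTES), 0)); // prints 7/9.5/A
    }
}
```

Even for this three-field record the offsets have to be tracked by hand, which is the usability cost mentioned above.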
Misc notes and references:
  • These tests can also be run on Windows, except for the Linux Tmpfs tests. However a RAM Drive can be used to achieve something similar



Peter Lawrey said...

Excellent article.

You could do your RandomAccessFile to a tmpfs or ramdisk filesystem. It can be faster in some write tests.

Ashwin Jayaprakash said...

@Peter the /dev/shm tests *are* using RandomAccessFile. I've said so in the description of the tests.

CruiZen said...

Ashwin, I would like to reproduce the results in a VM I have access to. However, I don't see the test programs you've used (C and Java) in your svn repository.

Ashwin Jayaprakash said...

The source is present at the URL I've mentioned above. Google Code's directory listing is not so nice - it does not show you sub-directories - https://code.google.com/p/ashwinjayaprakash/source/browse/trunk/projects/buffer_shm#buffer_shm%2Fsrc%2Fshm