Make your code more readable by using Breadth-First Programming, not Depth-First Programming

I recently finished reading a book called The Art of Readable Code by Boswell and Foucher. Overall, it was a good book on programming style – touching on many of the timeless pearls of wisdom that Steve McConnell wrote about in Code Complete, but with a more modern flair. One chapter that really piqued my interest was Chapter 10 – Extracting Unrelated Subproblems. This chapter touched on some really key points for enhancing your code’s readability: identifying unrelated subproblems, techniques for extraction, decoupling, watching out for when you inadvertently take things too far, etc.

All in all, the purpose of extracting unrelated subproblems is, as Steve McConnell put it, to reduce complexity – Software’s Primary Technical Imperative (i.e. Keep It Simple). Techniques for achieving this include using proper layers of abstraction, ensuring a method has proper cohesion, Domain-Driven Design, and of course, as discussed by Boswell and Foucher, extracting unrelated subproblems.

For me, personally, the problem of needless complexity is often encountered during code reviews (CRs). One thing that I often struggle with is reverse engineering what the code is supposed to be doing based on the code as it is written. This can be really frustrating if the author isn’t immediately available, if there is a lot of code with no obvious design doc, or if high-level methods are riddled with needless low-level details that obfuscate the code’s intent.

To combat these types of issues – the obfuscation of the code’s intent, unrelated subproblems, lack of method cohesion, and, in general, needless complexity – there is one technique that I frequently use and that really helps me: use Breadth-First programming instead of Depth-First programming.

By breadth-first programming and depth-first programming, I’m not talking about using the breadth-first and depth-first traversal algorithms throughout my code. 😝 Instead, I’m talking about an approach to how we actually think about and write the code that we write into our IDEs.

For example, suppose I am tasked with writing a web service API for a legacy client – an API that validates whether or not a booking should be accepted as legitimate. We’ve met with the stakeholders, done our design, and have come up with the following sequential tasks to be performed by our “Verification Gateway”:

  1. Transform the request XML payload into a POJO data model.
  2. Enhance the POJOs with extra information – useful for downstream processing.
  3. Call a downstream Verification Service with our enhanced POJOs for further processing – used to determine whether the booking is truly legitimate.
  4. Marshal the response back into an XML payload that the caller expects.

Regardless of how contrived this example is, this is our high-level description of what we want to do. To make our system as simple as possible, it would be great if we used the Breadth-First programming technique and wrote a simple high-level function that looks something like the following code:


public class VerificationGateway {
  public String verify(String xmlRequest) {
    BookingRequest bookingRequest = generateBookingRequestFromXml(xmlRequest);
    enhanceBookingRequest(bookingRequest);
    BookingResponse bookingResponse = verifyBooking(bookingRequest);
    String xmlResponse = generateXmlFromBookingResponse(bookingResponse);
    return xmlResponse;
  }
}

The above code has advantages that include the following:

  1. It’s easy to see the high-level intention of the code, even for an engineer who has little context when asked to do a CR.
  2. It’s cohesive – having a consistent level of abstraction.
  3. It keeps things simple.
  4. It has extracted and identified the key subproblems to be solved – like generating a POJO form of the XML payload, enhancing the payload, verifying the booking, and converting the response back into an XML payload again.

I tend to write high-level logic only when I force myself to think about the problem at a high level and to consciously rein in my own shiny object syndrome. Ironically, I find it hard to write simple high-level code.

Instead, I find that my natural unfortunate tendency is to write code like the following:


public class VerificationGateway {
  public String verify(String xmlRequest) {
    // Thought 1: I need a method that can parse an XML request and hydrate
    // a request. I'll write a method that does that first. Thus, I start writing
    // the `generateBookingRequestFromXml()` method without even finishing the
    // `verify()` method first.
    BookingRequest bookingRequest = generateBookingRequestFromXml(xmlRequest);

    // Thought 3: I need functionality that can add datum1 to the booking request.
    // Let's write that now.

    // functionality that adds datum1 to the booking request


    // Thought 4: I need functionality that can add datum2, ..., datum_n to the booking request.
    // Let's write that now.

    // functionality that adds datum2, ..., datum_n to the booking request

    // Thought 5: Now that all that's done, I know that I need to call the Verification Service,
    // so let's write that now.

    // messy code that calls the Verification Service, most of which is within the scope of the
    // top-level `verify()` method.

    // Thought 6: I need a method that can parse out the XML payload as a string from my current
    // `BookingResponse` instance. That's clearly low-level logic that isn't business logic
    // related, so let's extract that method, a simple call.
    String xmlResponse = generateXmlFromBookingResponse(bookingResponse);
    return xmlResponse;
  }

  private BookingRequest generateBookingRequestFromXml(String xmlRequest) {
    // Thought 2: Implement this method before I finish the `verify()` method, leaving a broken
    // `verify()` method until the very end of my implementation.
  }

  private String generateXmlFromBookingResponse(BookingResponse bookingResponse) {
    // Thought 7: Implement this method.
  }
}

As you can see, I added comments that give you a small glimpse into how this kind of undisciplined thinking leads to this kind of crap code. It’s also the kind of code that both I and engineers whom I respect have written before. IMO, this type of code comes from depth-first thinking and Shiny Object Syndrome. In the second code sample immediately above, my comments delineate how I’m guilty of being distracted by the next low-level problem each time I encounter one. My verify() method isn’t finished until almost the very end of the implementation. Consequently, it’s really difficult to discern the requirements from this code. Most of the key logic is now in a bloated verify() method that really ought to just be a simple high-level method that delegates each of the subproblems to a downstream method.

But when I force myself to think about a problem in a disciplined, high-level, breadth-first manner, I end up writing code that is pretty maintainable and simple. It also makes things easier for the engineers doing the CR because it’s easy to see what I’m trying to do. Sure, there are probably errors in the lower-level subproblems, but the verify() method is now so simple that it becomes easy to reason that the high-level problem is being solved correctly.
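
One practical trick that keeps me breadth-first (a minimal sketch of my own, not something from the book): write the high-level method first, and stub out every subproblem so the class still compiles while the top-level flow gets locked in. Each stub then becomes its own focused task.

public class VerificationGateway {
  public String verify(String xmlRequest) {
    BookingRequest bookingRequest = generateBookingRequestFromXml(xmlRequest);
    enhanceBookingRequest(bookingRequest);
    BookingResponse bookingResponse = verifyBooking(bookingRequest);
    return generateXmlFromBookingResponse(bookingResponse);
  }

  // Stubs keep the code compiling while the high-level flow is settled.
  // Each one gets implemented later, as its own focused, low-level task.
  private BookingRequest generateBookingRequestFromXml(String xmlRequest) {
    throw new UnsupportedOperationException("TODO: parse the XML into a BookingRequest");
  }

  private void enhanceBookingRequest(BookingRequest bookingRequest) {
    throw new UnsupportedOperationException("TODO: add data for downstream processing");
  }

  private BookingResponse verifyBooking(BookingRequest bookingRequest) {
    throw new UnsupportedOperationException("TODO: call the Verification Service");
  }

  private String generateXmlFromBookingResponse(BookingResponse bookingResponse) {
    throw new UnsupportedOperationException("TODO: marshal the response back to XML");
  }
}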

In sum, based purely on my own personal, anecdotal experience, I have found that using a breadth-first approach to programming leads to better code that’s simpler and more maintainable. I hope you find the same. Thanks for reading.

Transdermal Magnesium vs. Oral Magnesium for Sleep

TL;DR

If you’re trying to get more energy and health in your life and work by improving your sleeping habits, don’t rely on transdermal magnesium to help you sleep better. Unlike oral magnesium supplementation, it has no proven benefits.

Long-winded version

So lately I’ve been reading the book Sleep Smarter by Shawn Stevenson. Until I started researching this blog article, I felt that the book was pretty good, but my view of it has been severely tainted by that research. It definitely contains a lot of good tips on sleeping well, and it strikes home the message of why proper sleep is important. One part really piqued my interest – the part about magnesium. In Sleep Smarter (pages 57-8 of 256 on the Kindle), magnesium is referred to as “one mighty mineral” and an “anti-stress mineral” that “helps to balance blood sugar, optimize circulation and blood pressure, relax tense muscles, reduce pain, and calm the nervous system.” According to Sleep Smarter, humans are chronically deficient in magnesium, as “estimates show that upwards of 80 percent of the population in the United States is deficient in magnesium.”

In the section on magnesium supplementation, one paragraph, in particular, is noteworthy.

Again, because a large percentage of magnesium is lost in the digestive process, the ideal form of magnesium is transdermal from supercritical extracts. You can find more information on my favorite topical magnesium, Ease Magnesium, in the bonus resource guide at sleepsmarter.com/bonus.

I did a little research of my own, and I could only find one credible scientific study on the effectiveness of transdermal magnesium – a review available through the National Center for Biotechnology Information titled Myth or Reality—Transdermal Magnesium? The first sentence of the article’s abstract reads as follows:

In the following review, we evaluated the current literature and evidence-based data on transdermal magnesium application and show that the propagation of transdermal magnesium is scientifically unsupported.

Basically, this NCBI article discusses how while oral magnesium supplementation has proven benefits, transdermal magnesium does not.

I traced Shawn Stevenson’s sources and was really disappointed. In my opinion, the cited sources in his bibliography effectively boiled down to little more than personal anecdotal evidence of bloggers.

All of this raised the following question for me: If the stuff about transdermal magnesium is total bullshit, then how confident can I be in the rest of Shawn’s “facts”? Are they proven facts or are they just alternative facts? One of the biggest reasons for reading non-fiction is so that you don’t have to do the countless hours of research yourself. You, the reader, benefit from the hard-won research of others by paying a few bucks for that knowledge. In that light, I feel really let down by Sleep Smarter because I just don’t know what to believe and what not to believe. At this point, my trust in the book is greatly diminished.

The aversion to using Java’s continue keyword

I’ve recently been reading The Art of Readable Code by Dustin Boswell and Trevor Foucher. So far, it’s a decent book. The Art of Readable Code professes many of the style recommendations (such as nesting minimization) that Steve McConnell’s classic, Code Complete 2nd Ed., contains, but with a more modern flair. However, one quote, in particular, piqued my interest while reading.

In general, the continue statement can be confusing, because it bounces the reader around, like a goto inside the loop.

Basically, amongst developers, there is a fair amount of consternation around the use of Java’s continue keyword. I remember getting into an argument with a colleague and friend of mine, Neil Moonka, about this very issue back when we worked at Amazon together. The summary of the argument was that I was using Java’s continue keyword excessively in a loop – multiple times, leading to somewhat confusing code. As I recall, my counterargument was a highly legitimate one (not at all stemming from ego or insecurities) that sounded something like: “Are you f#%king kidding me!? This code is f#%king perfect!” At the time, and honestly, to this day, I really don’t see much wrong with the continue keyword in and of itself. IMO, it’s a nice complement to Java’s break keyword, and both keywords have their places. Unlike break, which jumps out of the loop, continue just jumps back up to the top of the loop. In fact, using the continue keyword is a nice way to avoid excess nesting inside your loops. I have a tendency to use continue as the equivalent of an early return statement inside loops.
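
To illustrate the nesting point, here is a contrived sketch of my own (Order, isReady(), and process() are hypothetical names): the two loops are equivalent, but the continue version keeps the happy path at a single indentation level.

// Nested version: the interesting work keeps drifting to the right.
for (Order order : orders) {
  if (order != null) {
    if (order.isReady()) {
      process(order);
    }
  }
}

// continue-as-early-return version: guard clauses first, flat happy path.
for (Order order : orders) {
  if (order == null || !order.isReady()) {
    continue;
  }
  process(order);
}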

That being said, this points to a potentially deeper issue with code that uses the continue keyword excessively. Namely, the issue is that methods that use the continue keyword excessively tend to lack cohesion. The level of abstraction of a method outside a loop tends to be different than the level of abstraction of what’s inside the loop. Since the method simultaneously deals with these two levels of abstraction, it lacks cohesion by definition.

To solve this problem, you can just refactor your code such that the innards of your loop’s logic are extracted into their own lower-level method.

For example, consider the following code:

for (Item item : items) {
  if (item == null || newState == null) {
    continue;
  }
  if (!item.isValid() || !State.isValid(newState)) {
    continue;
  }
  item.setState(newState);
}

Admittedly, the above loop is straight-line code, going directly from top to bottom without going diagonally. However, the higher level method that you don’t see could likely be made more cohesive by extracting the loop’s innards out into a new method. For example, consider the following revised code:

  for (Item item : items) {
    updateState(item, newState);
  }
  
  // ..
}

private void updateState(Item item, State state) {
  if (item == null) {
    return;
  }
  if (!item.isValid()) {
    return;
  }
  if (state == null) {
    return;
  }
  if (!State.isValid(state)) {
    return;
  }
  item.setState(state);
}

With this new version of the code, the purpose of the inside of the loop is fairly obvious. Moreover, the main method doesn’t need to be concerned with the lower-level details of the prerequisites for setting the state. Thus, the main method maintains its cohesion instead of being degraded by those low-level details.

So while the continue keyword is useful, excessive use of it can be off-putting. When such occurrences arise, consider refactoring your code such that the loop’s logic is neatly encapsulated within a new method.

How coding in Java using the final keyword could have saved an airline

I’ve been reading the new version of Michael Nygard‘s book, Release It, 2nd Ed. I read the first version of the book shortly after I first became an engineer, and it impacted my work greatly. Thus far, the quality of the second edition is on par with the first. One of the first things that Nygard discusses is “The Exception that grounded an airline” – the name of chapter 2. In this chapter, Nygard discusses a software bug that caused immense trouble for the airline – an issue with the reservation system that caused several flights to be grounded. Nygard shows the buggy code, what was done to fix the issue, and what patterns could have been employed to prevent the issue in the first place. Funnily enough, though, none of the large-scale patterns of Release It needed to be employed to prevent the airline from being grounded. Believe it or not, all that needed to be done to prevent the issue was for the Java developers who wrote the system to have properly used Java’s final keyword.

At the crux of the issue was the following code:

Connection conn = null;
Statement stmt = null;
try {
    conn = connectionPool.getConnection();
    stmt = conn.createStatement();

    // Do the lookup logic
    // return the list of results
} finally {
    if (stmt != null) {
        stmt.close();
    }
    if (conn != null) {
        conn.close();
    }
}

Nygard goes on to discuss how this code can be fixed by paying attention to the exceptions thrown by close() methods. But, as stated above, the code could also have been trivially fixed by using the final keyword judiciously.

All too often, I see code like this:


void queryDb(final String queryParam) {
    Connection conn = null;
    Statement stmt = null;

    // do stuff with `conn`, `stmt`, and `queryParam`, and conditionally set
    // `conn` and `stmt` to something non-null.
}

The only variables that really needed to be final in the above method were conn and stmt. Making queryParam final was pointless. By making conn and stmt final, we make it easy to reason about their state: a final variable can only ever hold its initial value, so nothing else is possible (and it’s never null unless it was initialized to null). By making queryParam final, all that really happens is noise pollution. We already know that queryParam is always going to have the value passed into the method; it’s a trivial parameter that never changes. So, in my opinion, using the final keyword on such variables just adds noise. Really, we want conn and stmt to be final because, ideally, those variables should only get set once. With that in mind, consider the following code:

public void getMyDataFinally() throws SQLException {
    final Connection conn = this.connectionFactory.newConnection();
    try {
        getMyDataFinallyWithConnection(conn);
    } finally {
        conn.close();
    }
}

private void getMyDataFinallyWithConnection(Connection conn) throws SQLException {
    final Statement statement = conn.createStatement();
    try {
        // Do the lookup logic
        // return the list of results
    } finally {
        statement.close();
    }
}

The above code fixes the issue present in Nygard’s version, without any null checks and, most importantly, without paying attention to which exceptions are thrown by the close() methods – the very thing that grounded the airline. All that needed to happen was to make the connection and statement variables final, and to refactor the code in such a way that the compiler could guarantee that each one is set exactly once.

Note that the code is now more verbose – we have two methods, not just one. However, there are two benefits to using this approach:

  1. The code is easier to follow. There are no null-checks. We don’t need to reason about the state and whether or not conn and stmt have been set. We don’t need to worry about whether a variable has been set properly before we operate on it.
  2. The new methods are more cohesive – each one operates at a single level of abstraction. The getMyDataFinally() method operates at the level of abstraction pertaining to the connection, whereas the getMyDataFinallyWithConnection() method operates at the abstraction level dealing with the statement. Granted, it’s a pretty trivial difference. Nonetheless, this example demonstrates how making appropriate use of the final keyword forces one to factor code in such a way that methods are more cohesive than they might otherwise be.
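
As an aside of my own (not part of Nygard’s discussion): on Java 7 and later, try-with-resources gives the same set-once, always-closed guarantee with even less ceremony, and the resource variables can still be declared final. A minimal sketch, reusing the connectionFactory from the example above:

public void getMyData() throws SQLException {
    try (final Connection conn = this.connectionFactory.newConnection();
         final Statement stmt = conn.createStatement()) {
        // Do the lookup logic
        // return the list of results
    }
    // Both stmt and conn are closed automatically, in reverse order,
    // even if the body or one of the close() calls throws.
}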

All in all, I really enjoy Nygard’s work. It’s an updated version of an industry standard with tons of wisdom. But, from my point of view, it’s interesting to think about how the final keyword could have saved millions of dollars for this airline. If you don’t already make good use of the final keyword in your Java code, it could probably save you from coding some bugs as well.

Notes from Gil Tene’s Talk: Understanding Java Garbage Collection and what you can do about it

This blog post has been a long time in the making. I’ve been a fan of Gil’s for some time and I’ve been meaning to go through this video and create notes, since I needed to really study this video for work. It’s been a while since I watched this and so I thought that these notes would be useful for somebody out there, not the least of whom is me. 🙂

If you’re a fan of the video, Understanding Java Garbage Collection and what you can do about it, I hope you will appreciate the notes below.

Understanding Java Garbage Collection and what you can do about it

This Talk’s Purpose / Goals

  • This talk is focussed on GC education
  • This is not a “how to use flags to tune a collector” talk
  • This is a talk about how the “GC machine” works
  • Purpose: Once you understand how it works, you can use your own brain…
  • You’ll learn just enough to be dangerous…
  • The “Azul makes the world’s greatest GC” stuff will only come at the end.

High level agenda

  • GC fundamentals and key mechanisms
  • Some GC terminology & metrics
  • Classifying currently available collectors
  • The “Application Memory Wall” problem
  • The C4 collector: What an actual solution looks like

Memory use

  • People using < 1/2GB largely for phones
  • Most people have heap sizes >= 1GB < 10GB, irrespective of application

Why should you care about GC?

Start with…

What is Garbage Collection good for?

  • Prevalent in modern languages and platforms
  • Productivity, stability
    • Programmers not responsible for freeing and destroying objects
      • GC frees up time to debug logic code, as opposed to lower level concerns like correctly calling destructors
    • Eliminates entire (common) areas of instability, delay, maintenance
  • Guaranteed interoperability
    • No “memory management contract” needed across APIs
      • Garbage collection implies a memory management contract common to all objects in the heap.
    • Uncoordinated libraries, frameworks, utilities seamlessly interoperate.
  • Facilitates practical use of large amounts of memory
    • Complex and intertwined data structures, in and across unrelated components
    • Interesting concurrent algorithms become practical…

Why should you understand (at least a little) how GC works?

The story of the good little architect

  • A good architect must, first and foremost, be able to impose their architectural choices on the project…
  • Early in Azul’s concurrent collector days, we encountered an application exhibiting 18 second pauses
    • Upon investigation, we found the collector was performing tens of millions of object finalizations per GC cycle
      • We have since made reference processing fully concurrent…
  • Every single class written in the project has a finalizer
    • The only work the finalizers did was nulling every reference field (see the sketch below)
  • The right discipline for a C++ ref-counting environment
    • The wrong discipline for a precise garbage collected environment
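
To make the anti-pattern concrete, here is a hedged sketch of my own (the class and its fields are hypothetical, reconstructed from the description above):

import java.util.List;

// A finalizer whose only job is nulling out reference fields. That discipline
// makes sense in a C++-style ref-counting world, but in a precise, tracing
// collector it buys nothing (the collector finds dead objects anyway) while
// forcing every single instance through finalization/reference processing.
public class CustomerRecord {
  private Customer customer;   // hypothetical field
  private List<Order> orders;  // hypothetical field

  @Override
  protected void finalize() throws Throwable {
    try {
      customer = null;  // pointless in a garbage-collected heap
      orders = null;    // pointless in a garbage-collected heap
    } finally {
      super.finalize();
    }
  }
}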

Most of what People seem to “know” about Garbage Collection is wrong

  • In many cases, it’s much better than you may think
    • GC is extremely efficient. Much more so than malloc()
    • Dead objects cost nothing to collect
    • GC will find all the dead objects (including cyclic graphs)
  • In many cases, it’s much worse than you may think
    • Yes, it really does stop for ~1 sec per live GB (in most JVMs)
    • No, GC does not mean you can’t have memory leaks
    • No, those pauses you eliminated from your 20 minute test are not gone
      • Need to be aware that your tuning is likely just delaying the inevitable pause that will happen – possibly making it worse than before when it does actually happen

Trying to solve GC problems in application architecture is like throwing knives

  • You probably shouldn’t do it blindfolded
  • It takes practice and understanding to get it right
  • You can get very good at it, but do you really want to?
    • Will all the code you leverage be as good as yours?
  • Examples:
    • Object pooling
    • Off heap storage
    • Distributed heaps
    • (In most cases, you end up building your own garbage collector)

Some GC Terminology (IMPORTANT SLIDES) !!!

A Basic Terminology example: What is a concurrent collector?

  • A Concurrent Collector performs garbage collection work concurrently with the application’s own execution
    • Concurrency exists relative to application
  • A Parallel Collector uses multiple CPUs to perform garbage collection
    • For a parallel collector, concurrency exists relative to itself (i.e. it has more than one thread running concurrently), but concurrency doesn’t necessarily exist relative to application (i.e. GC threads executing might require suspending ALL application threads).
    • The terms “Parallel” and “Concurrent” with respect to garbage collection are orthogonal terms.
  • A Stop-the-World collector performs garbage collection while the application is completely stopped
    • STW is opposite of concurrent
  • An Incremental collector performs a garbage collection operation or phase as a series of smaller discrete operations with (potentially long) gaps in between
    • Monolithic collection is effectively executing all incremental steps at once without letting application run in between increments.
  • Mostly means sometimes it isn’t (usually means a different fallback mechanism exists)
    • Need to read it the opposite way
    • Mostly concurrent means sometimes STW.
    • Mostly incremental means sometimes Monolithic.
    • Mostly parallel means sometimes it’s not.

Precise vs. Conservative Collection

  • A Collector is Conservative if it is unaware of some object references at collection time, or is unsure about whether a field is a reference or not (or pointer vs. integer).
    • A conservative collector can’t safely move objects around, because it can’t reliably identify and update every reference that points to them.
    • Conservative collectors almost always run into fragmentation issues.
  • A collector is Precise (aka Accurate) if it can fully identify and process all object references at the time of collection
    • A collector MUST be precise in order to move objects
    • The COMPILERS need to produce a lot of information (OOP Maps)
  • All commercial server JVMs use precise collectors

Safepoints

  • A GC Safepoint is a point or range in a thread’s execution where the collector can identify all the references in that thread’s execution stack.
    • Safepoint and GC Safepoint are often used interchangeably.
    • But there are other types of safe points, including ones that require more information than a GC safe point does (e.g. de-optimization).
  • “Bringing a thread to a safe point” is the act of getting a thread to reach a safe point and not execute past it.
    • Close to, but not exactly the same as “stop at a safepoint”
      • e.g. JNI: a thread can keep running inside JNI code, but it can’t cross back past the safepoint
    • Safepoint opportunities are (or should be) frequent (i.e. micro-second frequent) (e.g. method boundaries and back-edges of loops).
  • In a Global Safepoint all threads are at a Safepoint
    • Global safe points are where STW pauses occur.

What’s common to all precise GC mechanisms?

  • Identify the live objects in the memory heap
  • Reclaim resources held by dead objects
  • Periodically relocate live objects
  • Examples:
    • Mark/Sweep/Compact (common for Old generations)
    • Copying collector (common for Young Generations)

Mark (aka “Trace”)

  • Start from “roots” (thread stack, statics, etc.)
  • “Paint” everything you can reach from “root” as “live”
  • At the end of a mark pass:
    • all reachable objects will be marked as “live”
    • all non-reachable objects will be marked “dead” (aka “non-live”)
  • Note: work is generally linear to “live set” (not size of heap) (i.e. if I’m only using 2GB of a 4GB heap, then marking only scans the 2GB of live objects, not the 4GB of the heap) – see the sketch below.
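
A conceptual sketch of a mark/trace pass (my own illustration, not how any particular JVM implements it), using an explicit worklist rather than recursion:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class Marker {
  // Conceptual heap object: a mark bit plus its outgoing references.
  static final class HeapObject {
    boolean marked;
    List<HeapObject> references = new ArrayList<>();  // assumed non-null, no null entries
  }

  // "Paint" everything reachable from the roots as live.
  static void mark(List<HeapObject> roots) {
    Deque<HeapObject> worklist = new ArrayDeque<>(roots);
    while (!worklist.isEmpty()) {
      HeapObject obj = worklist.pop();
      if (obj.marked) {
        continue;          // already visited (this is what handles cyclic graphs)
      }
      obj.marked = true;   // live
      for (HeapObject ref : obj.references) {
        if (!ref.marked) {
          worklist.push(ref);
        }
      }
    }
    // Anything still unmarked afterwards is unreachable, i.e. dead.
  }
}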

Sweep

  • Scan through the heap, identify “dead” objects and track them somehow
    • usually in some form of free list
  • Note: work is generally linear to heap size (not live set – unlike Mark phase).

Compact

  • Over time, heap will get “swiss cheesed”: contiguous dead space between objects may not be large enough to fit new objects (aka “fragmentation”)
    • Conservative collectors encounter this issue.
  • Compaction moves live objects together to reclaim contiguous empty space (aka “relocate”)
  • Compaction has to correct all object references to point to new object locations (aka “remap”)
    • Remapping is what makes compaction slow – because we need to update every other live object that points to a moved object so that it points to the new location.
    • You want to move a lot of objects at once, so that all the references from existing live objects can be fixed up in one pass rather than compacting (and remapping) many separate times.
  • Note: work is generally linear to live set

Copy

  • A copying collector moves all live objects from a “from” space to a “to” space and reclaims the “from” space.
    • Single pass operation
  • At start of copy, all objects are in “from” space and all references point to “from” space.
  • Start from “root” references, copy any reachable object to “to” space, correcting references as we go.
  • At end of copy, all objects are in “to” space, and all references point to “to” space.
  • Note: Work is generally linear to live set

Mark/Sweep/Compact, Copy, Mark/Compact

  • Copy requires twice the max size of live set in order to be reliable.
    • From space gets potentially all moved to To space without any removals.
  • Mark/Compact (no sweep) (typically) requires twice the max live set size in order to fully recover garbage in each cycle.
  • Mark/Sweep/Compact doesn’t need extra memory (just a little) because it happens in-place.
  • Copy and Mark/Compact are linear to live set size
  • Mark/Sweep/Compact is linear (in sweep) to heap size
  • Mark/Sweep/(Compact) may be able to avoid some moving work
  • Copying is typically monolithic

Generational Collection

  • Based on Weak Generational Hypothesis: Most objects die young
  • Focus collection efforts on young generation:
    • Use a moving collector: work is linear to live set
    • The live set in the young generation is a small % of the space of the heap
    • Promote objects that live long enough to older generations
  • Only collect older generations as they fill up
    • “Generational filter” reduces rate of allocation into older generations
  • Tends to be (order of magnitude) more efficient
    • Great way to keep up with high allocation rate
    • Practical necessity for keeping up with processor throughput

Generational Collection

  • Requires a “Remembered set”: a way to track all references into the young generation from the outside.
  • Remembered set is also part of “roots” for young generation collection
  • No need for twice the live set size: Can “spill over” to old generation
  • Usually want to keep surviving objects in young generation for a while before promoting them to the old generation.
    • Immediate promotion can significantly reduce generational filter efficiency.
    • Waiting too long to promote can eliminate generational benefits.

How does the remembered set work?

  • Generational collectors require a “Remembered set”: a way to track all references into the young generation from the outside.
  • Each store of a NewGen reference into an OldGen object needs to be intercepted and tracked
  • Common technique: “Card Marking”
    • A bit (or byte) indicating a word (or region) in OldGen is “suspect”.
  • Write barrier used to track references
    • Common technique (e.g. HotSpot): blind stores on reference write
    • Variants: precise vs. imprecise card marking, conditional vs. non-conditional (see the conceptual sketch below)
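
A conceptual sketch of a blind card-marking write barrier (my own illustration of the idea; real JVMs emit this in compiled code, and the constants here are made-up assumptions):

final class CardTable {
  private static final int CARD_SHIFT = 9;   // assume 512-byte cards
  private static final byte DIRTY = 0;       // arbitrary "dirty" marker value

  private final byte[] cards;                // one byte per card of the old generation
  private final long heapBase;               // address where the heap starts

  CardTable(long heapBase, long heapSizeBytes) {
    this.heapBase = heapBase;
    this.cards = new byte[(int) (heapSizeBytes >>> CARD_SHIFT) + 1];
  }

  // "Blind store" barrier: run on every reference store (obj.field = value),
  // unconditionally marking the card that covers the updated object as suspect.
  // A young-gen collection later scans only the dirty cards for old-to-young refs.
  void onReferenceStore(long objAddress) {
    cards[(int) ((objAddress - heapBase) >>> CARD_SHIFT)] = DIRTY;
  }
}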

The typical combos in commercial server JVMs

  • Young generation usually uses a copying collector
  • Young generation is usually monolithic (STW)
  • Old generation usually uses Mark/Sweep/Compact
  • Old generation may be STW, or Concurrent, or mostly-Concurrent, or Incremental-STW, or mostly-Incremental-STW

Useful terms for discussing garbage collection

  • Mutator
    • Your program…
  • Parallel
    • Can use multiple CPUs
  • Concurrent
    • Runs concurrently with program
  • Pause
    • A time duration in which the mutator is not running any code
  • Stop-The-World (STW)
    • Something that is done in a pause
  • Monolithic Stop-The-World
    • Something that must be done in its entirety in a single pause
    • Note: This is usually the noticeable pause people see in apps.
  • Generational
    • Collects young objects and long lived objects separately.
  • Promotion
    • Allocation into old generation
  • Marking
    • Finding all live objects
  • Sweeping
    • Locating the dead objects
  • Compaction
    • Defragments heap
    • Moves objects in memory
    • Remaps all affected references
    • Frees contiguous memory regions

Useful metrics for discussing garbage collection

  • Heap population (aka Live set)
    • How much of your heap is alive
  • Allocation rate
    • How fast you allocate
  • Mutation rate
    • How fast your program updates references in memory
  • Heap Shape
    • The shape of the live set object graph
    • Hard to quantify as a metric…
  • Object Lifetime
    • How long objects live
  • Cycle time
    • How long it takes the collector to free up memory
  • Marking time
    • How long it takes the collector to find all live objects
  • Sweep time
    • How long it takes to locate dead objects
    • Relevant for Mark-Sweep
  • Compaction time
    • How long it takes to free up memory by relocating objects
    • Relevant for Mark-Compact

Empty memory and CPU/throughput

Heap Size vs. GC CPU

Two Intuitive limits

  • If we had exactly 1 byte of empty memory at all times, the collector would have to work “very hard”, and GC would take 100% of the CPU time
  • If we had infinite empty memory, we would never have to collect, and GC would take 0% of the CPU time
  • The CPU% consumed by GC falls off steeply as you add memory – roughly inversely proportional to the amount of empty memory.

Empty memory needs (empty memory == CPU power)

  • The amount of empty memory in the heap is the dominant factor controlling the amount of GC work
  • For both Copy and Mark/Compact collectors, the amount of work per cycle is linear to live set
  • The amount of memory recovered per cycle is equal to the amount of unused memory (heap size) – (live set).
  • The collector has to perform a GC cycle when the empty memory runs out
  • A Copy or Mark/Compact collector’s efficiency doubles with every doubling of the empty memory (see the short sketch below).
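
A tiny back-of-the-envelope sketch of my own summarizing the bullets above (units are arbitrary; GB used for concreteness):

final class GcCostModel {
  // (work per cycle) / (bytes recovered per cycle) = GC work per byte the app allocates.
  // Work per cycle is proportional to the live set; each cycle recovers (heap - live set).
  static double gcWorkPerAllocatedByte(double liveSetGb, double heapSizeGb) {
    double emptyMemoryGb = heapSizeGb - liveSetGb;  // memory recovered per cycle
    return liveSetGb / emptyMemoryGb;               // halves whenever empty memory doubles
  }
}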

What empty memory controls

  • Empty memory controls efficiency (amount of collector work needed per amount of application work performed).
  • Empty memory controls the frequency of pauses (if the collector performs any STW operations).
  • Empty memory DOES NOT control pause times (only their frequency).
  • In Mark/Sweep/Compact collectors that pause for sweeping, more empty memory means less frequent but LARGER pauses.
  • This is bad for web services that need to be highly available.

Concurrent Marking

  • Mark all reachable objects as “live”, but object graph is “mutating” under us.
  • Classic concurrent marking race: mutator may move reference that has not yet been seen by the marker into an object that has already been visited.
    • If not intercepted or prevented in some way, will corrupt the heap.
  • Example technique: track mutations, multi-pass marking
    • Track reference mutations during mark (e.g. in card table)
    • Revisit all mutated references (and track new mutations)
    • When set is “small enough”, do a STW catch up (mostly concurrent)
  • Note: work grows with mutation rate, may fail to finish

Incremental Compaction

  • Track cross-region remembered sets (which region points to which)
  • To compact a single region, only need to scan regions that point into it to remap all potential references
  • Identify region sets that fit in limited time
    • Each such set of regions is a STW increment
    • Safe to run application between (but not within) increments
    • Incremental compaction is sensitive to the popular objects (i.e. if a region contains a popular object, referenced by many regions, we won’t be able to find many regions that are only pointed to by a small number of other regions).
  • Note: work can grow quadratically relative to heap size.
    • The number of regions pointing into a single region is generally linear to the heap size (the number of regions in the heap)

Delaying the inevitable

  • Some form of copying/compaction is inevitable in practice
    • And compacting anything requires scanning/fixing all references to it
  • Delay tactics focus on getting “easy empty space” first
    • This is the focus for the vast majority of GC tuning
  • Most objects die young (Generational)
    • So collect young objects only, as much as possible. Hope for short STW pause that occurs infrequently.
    • But eventually, some old dead objects must be reclaimed.
  • Most old dead space can be reclaimed without moving it
    • Track dead space in lists, and reuse it in place (e.g. CMS)
    • But eventually, space gets fragmented and needs to be moved/defragged
  • Much of the heap is not “popular” (e.g. G1, “Balanced”)
    • A non-popular region will only be pointed to from a small percentage of the heap.
    • So non-popular regions can be compacted in short STW pauses
    • But eventually popular objects and regions need to be compacted.

Classifying Common Collectors

The typical combos in commercial server JVMs

  • Young generation usually uses a copying collector
    • Young generation GC is usually monolithic, STW GC.
  • Old generation usually uses a Mark/Sweep/Compact collector.
    • Old generation may be STW or Concurrent or mostly-Concurrent or Incremental STW or mostly-Incremental STW

Hotspot ParallelGC Collector mechanism classification

  • Monolithic STW copying NewGen
  • Monolithic STW Mark/Sweep/Compact OldGen

HotSpot ConcMarkSweepGC (aka CMS) Collector mechanism classification

  • Monolithic STW copying NewGen (ParNew)
  • Mostly Concurrent, non-compacting OldGen (CMS)
    • Mostly concurrent marking
      • Mark concurrently while mutator is running (previously discussed multi-pass technique)
      • Track mutations in card marks
      • Revisit mutated cards (repeat as needed)
      • STW to catch up on mutations, ref processing, etc.
    • Concurrent Sweeping
    • Does not Compact (maintains free list, does not move objects)
  • Fallback to Full Collection (Monolithic STW)
    • Used for compaction, etc.
    • When you see a promotion failure message or “concurrent mode failure”, it means that the CMS NewGen collector can’t promote an object from NewGen to OldGen because OldGen doesn’t have a swiss-cheese hole big enough to accommodate it, so it falls back to a full STW collection to clean up (and compact) OldGen.

HotSpot G1GC (aka “Garbage First”) Collector mechanism classification

  • Monolithic STW copying NewGen
  • Mostly concurrent OldGen marker (slightly different from CMS)
    • Mostly concurrent marking
      • STW to catch up on mutations, reference processing, etc.
    • Tracks inter-region relationships in remembered sets
  • STW mostly incremental compacting OldGen (instead of trying to be concurrent like CMS)
    • Objective: “Avoid, as much as possible, having a Full GC…”
    • Compact sets of regions that can be scanned in limited time
    • Delay compaction of popular objects, popular regions
  • Fallback to Full Collection (Monolithic STW)
    • Used for compacting popular objects, popular regions, etc.

The “Application Memory Wall”

Memory Use

  • Why are we all using between 1GB and 4GB?

Reality check: servers in 2012

  • Retail prices, major web server store (USD$, Oct. 2012)
    • 24 vCore, 128GB server ~= $5k (most of us use less than half of this server)
    • 24 vCore, 256GB server ~= $8k
    • 32 vCore, 384GB server ~= $14k
    • 48 vCore, 512GB server ~= $19k
    • 64 vCore, 1TB server ~= $36k
    • Cheap (< $1/GB/month), and roughly linear up to ~1TB
  • 10s to 100s of GB/sec of memory bandwidth
  • If no apps need that much memory, nobody will build the servers.

The Application Memory Wall – a simple observation:

  • Application instances appear to be unable to make effective use of modern server memory capacities.
  • The size of application instances as a % of a server’s capacity is rapidly dropping.
  • This is happening at approx. Moore’s Law rates.

How much memory do applications need?

  • Famous quote (that Bill G. didn’t actually say) “640KB ought to be enough for anybody”.
    • WRONG!
  • So what’s the right number?
    • 6.4MB?
    • 64MB?
    • 640MB?
    • 6.4GB?
    • 64GB?
  • There is no right number
  • Target moves at 50x-100x per decade

“Tiny” application history

  • “Tiny” means that it’s silly to break it into pieces (chart: Tiny Application History)

What is causing the Application Memory Wall?

  • Garbage Collection is a clear and dominant cause
  • There seems to be practical heap size limits for applications with responsiveness requirements.
    • We build apps that respond to users; we can’t tolerate a 3-minute-long STW pause.
    • Certainly with ms SLA web services, we need to keep heap size small.
  • [Virtually] All current commercial JVMs will exhibit a multi-second pause on a normally utilized 2-6GB heap.
    • It’s a question of “When” and “How often”, not “If”.
    • GC tuning only moves the “when” and the “how often” around.
  • Root cause: The link between scale and responsiveness.

What quality of GC is responsible for the Application Memory Wall?

  • It is NOT about overhead or efficiency:
    • CPU utilization, bottlenecks, memory consumption and utilization
  • It is NOT about speed
    • Average speeds, 90%, 95% speeds, are all perfectly fine
  • It is NOT about minor GC events (right now)
    • GC events in the 10s of msec are usually tolerable for most apps
  • It is NOT about the frequency of very large pauses
  • It is ALL about the worst observable pause behavior
    • People avoid building/deploying visibly broken systems

GC Problems

Framing the discussion: Garbage Collection at modern server scales

  • Modern servers have hundreds of GB of memory
  • Each modern x86 core (when actually used) produces garbage at a rate of 0.25 to 0.5+ GB/sec
  • That’s many GB/sec of allocation in a server
  • Monolithic STW operations are the cause of the current Application Memory Wall
    • Even if they are done “only a few times a day”

One way to deal with Monolithic-STW GC

  • Stick your head in the sand.

Another way to cope: Use “Creative Language”

  • “Guarantee a worst case of X msec, 99% of the time”
  • “Mostly” Concurrent, “Mostly” Incremental
    • Translation: “Will at times exhibit long monolithic STW pauses”
  • “Fairly Consistent”
    • Translation: “Will sometimes show results well outside this range”
  • “Typical pauses in the tens of msecs”
    • Translation: “Some pauses are much longer than tens of msecs”

Actually measuring things: jHiccup
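
(A usage note of my own, based on the tool’s documentation rather than the talk: jHiccup is typically attached as a Java agent and records the “hiccups” – pauses the platform imposes on an otherwise idle thread – which you can then plot as a percentile distribution.)

  • e.g. java -javaagent:jHiccup.jar -Xmx4g MyApp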

How can we break through the Application Memory Wall?

We need to solve the right problems

  • Focus on the causes of the Application Memory Wall
    • Scale is artificially limited by responsiveness
  • Responsiveness must be unlinked from scale:
    • Heap size, Live Set size, Allocation rate, Mutation rate
    • Responsiveness must be continually sustainable
    • Can’t ignore “rare” events
  • Eliminate all STW Fallback
    • At modern server scales, any STW fall back is a failure.

The things that seem “hard” to do in GC

  • Robust concurrent marking
    • References keep changing
    • Multi-pass marking is sensitive to mutation rate
    • Weak, soft, final references are “hard” to deal with concurrently.
  • [Concurrent] Compaction…
    • It’s not the moving of the objects…
    • It’s the fixing of all those references that point to them
    • How do you deal with a mutator looking at a stale reference?
    • If you can’t, then remapping is a [monolithic] STW operation
  • Young Generation collection at scale
    • Young Generation collection is generally monolithic STW
    • Young Generation pauses are only small because heaps are tiny
    • A 100GB heap will regularly have GB of live young stuff

The problems that need solving (areas where the state of the art needs improvement)

  • Robust Concurrent Collecting
    • In the presence of high mutation and allocation rates
    • Cover modern runtime semantics (e.g. weak refs, lock deflation)
  • Compaction that is not monolithic STW
    • e.g. stay responsive while compacting 0.25TB heaps
    • Must be robust: not just a tactic to delay STW compaction
    • Current “incremental STW” attempts fall short on robustness
  • Young-Gen collection that is not monolithic STW
    • Stay responsive while promoting multi-GB data spikes
    • Concurrent or “incremental STW” may both be ok
    • Surprisingly little work done in this specific area

Azul’s “C4” Collector – Continuously Concurrent Compacting Collector

  • Concurrent guaranteed-single-pass marker
    • Oblivious to mutation rate
    • Concurrent ref (weak, soft, finalizer) processing
  • Concurrent Compactor
    • Objects moved without stopping mutator
    • References remapped without stopping mutator
    • Can relocate entire generation (New, Old) in every GC cycle
  • Concurrent compacting old generation
  • Concurrent compacting new generation
  • No STW fallback
    • Always compacts and does so concurrently

C4 Algorithm Highlights

  • Same core mechanism used for both generations
    • Concurrent Mark-Compact
  • A loaded value barrier (LVB) is central to the algorithm
    • Every heap reference is verified as “sane” when loaded
    • “Non-sane” refs are caught and fixed in a self-healing barrier
  • Refs that have not yet been “marked through” are caught
    • Guaranteed single pass concurrent marker
  • Refs that point to relocated objects are caught
    • Lazily (and concurrently) remap refs, no hurry
    • Relocation and remapping are both concurrent
  • Uses “quick release” to recycle memory
    • Forwarding information is kept outside of object pages
    • Physical memory released immediately upon relocation
    • “Hand-over hand” compaction without requiring empty memory

Sample responsiveness behavior

  • Not seeing efficiency, just response time
  • You want a nice flat plateau (chart: Sample Responsiveness Behavior)
  • SpecJBB + Slow churning 2GB LRU Cache
  • Live set is ~2.5GB across all measurements
  • Allocation rate is ~1.2GB/sec across all measurements

Zing

  • A JVM for Linux/x86 servers
  • ELIMINATES Garbage Collection as a concern for enterprise applications
  • Decouples scale metrics from response time concerns
    • Transaction rate, data set size, concurrent users, heap size, allocation rate, mutation rate, etc.
  • Leverages elastic memory for resilient operation

Sustainable Throughput: The throughput achieved while safely maintaining service levels

  • Key Takeaway: Because Zing/C4 is truly concurrent and compacting, the cap on heap size (i.e. the application memory wall) largely crumbles, enabling us to make the heap a much greater fraction of the total memory available on a host.

Instance capacity test: “Fat Portal” – HotSpot CMS: Peaks at ~3GB / 45 concurrent users

(chart: Fat Portal HotSpot)

Instance capacity test: “Fat Portal” C4: still smooth @800 concurrent users

  • Used a 50GB heap – no need to wait for multi-second STW pauses (chart: Fat Portal Zing)
  • Zing is just far faster

Java GC tuning is “hard”

  • Tuning is also prone to degrading performance as newer JVM versions replace the ones you tuned for

The complete guide to Zing GC tuning

  • java -Xmx40g

What you can expect (from Zing) in the low latency world

  • Assuming individual transaction work is “short” (on the order of 1msec), and assuming you don’t have hundreds of runnable threads competing for 10 cores…
  • “Easily” get your application to <10msec worst case
  • With some tuning, 2-3 msec worst case
  • Can go to below 1msec worst case…
    • May require heavy tuning/tweaking
    • Mileage WILL vary

Tuned Oracle HotSpot (highly tuned to use only the NewGen collector, restarting the JVM every night so that there is no major GC) vs. Zing

(chart: HotSpot vs. Zing)

Azul initiative to Support Open Source Community

  • Free access to Zing JVM for Open Source developers and projects
  • For use in development, qualification, and testing
  • For x86 servers running Red Hat Enterprise Linux, SUSE Linux Enterprise Server, CentOS and Ubuntu Linux

Virtual UPSs in the Cloud… Are they actually a real thing?

Cloud Computing: Concepts, Technology & Architecture

As I’ve stated in a previous post, I’m currently reading “Cloud Computing: Concepts, Technology & Architecture” by Thomas Erl. It’s a decent read, but one part had me puzzled in Sec. 5.4: Virtualization Technology:

Virtualization is the process of converting a physical IT resource into a virtual IT resource.

Most types of IT resources can be virtualized, including:

Power – A physical UPS and power distribution units can be abstracted into what are commonly referred to as virtual UPSs.

I’ve honestly never heard of a “Virtual UPS” before. I tried a bunch of Google searches, but to no avail. There’s nothing about such things in Amazon’s AWS docs that I could find, either, or in its console for that matter. All I could find were red-herring references to things like a type of UPS that could make sure that your VMs were kept up and running – and sadly, that was the most relevant search result. I also had to filter out everything related to the United Parcel Service and Universal Power Supplies (you know, those adapters you use when you travel to Europe to charge your phone), as opposed to the Uninterruptible Power Supplies that ensure momentary brownouts don’t take out your servers or the first desktop you assembled back in high school. All in all, I don’t know why the author put this in there; it’s certainly not a common term.

Anyways, the key premise of this brief post is that technologies such as Virtual UPSs really aren’t useful in the context of the cloud. You have a virtual host or a container; then it goes down. Such scenarios can happen because of your shitty code, the web server code, the virtualization software layer itself, random transient errors, etc. Bottom line: who cares, so long as you can spin up another instance that works? One of the big benefits of the cloud is that it lets us treat hosts as a commodity that can go down intermittently from transient issues but can be brought back up again quickly and automagically by provisioning a new instance. At that layer of abstraction, low-level notions like a virtual uninterruptible power supply simply don’t have any use or meaning in this new paradigm.

Practical steps to achieve independent deployments between Java web services

Recently, I’ve been reading a new book called Building Microservices: Designing Fine-Grained Systems, by Sam Newman. It’s a good book so far and this is my second article pertaining to the book. One of the things that the author discusses on page 30 is the importance of loose coupling between microservices.

When services are loosely coupled, a change to one service should not require a change to another. The whole point of a microservice is being able to make a change to one service and deploy it, without needing to change any other part of the system. This is really quite important.

IMO, the author’s statement is perfectly valid. But, in this high level book, the author doesn’t get into practical code level recommendations. As such, I thought that I would discuss some practical tactics for achieving truly independent deployments for the Java RESTful JSON microservice world.

  1. Client-Side Independence: Client-side independence happens when you’re working on a client that calls the Java service and you can add new data to the service’s JSON request payload without actually changing anything on the server side beforehand. This is useful, especially when the two codebases involved are owned by two different teams. Perhaps the server-side coders aren’t ready with their implementation yet. In order to achieve this type of independence, you really want to ensure that the JSON POJOs being deserialized on the server side make use of Jackson’s @JsonIgnoreProperties annotation (or the equivalent in whatever non-Jackson JSON library you are using); see the sketch after this list.
  2. Server-Side Independence: Server-Side independence is the complement of client-side independence. It happens when you’re working on a service that gets called by a client and you can add new functionality without worrying about the client causing exceptions by not sending the most up-to-date data. Like client-side independence, this is especially useful when the two codebases involved are owned by two different teams. In order to achieve this degree of freedom and loose coupling, you need to be able to gracefully handle null values in your requests. The client doesn’t send the new field? No problem, just ignore and move on.
  3. Don’t Change Existing Inputs, Just Add New Ones: This recommendation is a corollary to #1 and #2. One of the cornerstones of achieving loose coupling is to ensure that you never change the meaning or type of existing inputs. Rather, you only add new inputs. Want to change the meaning of an existing variable in the JSON request? Just deprecate it and create a new variable. Alternatively, create a new version of your API, supporting both versions in the interim and eventually ignoring the old variable completely. Doing so doesn’t force your clients to change all at once. In fact, Unicode uses this principle to great effect: the Unicode Consortium has guaranteed that the character assigned to a given code point will never change – ever. This policy matters because it means you can always have confidence that when you code an a, it will always mean an a, and won’t change meaning to a b when Trump decides to establish Minitrue. Similarly, your clients can have confidence that you will never change the meaning of the data they send to you unless they call a new versioned endpoint.
  4. Flags to Turn On and Off Functionality: Lastly, as a fallback, you should consider adding a configuration flag that can enable and disable the new functionality. Ideally, with proper testing you shouldn’t need this. But, you should consider making use of such flags because they are fallbacks that can be readily used should you find an issue with your new functionality. Rollbacks are useful, but suppose you want to disable your functionality after another change has already been rolled out, or pushed to prod at the same time your own change went out. Unforeseeable issues make functionality flags a sensible tactic for achieving independent deployments between services, augmenting your systems’ overall robustness.
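
As a hedged illustration of points 1 and 2 (a minimal Jackson-based sketch; the BookingRequest fields here are hypothetical): the server-side POJO ignores JSON properties it doesn’t know about yet, and the handler treats the newer field as optional.

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;

// Client-side independence: unknown JSON properties (e.g. fields a newer client
// already sends) are silently ignored instead of blowing up deserialization.
@JsonIgnoreProperties(ignoreUnknown = true)
public class BookingRequest {
  private String bookingId;
  private String promoCode;  // newer, optional field; older clients won't send it

  public String getBookingId() { return bookingId; }
  public void setBookingId(String bookingId) { this.bookingId = bookingId; }
  public String getPromoCode() { return promoCode; }
  public void setPromoCode(String promoCode) { this.promoCode = promoCode; }
}

// Server-side independence (in its own file): handle the missing field gracefully.
class BookingHandler {
  void handle(BookingRequest request) {
    if (request.getPromoCode() != null) {
      // apply the promo code
    }
    // ... the rest of the flow works whether or not the client sent the new field
  }
}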

I hope that you, like me, can make use of these tactics for achieving independent deployments between services. Happy coding.

If the likes of Anthony Bourdain and Kate Spade are susceptible to suicide, what hope do the rest of us have in finding personal satisfaction via our careers?





Last week I woke up to the news that Anthony Bourdain had committed suicide in his hotel room in France. I really enjoyed Bourdain’s work, and consider myself a fan. The sheer volume of media attention surrounding his death attests to the tremendous impact he had on peoples’ lives. When I talked to my friend, Dany Houde, about Bourdain’s death, Dany discussed how Bourdain lived “the life” and how he was questioning his own choices since he thought so highly of Bourdain’s lifestyle. I agree with much of what Dany had to say. Like Dany’s choices, my own choices in travel destinations and desire to experience amazing cuisines on my journeys are heavily influenced by Bourdain’s shows.

Similarly, a few days before Bourdain’s suicide was news, the news of the day was that Kate Spade had also committed suicide. Admittedly, I wasn’t as big a fan of Kate Spade as I was of Anthony Bourdain. As a high-end women’s fashion designer, she had less of an impact on me than the bad-boy-globe-trotting-gourmand image projected by Bourdain. Nonetheless, I do respect her work. As with Bourdain, the maelstrom of media coverage surrounding Spade’s death is clear evidence of her tremendous impact on society, and of the fact that she, like Bourdain, had reached the apex of professional achievement.

One often associates professional success with personal satisfaction. I recall Brian Gill, one of the leaders that I have most respected in my career, stating that in his opinion, you should only work at a job that you love. It’s wise advice, and it’s advice that I strive to achieve. For me, the implication of Brian’s advice is that by having a career that you love, your personal sense of satisfaction with life overall is enhanced. After all, since work forms such a substantial portion of our lives, satisfaction at work means a greater percentage of the time overall that we feel satisfied.

But holy shit, man. Surely Bourdain and Spade both loved their jobs. Sure, I didn’t know them personally, but I just don’t see how you can get to that level of success without loving (and I mean really loving) what you do. For an aspiring fashion designer, you really can’t reasonably aspire to achieve anything more than Kate Spade did. And yet, she hanged herself. For an aspiring chef, restaurateur, or twenty-first-century-upper-middle-class-globe-trotter, you really can’t aspire to achieve anything more than Anthony Bourdain did. And yet, he too hanged himself.

But doesn’t doing what you love for a living naturally translate into personal satisfaction and fulfillment? Isn’t that the whole point, to be happy in the end? I get that money doesn’t buy happiness, but surely having a profession that you love translates into some measure of personal satisfaction in the larger sense. At the very least, shouldn’t doing what you love for a living lead to enough satisfaction that you shouldn’t want to kill yourself?

So then, what do I make of Bourdain and Spade? Was the fulfillment of success just not enough for them? Clearly, depression had a major influence. But they functioned well enough to get to where they were; the depression didn’t overwhelm them in their youth. Or was their success fueled by manic energy, a constant attempt to outrun their own demons and depression? Who knows…

But that’s exactly what makes their deaths so scary to me: their success wasn’t enough in the end. Shouldn’t people who have made it so far in life be beyond wanting to kill themselves? If they, being so awesome, couldn’t stave off the crushing depression of their own minds, what hope does mundane little ole me have?

For me, bereft of answers, all I can fall back on is that we all require balance in our lives. Our careers, alone, aren’t sufficient to achieve personal fulfillment. There are a number of dimensions in which we all need satisfaction.

One list of categories in which one needs to develop and feel satisfied comes from the different sections in John Sonmez’s book – Soft Skills:

  1. Career: Everybody has to pay the bills, but why not feel satisfied doing it? Everybody needs to grow and develop in their chosen profession.
  2. Self-Marketing (aka Personal Branding): Building your own personal brand can give you a big leg up on the competition. This is one of the things I’m trying to achieve by writing on this blog, which is also an outlet for expressing my thoughts and opinions and building a reputation in this space. Note that for many people, personal branding is tightly coupled with their career.
  3. Learning: IMO, learning should be a lifelong thing. Whether it’s learning new technologies in my career or reading a new parenting book, learning and assimilating the hard-won lessons of others is key to making fewer mistakes and becoming wiser as one gets older.
  4. Productivity (aka Getting Things Done): Everybody needs to accomplish daily tasks. It could be tasks at one’s job. It could be daily cleaning and cooking – the daily administrivia and minutiae, and paying those bills in a timely fashion. Whatever it is, we all need to get shit done, and getting better at doing it and developing good habits is necessary. Getting such things done and out of the way helps cultivate a sense of satisfaction and confidence in life.
  5. Finances: We need to increase our degree of financial freedom as we get older. Ideally, we want to have a decent retirement and live comfortably during our golden years.
  6. Fitness: Everybody needs to be healthy. When you aren’t healthy just about every other aspect of your life is dragged down. We all need to be as healthy as we can be.
  7. Spirit: This dimension isn’t so much about picking a faith as it is about changing your own brain to get what you want out of life. Atheism doesn’t exclude one from growing in this dimension; examples of growth in this category include improving one’s self-esteem and self-image. If you are spiritual, then something like cultivating your relationship with God falls under this category too: by believing that you are becoming closer to God, you are changing your own brain to feel better about, and more comfortable with, yourself.

In addition to the above categories, I would add two more to the list:

  1. Family: Everybody has a family of some sort. Investing in our relationships with our kids and our family helps one live a more fulfilling life.
  2. Volunteering: Everybody has gotten something from their community. At the very least, we have all been lucky in some capacity. As such, we should help give a little back and improve the lives of others. Doing so gives one a greater sense of personal fulfillment.

So, to come full circle, where do these categories leave me in thinking about Bourdain’s and Spade’s suicides? They had clearly mastered the categories of Career, Self-Marketing, Learning, Productivity, and Finances. Both appeared to be moderately healthy, so they must have had at least some measure of success with the Fitness category. They both seemed to do their share of volunteering – check for that category. But what about Spirit and Family? Of course, I have no idea. I don’t even know if improving in these categories would have helped them. I haven’t suffered through depression to the point where I felt suicidal, and I hope I never do. All I can say is that for me, personally, each of these categories contributes substantially to my sense of satisfaction and happiness. Focusing on my career alone doesn’t augment my long-term happiness. Instead, by focusing on my career in concert with these other areas, my overall happiness and satisfaction with life is increased. I hope that remains so, and I hope that each of these categories helps you achieve the same.

Why Some Companies Tend Toward Microservices and Why Some Companies Don’t

Recently, I started reading a new book called Building Microservices: Designing Fine-Grained Systems, by Sam Newman. So far, the book is good. The first chapter gives a good overview of what microservices are. One of the key takeaways is that you want to keep services small, a core benefit being that you can iterate faster than you otherwise would. However, in my experience, I have found that I (and others) can tend toward the reverse. That is, engineers might tend toward adding new code to existing monolithic services instead of adding code to new services. Funnily enough, we agree with the theory of microservices, but disagree with it in practice from time to time. It’s curious, because if microservices are so much simpler and easier to maintain than their monolithic counterparts, and allow engineers to be more productive, then why don’t they simply take over like weeds? Are many of my colleagues and I just shitty engineers, or is there something deeper going on? In some companies and environments microservices do proliferate, and in others they don’t. This is especially curious because people tend to follow the path of least resistance – they want to get their job done and move on.

In my experience, what I’ve found is that people tend toward microservices only when the process of spinning up a new microservice is easy. When that’s the case, it’s easier to create a new microservice for a new piece of functionality. Otherwise, it’s easier, faster, and cheaper to simply keep tacking new functionality onto an existing monolithic application until it becomes zombieware – at which point you still add new code to it, because nobody wants to deal with that undead elephant of a problem.

As an extremely astute colleague and friend of mine, Henry Seurer, once said: “The app is the deployment pipeline.” What Henry meant is that engineers simply leverage the existing app to ship new functionality because it’s the easiest thing to do. Thus, in order for organizations to have a thriving microservice ecosystem, they must enable engineers to easily provision hosts and deploy services/containers automatically using easy-to-create pipelines.

Why Should I Use Containers?

So I had my first skip-level meeting for my current team today (i.e. my first one-on-one meeting with my boss’ boss). It was interesting. My director is a sharp guy – I like him, and I feel like he’s trustworthy. One of the things I found interesting was how he tried to get to know me better by probing me on fundamentals. One question in particular stood out: why should we continue deploying our web services using Docker containers? It was the first time in quite a while that I had thought about this question, so I’ll list the two main reasons I gave my director. They are the reasons I think are most critical for large-scale enterprise environments.

  1. Containers are a lighter-weight form of virtualization. Virtualization allows one to run multiple services on a single host. This is useful because if your host isn’t precisely calibrated to fit your service, your hardware resources sit idle. An idle CPU, unused RAM, empty disk space, and an idle NIC are all consequences of undersized applications running on commodity hardware that wasn’t precisely chosen for the application it runs. And of course, in our world of data centers and clouds, we don’t want to custom-build hosts; we want them to be commoditized for optimal economics. By using virtualization, you can run multiple services that make better use of your hardware’s resources. But virtualization isn’t perfect: it comes with its own overhead. Virtualization itself requires additional CPU cycles, more RAM, more disk, etc. This cost can be considerable because you’re basically emulating an entire host and operating system, and it becomes enormous when you extrapolate it across an entire data center, let alone larger regions and WANs. A diagram in a nice ZDNet article on the subject (“Containers vs. VMs”) shows in some detail how Docker is a lighter-weight alternative to traditional VMs. Containers reduce the overhead that comes with traditional forms of virtualization.
  2. Containers let you deploy your application with greater automation and reliability. The aforementioned ZDNet article discusses how containers lend themselves to CI/CD. The reason is that it becomes easier not only to deploy your service on commoditized hardware, but also to leverage PaaS more. For example, suppose you have a Java 8 RESTful web service running on Tomcat 7. You don’t need a complicated automated deployment pipeline that separately deploys Java 8, Tomcat 7, and the WAR file (or, worse yet, an ops team that manually installs Tomcat 7 and Java 8 while a CI/CD pipeline deploys only your WAR file). Instead, you can deploy a single container that bundles Java 8, Tomcat 7, and your web service’s WAR file as one deployment artifact. Since you’re deploying a single container – everything goes out as the one product the pipeline pushes – you can incorporate tests that verify the whole thing works together (see the sketch after this list), leaving less need than ever before for shitty manual deployment processes.
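
To illustrate that last point, here is a hedged sketch of the kind of smoke test a pipeline could run against the freshly started container before promoting it. The localhost port and the /health endpoint are assumptions made purely for the example – substitute whatever your service actually exposes.

import java.net.HttpURLConnection;
import java.net.URL;

// Minimal post-deploy smoke test: once the pipeline has started the container
// (e.g. mapping the service's port to localhost:8080), verify that the service
// inside it actually answers. The URL and /health path are hypothetical.
public class ContainerSmokeTest {

  public static void main(String[] args) throws Exception {
    URL healthUrl = new URL("http://localhost:8080/health");
    HttpURLConnection conn = (HttpURLConnection) healthUrl.openConnection();
    conn.setConnectTimeout(5000);
    conn.setReadTimeout(5000);

    int status = conn.getResponseCode();
    if (status != 200) {
      // Failing here stops the pipeline before the container reaches prod.
      throw new IllegalStateException("Smoke test failed, HTTP status: " + status);
    }
    System.out.println("Container responded with 200 OK - smoke test passed.");
  }
}

Because the container bundles the JRE, Tomcat, and the WAR, a passing check like this exercises the exact artifact that will run in production, not just the WAR in isolation.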

Anyways, that’s it. I hope these two core reasons benefit you – they sure helped me. 🙂