Background

Code42 CrashPlan is a backup service that promises unlimited backup for a fixed monthly fee. I have been using them since 2010, I think. It works, but performance has always been an issue: CPU usage was high, memory consumption was high, and upload speed was fairly abysmal. Support kept referring me to this article, and to the following quote specifically:

Code42 app users can expect to back up about 10 GB of information per day on average if the user’s computer is powered on and not in standby mode.

My upload rate was consistently capped at about 132 kB/s, which works out to roughly the 10 GB per day from the quote above. At that point I concluded that they were probably throttling me. That would not be surprising: nothing unlimited can be offered at a flat rate without limitations somewhere, and having that support article in front of me was convincing enough to conclude that this was by design.

This was, of course, unacceptable for my use case, so I have since migrated to another solution: Duplicacy backup software coupled with a third-party cloud block storage provider as the destination, and I have been fairly happy with the performance.

Recently, however, it was pointed out to me that my upload issues with CrashPlan were likely caused not by server-side bandwidth management at Code42, but by the local backup engine choking while trying to deduplicate a massive amount of data. "Massive" in this case was still under 1 TB, but apparently that was enough for the Java-based backup engine to peg an Atom C3000 processor.

Solution

I attempted the following, without success:

  • Disable compression on each backup set
  • Set an upper limit on the file size that is subject to deduplication
  • Set deduplication mode to AUTOMATIC or MINIMAL

I eventually found that the only way to drop CPU usage and fix the upload rate was to disable deduplication altogether:

sed -i "s/<dataDeDupAutoMaxFileSizeForWan>[0-9]*<\/dataDeDupAutoMaxFileSizeForWan>/<dataDeDupAutoMaxFileSizeForWan>1<\/dataDeDupAutoMaxFileSizeForWan>/g" my.service.xml

The dataDeDupAutoMaxFileSizeForWan setting caps, by file size, which files are subject to deduplication when backing up to WAN destinations. The default value of 0 means “no limit”. Setting it to the smallest positive value (1 byte) effectively disables deduplication, since any real file exceeds that threshold.

You need to stop the CrashPlan engine, update the XML config file, and only then start the engine again; otherwise your changes won’t persist, since CrashPlan writes its configuration back out on shutdown.
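
Putting it all together, here is a minimal sketch of the procedure on a Linux install. The service name (code42) and the config path (/usr/local/crashplan/conf/my.service.xml) are assumptions that vary between platforms and client versions, so adjust them to your installation:

sudo systemctl stop code42                      # stop the engine first, or it will overwrite the file on shutdown
CONF=/usr/local/crashplan/conf/my.service.xml   # assumed config path; adjust to your install
sudo cp "$CONF" "$CONF.bak"                     # keep a copy of the original configuration
sudo sed -i "s/<dataDeDupAutoMaxFileSizeForWan>[0-9]*<\/dataDeDupAutoMaxFileSizeForWan>/<dataDeDupAutoMaxFileSizeForWan>1<\/dataDeDupAutoMaxFileSizeForWan>/g" "$CONF"
grep dataDeDupAutoMaxFileSizeForWan "$CONF"     # verify the value now reads 1
sudo systemctl start code42                     # start the engine back up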

Caveats

Code42 warns about the potential implications of disabling deduplication in this article. Depending on the type of data, doing so may be counterproductive: for large files of which only small portions change frequently, such as VM images, the upload time saved by deduplicating away unchanged blocks can more than offset the performance cost of running deduplication. The impact and benefits of this change should be evaluated and measured for your specific backup set before committing to it.

For most home users, however, who mostly back up photos, videos, and other media (data that is by its nature unique and incompressible), deduplication provides no benefit and can be safely disabled, yielding a net gain in backup performance. These users may also benefit from disabling compression selectively (or even altogether) for some or all backup sets; however, I haven’t noticed any significant impact of this change on my own dataset. For example, the command below switches the compression setting from ON to AUTOMATIC:

sed -i "s/<compression>ON<\/compression>/<compression>AUTOMATIC<\/compression>/g" my.service.xml
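
Both tweaks live in the same file and require the same stop/edit/start dance, so they can be applied in a single sed invocation while the engine is stopped; a sketch, again assuming the config path used above:

sudo sed -i \
  -e "s/<dataDeDupAutoMaxFileSizeForWan>[0-9]*<\/dataDeDupAutoMaxFileSizeForWan>/<dataDeDupAutoMaxFileSizeForWan>1<\/dataDeDupAutoMaxFileSizeForWan>/g" \
  -e "s/<compression>ON<\/compression>/<compression>AUTOMATIC<\/compression>/g" \
  /usr/local/crashplan/conf/my.service.xml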

Final words

Code42 should seriously consider substantially overhauling their backup client software, as this sort of performance is not acceptable, and the impact will only get worse as the backup set grows over time. It does, however, provide them with a “natural” throttling mechanism, so I can see why they would be hesitant to optimize performance. I’m just speculating here, but presumably the cost of the additional bandwidth consumed by the few users who bother to disable deduplication is more than offset by the storage savings from effectively rate-limiting all other customers, penalizing users in proportion to the amount of data they have. It’s pretty slick if you think about it.