Hardware+Software Architecture for Huge Datasets
Richard Cownie, Hail Team, 2018-09-16
Introduction
We analyze the cost of storing, and the speed of accessing, 1 petabyte
of data in a conventional cloud-storage/cloud-compute environment,
and compare this with a converged storage+compute solution built from
off-the-shelf hardware.
Cloud storage
Google cloud storage costs $0.026/GB-month
Azure object storage costs $0.017/GB-month (over 500TB)
AWS S3 storage costs $0.021/GB-month (over 500TB)
So storing 1PB = 1,000,000GB costs roughly $17-26K/month.
This is a baseline figure, and presumably there would be discounts
for high volume and non-profit use.
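As a quick sanity check, the storage bill scales linearly with capacity;
a minimal Python sketch using the list prices above (no discounts applied):

    # Back-of-envelope monthly cost of 1PB in object storage,
    # using the per-GB-month list prices quoted above
    PRICES = {"gcs": 0.026, "azure": 0.017, "s3": 0.021}  # $/GB-month
    capacity_gb = 1_000_000  # 1 PB
    for provider, price in PRICES.items():
        print(f"{provider}: ${price * capacity_gb:,.0f}/month")
    # -> roughly $17K-26K/month across the three providers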
Access latency to object storage is usually bad, e.g. > 50msec.
Typical bandwidth is 100MB/sec to each compute instance. Let’s
suppose we want to scan 10TB of data using 100 instances.
100 instances give 100*100 = 10000MB/sec, or 0.01TB/sec, so
the scan will take around 1000sec.
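The same estimate as a short sketch, assuming the 100MB/sec-per-instance
figure above:

    # Time to scan 10TB from object storage with N compute instances,
    # assuming ~100MB/sec of object-storage bandwidth per instance
    per_instance_mb_s = 100
    instances = 100
    scan_tb = 10
    total_tb_s = instances * per_instance_mb_s / 1e6  # MB/sec -> TB/sec
    print(scan_tb / total_tb_s)  # ~1000 seconds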
Converged storage+compute
Consider a cluster of 12 compute+storage nodes. The hardware
of each node consists of:
Dual-Xeon 2 x 12 cores (24 cores / 48 threads)
128GB DRAM
6 x 10TB HDDs (SATA or SAS)
Dual 10Gbit Ethernet (or other cluster interconnect)
Since we’re dealing with huge datasets, but those huge datasets
don’t change very often, it is more cost-effective to use
erasure-coding (e.g. Reed-Solomon) rather than replication.
Based on disk reliability figures published by Backblaze (backblaze.com),
we estimate that 9+3 erasure coding with quarterly replacement
of failed drives can push the expected time to data loss
above 100 years.
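A rough sketch of that estimate, assuming independent drive failures and
an annual failure rate of about 1.5% (a stand-in for Backblaze-style
figures, not a number from the attached configuration); correlated
failures and rebuild time would make the real figure worse, which is why
the claim above is stated conservatively:

    # Rough durability model for one 9+3 set of 12 drives: data is lost
    # only if 4 or more drives in the set fail within one replacement
    # interval (a quarter). Assumes independent failures.
    from math import comb

    afr = 0.015                      # assumed annual failure rate per drive
    p = afr / 4                      # probability of failing within a quarter
    n, parity = 12, 3
    p_set_loss_per_quarter = sum(
        comb(n, k) * p**k * (1 - p)**(n - k) for k in range(parity + 1, n + 1)
    )
    sets, quarters_per_year = 12, 4  # 12 sets across the 24-node system
    p_loss_per_year = p_set_loss_per_quarter * sets * quarters_per_year
    print(1 / p_loss_per_year)       # idealized years between data-loss events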
Each 9+3 redundancy set is spread across 12 HDDs, with
1 HDD in each node. With 6 HDDs per node that gives 6 sets per
cluster, for a usable capacity of 6 sets * 9 HDDs * 10TB = 540TB.
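The capacity arithmetic, spelled out as a sketch:

    # Usable capacity of one 12-node cluster with 9+3 erasure coding
    nodes, hdds_per_node, hdd_tb = 12, 6, 10
    data_shards, parity_shards = 9, 3
    sets = nodes * hdds_per_node // (data_shards + parity_shards)  # 6 sets
    usable_tb = sets * data_shards * hdd_tb
    print(usable_tb)  # 540 TB usable per 12-node cluster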
Then two of these clusters, or 24 nodes, can provide 1080TB.
This can be implemented with 2 nodes per 2U chassis; thinkmate.com
quotes $18,122 per chassis including 3-year onsite maintenance (see
attached configuration). So the complete 1080TB-usable store (12
chassis) would cost about $217K, and could fit in 12 x 2U = 24U,
about 60% of a 42U rack, plus some space for top-of-rack switches
(e.g. 2 x 2U). Or you could fit 3 x 540TB = 1620TB usable in a full
rack (36U + switches). Or go up to 12TB HDDs and get 1944TB usable
in a full rack.
Each node takes about 400W, so total power is about 24 * 400W = 9.6kW,
so monthly power cost ~ 720 hours * 9.6kW * $0.15/kWh = $1037.
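Putting the cost, rack-space, and power numbers together (the 400W/node
draw and $0.15/kWh rate are the assumptions above):

    # Capital cost, rack space, and monthly power for the 24-node system
    chassis = 12                    # 2 nodes per 2U chassis
    price_per_chassis = 18_122      # thinkmate.com quote, incl. 3yr maintenance
    capex = chassis * price_per_chassis          # ~$217K
    rack_units = chassis * 2                     # 24U of a 42U rack

    nodes, watts_per_node = 24, 400
    kw = nodes * watts_per_node / 1000           # 9.6 kW
    monthly_power = 30 * 24 * kw * 0.15          # ~$1037 at $0.15/kWh
    print(capex, rack_units, monthly_power)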
Access latency is ~ 8msec, 6-10x faster than object storage, which
offers better opportunities to skip unwanted data (though nowhere near
as good as flash SSDs).
Bandwidth per HDD is about 230MB/sec. The check-blocks can be spread
across drives as in RAID5, so the full bandwidth of all 144 drives
(24 nodes x 6 HDDs) is available for data accesses (for large enough
datasets), giving 144 * 230MB/sec = 33.1GB/sec, 2TB/minute, or 119TB/hour.
So a scan of 10TB takes about 302sec, 3.3x faster than the example
above - but with no added cost for
compute instances.
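The scan-rate arithmetic as a sketch, assuming the 230MB/sec-per-HDD
figure above:

    # Aggregate scan bandwidth of the 24-node, 144-HDD system
    hdds = 144
    mb_per_s_per_hdd = 230
    gb_per_s = hdds * mb_per_s_per_hdd / 1000    # ~33.1 GB/sec
    tb_per_hour = gb_per_s * 3600 / 1000         # ~119 TB/hour
    scan_10tb_s = 10_000 / gb_per_s              # ~302 seconds
    print(gb_per_s, tb_per_hour, scan_10tb_s)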
In typical use, I would expect the permanent compute+storage
cluster to execute low-level scans, filters, aggregations, and simple
transforms, producing a relatively small processed dataset which could
be sent to on-demand instances more suitable for linear algebra or ML
(e.g. with GPGPUs).
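As an illustration of that split, a hypothetical Hail-style pipeline
(paths and thresholds are invented for the example): the cluster performs
the full scan, filter, and aggregation, and only a small summary table is
exported for downstream work on GPU instances.

    # Heavy scan/filter/aggregate runs on the converged cluster over the
    # full dataset; only the small summary table leaves the cluster.
    import hail as hl

    mt = hl.read_matrix_table('/cluster/data/huge_dataset.mt')  # hypothetical path
    mt = hl.variant_qc(mt)                                      # per-variant aggregation
    mt = mt.filter_rows(mt.variant_qc.call_rate > 0.95)         # hypothetical threshold
    summary = mt.rows().select('variant_qc')                    # small per-variant table
    summary.export('/cluster/out/summary.tsv')  # ship to on-demand ML instances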
The converged storage+compute cluster is also well suited to
scheduled scans of a huge dataset. A scan of a 100TB dataset takes about
50 minutes, so two scans a day would use only about 7% of the cluster's
bandwidth.
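Again as arithmetic:

    # Fraction of the day's bandwidth used by two scheduled 100TB scans
    scan_minutes = 100_000 / (33.1 * 60)    # ~50 minutes per scan at 33.1GB/sec
    print(2 * scan_minutes / (24 * 60))     # ~0.07, i.e. about 7%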
Cost comparison
We assume a 50% discount on the cost of Google object storage, giving a
$13K/month cost.
Assume that in addition to the compute+storage cluster, we also keep a
copy of the data in coldline storage at $0.001/GB-month, or $1K/month. So
the running cost of the converged solution is about $1K/month for power
plus $1K/month for coldline storage.
So the compute+storage cluster saves about $11K/month, and breaks even
after $217K / $11K ~ 20 months of usage.
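And the break-even arithmetic, using the discount and coldline
assumptions above:

    # Monthly savings and payback period of the converged cluster
    object_storage = 26_000 * 0.5          # assumed 50% discount on the GCS list price
    converged_running = 1_000 + 1_000      # power + coldline backup copy
    savings = object_storage - converged_running   # ~$11K/month
    print(217_000 / savings)               # ~20 months to recover the ~$217K capex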
Conclusion
The system configuration presented here is not necessarily optimal, but
it is one datapoint to demonstrate the advantages of a converged
storage+compute architecture over conventional cloud storage. For storage
and analysis of petabyte-scale datasets with a write-once read-many workload,
a converged storage+compute cluster can offer higher absolute performance
(exploiting the high bandwidth of direct-attached HDDs) at lower cost
(probably because cloud object storage is based on replication, which is
not cost-effective for rarely-updated petabyte-scale datasets).