A Hardware/Software Architecture for Petabyte Datasets

rcownie · September 20, 2018, 6:27pm

Hardware+Software Architecture for Huge Datasets

Richard Cownie, Hail Team, 2018-09-16

Introduction

We analyze the cost of storage and speed of access to 1 petabyte
of data in a conventional cloud-storage cloud-compute environment,
and compare to a converged storage+compute solution using
off-the-shelf hardware.

Cloud storage

Google cloud storage costs $0.026/GB-month
Azure object storage costs $0.017/GB-month (over 500TB)
AWS S3 storage costs $0.021/GB-month (over 500TB)

So storage of 1PB = 1000000GB = $17-26K/month

This is a baseline figure, and presumably there would be discounts
for high volume and non-profit use.
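
The list-price arithmetic above can be checked with a minimal sketch.
The prices are the ones quoted in this post; the function name
`monthly_cost` is only an illustrative choice.

```python
# List prices in $/GB-month, as quoted in this post.
PRICES_PER_GB_MONTH = {
    "Google cloud storage": 0.026,
    "Azure object storage": 0.017,
    "AWS S3": 0.021,
}

PETABYTE_IN_GB = 1_000_000  # decimal units: 1PB = 1000000GB


def monthly_cost(price_per_gb_month, size_gb=PETABYTE_IN_GB):
    """Monthly list-price cost in dollars for storing size_gb."""
    return price_per_gb_month * size_gb


for provider, price in PRICES_PER_GB_MONTH.items():
    print(f"{provider}: ${monthly_cost(price):,.0f}/month")
```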

Access latency to object storage is usually bad, e.g. > 50msec.
Typical bandwidth is 100MB/sec to each compute instance. Let’s
suppose we want to scan 10TB of data using 100 instances.
100 instances give 100*100 = 10000MB/sec, or 0.01TB/sec, so
the scan will take around 1000sec.
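
As a rough sketch, the scan-time estimate above is just dataset size
divided by aggregate bandwidth. The function name `scan_seconds` is
illustrative, and the model assumes bandwidth scales linearly with the
number of instances.

```python
def scan_seconds(dataset_tb, instances, mb_per_sec_per_instance):
    """Time to scan dataset_tb when every instance reads at full speed."""
    aggregate_mb_per_sec = instances * mb_per_sec_per_instance
    dataset_mb = dataset_tb * 1_000_000  # decimal units: 1TB = 1000000MB
    return dataset_mb / aggregate_mb_per_sec


# 10TB over 100 instances at 100MB/sec each
print(scan_seconds(10, 100, 100))  # 1000.0
```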

Converged storage+compute

Consider a cluster of 12 compute+storage nodes. The hardware
of each node consists of:

Dual-Xeon 2 x 12 cores (24 cores / 48 threads)
128GB DRAM
6 x 10TB HDDs (SATA or SAS)
Dual 10Gbit Ethernet (or other cluster interconnect)

Since we’re dealing with huge datasets, but those huge datasets
don’t change very often, it is more cost-effective to use
erasure-coding (e.g. Reed-Solomon) rather than replication.
Based on disk reliability figures published by Backblaze
(backblaze.com), we estimate that 9+3 erasure coding and quarterly
replacement of failed drives can reduce the expected frequency of
data loss to below once in 100 years.
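
One way to sanity-check this kind of estimate is a simple binomial
model. This is not necessarily the calculation behind the figure
above. It is a sketch that assumes independent drive failures, that
failed drives are only replaced at the end of each quarter, and a
hypothetical 2% annual failure rate (`ASSUMED_AFR`), which is a
placeholder rather than a number taken from the published statistics.
A 9+3 set loses data only if more than 3 of its 12 drives fail within
the same quarter.

```python
from math import comb

ASSUMED_AFR = 0.02  # hypothetical annual failure rate per drive (an assumption)


def set_loss_probability(afr, period_years=0.25, drives=12, tolerated=3):
    """Probability that more than `tolerated` of `drives` fail in one period."""
    # Convert the annual failure rate into a per-period failure probability.
    p_fail = 1.0 - (1.0 - afr) ** period_years
    return sum(
        comb(drives, k) * p_fail**k * (1.0 - p_fail) ** (drives - k)
        for k in range(tolerated + 1, drives + 1)
    )


def annual_loss_probability(afr, sets=12, periods_per_year=4):
    """Probability that at least one redundancy set loses data within a year."""
    per_period = set_loss_probability(afr, 1.0 / periods_per_year)
    return 1.0 - (1.0 - per_period) ** (sets * periods_per_year)


# 12 sets = 2 clusters x 6 sets, as in the 1080TB configuration.
p = annual_loss_probability(ASSUMED_AFR)
print(f"annual probability of data loss: {p:.2e}")
```

Under these assumptions the annual loss probability comes out far
below 1 in 100. Correlated failures (same batch of drives, same power
event) would make the real figure worse.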

So each 9+3 redundancy set would be spread across 12 HDDs with
1 HDD in each node. And we have 6 sets, giving usable storage
capacity of 6 sets * 9 HDDs * 10TB = 540TB.

Then two of these clusters, or 24 nodes, can provide 1080TB.
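
The capacity figures can be reproduced with a short sketch. It assumes
6 HDDs per node (the chassis described in the follow-up reply holds 2
nodes with 6 HDDs each), so that each node contributes one drive to
each of 6 redundancy sets. The function name `usable_tb` is
illustrative.

```python
def usable_tb(clusters=1, nodes=12, hdds_per_node=6, hdd_tb=10,
              data_blocks=9, check_blocks=3):
    """Usable capacity when each set puts exactly one HDD on every node."""
    if data_blocks + check_blocks != nodes:
        raise ValueError("set width must equal the number of nodes")
    sets_per_cluster = hdds_per_node  # one set per drive slot
    return clusters * sets_per_cluster * data_blocks * hdd_tb


print(usable_tb())                       # 540
print(usable_tb(clusters=2))             # 1080
print(usable_tb(clusters=3, hdd_tb=12))  # 1944
```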

This can be implemented with 2U rack units that each hold 2 nodes;
thinkmate.com quotes $18122 per unit including 3-year onsite
maintenance (see attached configuration). So the complete
1080TB-usable store would need 12 units, cost about $217K, and fit in
12 x 2U = 24U, about 60% of a 42U rack, plus some space for
top-of-rack switches (e.g. 2 x 2U). Or you could fit
3 x 540TB = 1620TB usable in a full rack (36U + switches). Or go up
to 12TB HDDs and get 1944TB usable in a full rack.

Each node takes about 400W, so total power is about 24 x 400W = 9.6KW.
At $0.15 per kWh, the monthly power cost is ~ 30*24*9.6 * 0.15 = $1037.
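
A minimal sketch of the power estimate, reading the 0.15 factor as an
electricity price of $0.15 per kWh and a month as 30 days (both are
interpretations of the arithmetic above). The function name
`monthly_power_cost` is illustrative.

```python
def monthly_power_cost(nodes=24, watts_per_node=400,
                       dollars_per_kwh=0.15, days=30):
    """Electricity cost in dollars for running all nodes continuously."""
    total_kw = nodes * watts_per_node / 1000  # 24 x 400W = 9.6KW
    hours = 24 * days
    return total_kw * hours * dollars_per_kwh


print(f"${monthly_power_cost():.0f}/month")
```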

Access latency is ~ 8msec, 6-10x faster than object storage, which
offers better opportunities to skip unwanted data (though nowhere near
as good as flash SSDs).

Bandwidth per HDD is about 230MB/sec. The check-blocks can be spread
across drives as in RAID5, so the full bandwidth of all 144 drives
(24 nodes x 6 HDDs) is available for data accesses (for large enough
datasets), giving 144*230 = 33.1GB/sec, 2TB/minute, or 119TB/hour.
So a scan of 10TB will take 302sec, 3.3x faster than the example
above - but with no added cost for compute instances.
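
The bandwidth figures can be reproduced with a short sketch. It
assumes 144 drives in total (24 nodes with 6 HDDs each, per the
chassis described in the follow-up reply), which is the drive count
that matches the 33.1GB/sec figure. The function names are
illustrative.

```python
def aggregate_gb_per_sec(drives=144, mb_per_sec_per_drive=230):
    """Total sequential read bandwidth with every drive streaming."""
    return drives * mb_per_sec_per_drive / 1000


def local_scan_seconds(dataset_tb, gb_per_sec):
    """Time to read dataset_tb at the given aggregate bandwidth."""
    return dataset_tb * 1000 / gb_per_sec


bandwidth = aggregate_gb_per_sec()
print(f"{bandwidth:.1f}GB/sec")
print(f"{bandwidth * 3600 / 1000:.0f}TB/hour")
print(f"10TB scan: {local_scan_seconds(10, bandwidth):.0f}sec")
```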

In typical use, I would expect that the permanent compute+storage
cluster would execute low-level scans, filters, aggs, and simple transforms,
producing a relatively small processed dataset which could be sent to
on-demand instances more suitable for linear algebra or ML (e.g. with GPGPUs).

The converged storage+compute cluster is also suitable for use with
scheduled scans of a huge dataset. A scan of a 100TB dataset takes 50 minutes,
so this could be scheduled twice a day using 7% of bandwidth.
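
A small sketch of the scheduling arithmetic, using the 119TB/hour
aggregate bandwidth from above. The function names are illustrative,
and the model assumes scans run at full sequential bandwidth.

```python
def scan_minutes(dataset_tb, tb_per_hour=119.0):
    """Minutes needed for one full scan of the dataset."""
    return dataset_tb / tb_per_hour * 60


def daily_utilization(dataset_tb, scans_per_day, tb_per_hour=119.0):
    """Fraction of the day's bandwidth consumed by the scheduled scans."""
    busy_minutes = scans_per_day * scan_minutes(dataset_tb, tb_per_hour)
    return busy_minutes / (24 * 60)


print(f"{scan_minutes(100):.0f} minutes per scan")
print(f"{daily_utilization(100, 2):.0%} of bandwidth")
```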

Cost comparison

We assume 50% discount on cost of Google object storage, giving
$13K/month cost.

Assume that in addition to the compute+storage cluster, we also keep a
copy of data in coldline storage at $0.001/GB-month, or $1K/month. So
running cost is about $1K/month for power, + $1K/month for coldline
storage.

So the compute+storage cluster saves about $13K - $2K = $11K/month,
and breaks even at $217K/11K = 20 months of usage.
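
The break-even estimate can be sketched as follows, using the
unrounded inputs from this post (12 units at $18122, a 50% discount on
$0.026/GB-month, about $1037/month for power, and $1K/month for
coldline storage). The function name `break_even_months` is
illustrative.

```python
def break_even_months(hardware_cost, cloud_monthly, cluster_monthly):
    """Months until cumulative savings cover the up-front hardware cost."""
    monthly_savings = cloud_monthly - cluster_monthly
    if monthly_savings <= 0:
        raise ValueError("cluster never breaks even")
    return hardware_cost / monthly_savings


hardware_cost = 12 * 18122               # 12 x 2U units, 2 nodes each
cloud_monthly = 0.5 * 0.026 * 1_000_000  # discounted object storage for 1PB
cluster_monthly = 1037 + 1000            # power + coldline copy

months = break_even_months(hardware_cost, cloud_monthly, cluster_monthly)
print(f"break-even after about {months:.0f} months")
```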

Conclusion

The system configuration presented here is not necessarily optimal, but
it is one datapoint to demonstrate the advantages of a converged
storage+compute architecture over conventional cloud storage. For storage
and analysis of petabyte-scale datasets with a write-once read-many workload,
a converged storage+compute cluster can offer higher absolute performance
(exploiting the high bandwidth of direct-attached HDDs) at lower cost
(probably because cloud object storage is based on replication, which is
not cost-effective for rarely-updated petabyte-scale datasets).

jjfarrell · October 8, 2018, 6:55pm

I don’t see the attached config for the $18,122 2-node 2U units. Could that be posted? Which 12-core Intel chip does this config have?

rcownie · October 9, 2018, 12:46am

I have it as a pdf, I’m not sure how to attach that here, email me at richard.cownie@pobox.com
and I’ll reply with an attachment.

The cpu’s I picked as an example were the Xeon E5-2650 v4 (12-core/24-thread 2.20GHz 105W)
The quote is from http://thinkmate.com, and the server chassis is the SuperServer 6028TR-DTR,
a 2U rack which gives 2 nodes each with 2 Xeon’s + 6 HDDs. Of the systems currently available
off the shelf, that seems like the best/densest way to get (what might be) a good balance of
storage capacity, aggregate storage bandwidth, and compute throughput.
