A Collection of Papers on "Cloud Big Data Systems"

dking · February 15, 2018, 3:37pm

When I was a young warthog, I took a class on engineering distributed systems for processing large amounts of data. Many of the earlier papers focused on consensus algorithms (which we discussed in Hail Lab Meeting several weeks ago), but the class also included some papers about MapReduce and other work on the data processing systems themselves (rather than the underlying consensus systems).

The “Cloud Big Data Systems” class webpage contains links to all the papers. A couple that might be fun to talk about:

“MapReduce: A major step backwards” in which Michael Stonebraker [1] laments the hype around MapReduce
“A Comparison of Approaches to Large-Scale Data Analysis” in which Pavlo et al. compare Hadoop, Vertica, and an anonymous DBMS [2]. As you may expect, Hadoop doesn’t exactly shine.

[1] Stonebraker created SciDB, which attempts to tackle problems similar to those Hail addresses, though the cited use cases seem to focus on physics and astrophysics rather than genetics.
[2] There’s a particular company that was/is famous for not letting researchers use their name in publications that benchmark their system. I think that company is Oracle, but I’m not certain.
