Homework Assignments


Homework Assignment 0

  1. Please submit a one-page self-introduction to my email address, including the following information:
  • Your full name
  • Your GMU email address
  • Your department/major
  • Degree that you are working on (PhD/MS/BS)
  • Which year you are in
  • Your research area/topic and advisor (if known)
  • Why you are taking this course
  • What you expect to learn from this course
  2. Sign up for Piazza over here.
  3. Paper presentation sign-up Doodle link over here.
  4. Sign up for AWS Educate to get the $100 AWS credits.

Homework Assignment 1

Important

This Homework Assignment is worth 10%. Due: 11:59pm Sep 14. New: Deadline extended to 11:59pm Sep 16.

Introduction

This homework assignment will give you experience building and assembling microservices with integrated public cloud services. You will solve the problem of providing strong consistency on top of weakly consistent cloud storage.

In this assignment you will leverage an external consistency anchor (i.e., an etcd storage service) to enable strong consistency on top of the eventually consistent AWS S3 object store. Essentially you will build a composite storage service whose key property is that its consistency matches that of the consistency anchor. etcd provides strong consistency, and hence serves ideally as both a lock server and a metadata server (MDS).

Figure 1: An overview of the system architecture.

Figure 1 depicts the high-level architecture of the service that you will build in this assignment. The client library (whose major logic you will implement) will interact with etcd (the consistency anchor) and AWS S3 to perform read and write operations. Since the service realizes a key-value interface, we call a read a Get and a write a Put.

Get/Put Operations

The client library acquires a lock over the key of the data object from etcd before proceeding with a Get or a Put operation. Put acquires the lock from etcd and then issues a write to the S3 object store. AWS S3 is a versioned object store that uses versioning to track update history; you will have to enable versioning at the bucket level. The client library captures the new version ID of the data object, which is piggybacked on the ack returned by the S3 write, and updates the etcd metadata server accordingly. The Put pseudo code works as follows:

Put(bucketName, key, obj):
    etcd.lock.acquire(bucketName+key)  # acquire the lock over the full path of data obj
    put_ack = s3.put_object(bucketName, key, obj)  # write to S3 and get put ack
    etcd.mds.write(bucketName+key, put_ack.VersionId)  # update etcd metadata server (MDS) with updated version id of the data obj
    etcd.lock.release(bucketName+key)  # release the lock
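
For concreteness, here is a minimal sketch (not the official solution) of how this pseudo code might translate into Python with boto3 and python-etcd; the endpoint settings and the /mds/ key prefix are illustrative assumptions, not part of the handout:

import boto3
import etcd

etcd_client = etcd.Client(host='127.0.0.1', port=2379)  # assumed single-node etcd
s3 = boto3.client('s3')

def put(bucket_name, key, obj):
    lock = etcd.Lock(etcd_client, bucket_name + key)  # lock ID: full path of the data obj
    lock.acquire(blocking=True, lock_ttl=60)          # block until acquired; auto-expire after 60s
    try:
        ack = s3.put_object(Bucket=bucket_name, Key=key, Body=obj)
        # 'VersionId' is present in the ack only if versioning is enabled on the bucket
        etcd_client.write('/mds/' + bucket_name + key, ack['VersionId'])
    finally:
        lock.release()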

The rationale behind this is that, by keeping track of object versions, the application can always verify whether the just-fetched data is the most up-to-date (not stale). With this in mind, Get is also required to: 1. acquire the lock from etcd under the same data path (in our pseudo code example, we use the concatenated bucketName+key as the lock ID); 2. fetch the latest version ID from etcd; 3. perform S3 reads until the fetched object matches the version ID seen in Step 2; and 4. release the lock. The Get pseudo code works as follows:

Get(bucketName, key):
    etcd.lock.acquire(bucketName+key)  # acquire the lock over the full path of data obj
    versionId = etcd.mds.read(bucketName+key)  # read the latest version ID of the obj
    while True:
        get_ack = s3.get_object(bucketName, key)  # keep fetching obj...
        if versionId == get_ack.VersionId:  # until its version ID matches
            break
    etcd.lock.release(bucketName+key)  # release the lock
    return get_ack.obj
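
And a matching sketch of Get, under the same assumptions as the Put sketch above (reusing its etcd_client and s3 handles):

def get(bucket_name, key):
    lock = etcd.Lock(etcd_client, bucket_name + key)
    lock.acquire(blocking=True, lock_ttl=60)
    try:
        version_id = etcd_client.read('/mds/' + bucket_name + key).value
        while True:  # retry until S3 returns the version recorded in etcd
            ack = s3.get_object(Bucket=bucket_name, Key=key)
            if ack['VersionId'] == version_id:
                return ack['Body'].read()
    finally:
        lock.release()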

Make sure you use:

import logging
import traceback

try:
    '''
    ...proper code...
    '''
except Exception as e:
    logging.error(traceback.format_exc())

or something equivalent to handle exception cases, such as fetching a non-existent S3 object.

Besides Put and Get, you should also implement a CreateBucket API to create a bucket in your S3 instance: you cannot use S3 until you have created a namespace, which is specified by a bucket name. When creating a bucket, make sure to enable S3 versioning, since versioning is disabled by default in S3.
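
A minimal sketch of such a CreateBucket, reusing the boto3 client s3 from the earlier sketch (and assuming the us-east-1 region, where create_bucket needs no LocationConstraint):

def create_bucket(bucket_name):
    s3.create_bucket(Bucket=bucket_name)
    s3.put_bucket_versioning(
        Bucket=bucket_name,
        VersioningConfiguration={'Status': 'Enabled'})  # versioning is off by default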

Collaboration Policy

You may pick a partner, but it is also fine to work on this assignment individually. Once you have decided on your team composition, you must stick with it: you are not allowed to change partners or switch back to working individually for Homework Assignment 2. Please fill out this Google form with your team information.

You must write all the code you hand in, except for code that I give you as part of the assignment. You are not allowed to look at anyone else’s solution. You may discuss the assignments with other students, but you may not look at or copy each others’ code. Please do not publish your code on the Internet – for example, please do not make your code visible on GitHub.

Software Development

You will implement both Homework Assignments 1 and 2 in Python, but you are more than welcome to use any programming language you like, provided you can demo that your implementation works. I supply you with a partial client library implementation (just the framework and some boring bits) and some tests.

You are required to code and test in a configurable Linux environment to which you have sudo access. You can use the VSE Linux machines (aka zeus.vse.gmu.edu) if you can freely install the required dependencies (the python-etcd library and the AWS Boto3 SDK). -OR- you can rent one or a few small EC2 Linux VMs (e.g., t2.small at $0.023/hour, or even the free-tier t2.micro with 750 hours of usage) and do the development there (EC2-to-S3 traffic is free :-). The $100 AWS credits are more than sufficient for the homework usage.

Required Python (2.7) dependencies can be installed using pip:

pip install python-etcd
pip install boto3

When using Boto3 to access the S3 service, you will need to provide the corresponding S3 credentials. Please make sure you are NOT sharing any sensitive credentials in PLAIN TEXT in your code. To configure a shared credentials file, create:

~/.aws/credentials

Below is a minimal example of the shared credentials file:

[default]
aws_access_key_id=foo
aws_secret_access_key=bar

Boto3 can also load credentials from:

~/.aws/config

An example config file is shown below:

[default]
region=us-east-1
output=json

The above simple examples should be sufficient for this Homework Assignment. For comprehensive configuration options, see the official Boto3 documentation.

You can build etcd from scratch by following the instructions on GitHub. Here is an example etcd.conf that you can refer to when instantiating a single-node etcd setup; make sure you change all the URL-related configuration options to reflect your own IP address setup. For now you don't need to worry about the reliability and data persistence of the etcd service, so a non-replicated, single-node etcd setup suffices. Run etcd by typing:

./bin/etcd --config-file etcd.conf

For ease of development and testing, you can tear down an already running etcd deployment by simply killing the process, and re-run it for repeated testing.
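
To verify that your etcd instance is reachable, a quick smoke test such as the following (endpoint assumed) should round-trip a value:

import etcd

c = etcd.Client(host='127.0.0.1', port=2379)  # match the client URL in your etcd.conf
c.write('/ping', 'pong')
print(c.read('/ping').value)  # should print: pong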

Getting Started

Download the s3kv wrapper code here. Your job is to add your logic to finish the API implementation. For testing, you can modify driver.py to read from/write to your S3 buckets, or modify the little benchmarking tool s3kv_slap.py to perform single-threaded or multi-threaded benchmarking against your S3KV implementation.

Download the benchmarking traces here.

You can always generate your own traces using YCSB (Yahoo! Cloud Serving Benchmark) and feed them into s3kv_slap.py for testing. First, build YCSB using Maven by typing:

mvn clean package

The whole YCSB build process may take tens of minutes. After that, you can compose your own workload configuration file under ycsb/workloads. See here for a list of workload properties. For your reference, here is a sample workload configuration file that I used to generate wl_small.load and wl_small.run.
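
For illustration only (the actual wl_small file linked above may use different values), a CoreWorkload configuration looks like this:

recordcount=1000
operationcount=1000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
requestdistribution=zipfian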

For example, under the ycsb root dir, to generate load traces, type:

./bin/ycsb.sh load basic -P workloads/wl_small > workloads/wl_small.load

To generate run traces, type:

./bin/ycsb.sh run basic -P workloads/wl_small > workloads/wl_small.run

Hint

You should manually delete the unnecessary header and footer info (the YCSB runtime statistics) from the generated .load and .run workload traces. Also note that you should always run the load phase before the actual run phase: the .load workload generates all the data objects that will be touched by the subsequent .run workload, and the .run workload contains the mix of reads and writes of interest.

Handin Procedure

Submit your code via Mason’s GitLab at

https://git.gmu.edu/users/sign_in.

You will need to log in to GitLab using your Patriot Pass credentials. Make sure you create a Private project for Homework Assignment 1, and use GitLab (i.e., Git, a version control system) to keep track of your code editing history. For example, if you want to checkpoint your progress, you can commit your changes by running:

git commit -am 'partial solution to homework assignment 1'

Your GitLab repository for Homework Assignment 1 should include:

s3kv.py,

and/or:

s3kv_slap.py,
driver.py,
-AND/OR- your own driver and test code.

When submitting, share your repository with my GitLab ID: yuecheng. I will use the timestamp of your last commit for the purpose of calculating late days.

Note

No credit if your code does not compile/run. Late submissions of homework assignments will be penalized 15% per day, and will not be accepted more than 3 days after the due date. Students are responsible for keeping backups of their work while working on an assignment.


Homework Assignment 2

Important

This Homework Assignment is worth 10%. Due: 11:59pm Oct 5.

Introduction

This homework assignment will give you experience using Redis as a building block for accelerating S3 performance. You will implement a new algorithm for the Put/Get operations and add a memory caching tier to improve the I/O performance of the S3KV service.

Figure 2: An overview of the system architecture.

Atop the framework you implemented in Homework Assignment 1, you will add a remote memory cache (Redis) to buffer the data fetched from the remote S3 object store. See Figure 2 for the logical architecture.

Get/Put Operations

Again, the client library acquires a lock over the key of the data object from etcd before proceeding with a Get or a Put operation. Put acquires the lock from etcd and then issues a write to the S3 object store. In this assignment, you will implement a hash-based algorithm for both Put and Get.

The Put operation first computes a hash value by calling a Hash function over the data object that is to be written to S3. Hash takes a string (in this case the data object) and deterministically generates a fixed-length string as the hash value. A good hash function guarantees that any two different keys are extremely unlikely to yield the same hash value (i.e., minimal probability of collision). See how hash functions work here.
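
For example, a minimal Hash could be built on hashlib (SHA-256 is an illustrative choice; any collision-resistant digest works):

import hashlib

def Hash(obj):
    return hashlib.sha256(obj).hexdigest()  # deterministic, fixed-length hex digest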

Put pseudo code works as follows:

1  def Put(bucketName, key, obj):
2      etcd.lock.acquire(bucketName+key)  # acquire the lock over the full path of data obj
3      redis.set(key, obj)  # Redis write path
4      h = Hash(obj)  # compute the hash value of the data obj
5      put_ack = s3.put_object(bucketName, key|h, obj)  # write to S3 under a manipulated key (original key concatenated with the hash value h)
6      etcd.mds.write(bucketName+key, h)  # update etcd metadata server (MDS) with updated hash value
7      etcd.lock.release(bucketName+key)  # release the lock

You can treat this algorithm as a different way of storing version IDs, as you did in Homework Assignment 1. Here in Homework Assignment 2, instead of relying on S3's version IDs for achieving consistency, your Put first stores into the etcd MDS a hash value of the new data object that you want to write to S3. This hash value represents a new version of the data object, essentially serving the same purpose as a version ID. Each time you update an existing object, you are effectively creating a new object under a new manipulated key (the original key concatenated with the newly computed hash value). Under this rationale, the Get operation performs the following four steps: 1. acquire the lock; 2. fetch the latest hash value from the etcd MDS; 3. perform S3 reads with the new key key|h until the fetched object is not NULL; 4. release the lock and return the object.

Get pseudo code works as follows:

 1  def Get(bucketName, key):
 2      etcd.lock.acquire(bucketName+key)  # acquire the lock over the full path of data obj
 3      val = redis.get(key)  # Redis read 1: fetch cached obj from Redis
 4      if val != NULL:  # Redis read 2: for a Redis cache hit,
 5          etcd.lock.release(bucketName+key)  # release the lock,
 6          return val  # directly return fetched obj
 7      h = etcd.mds.read(bucketName+key)  # for a Redis cache miss, read the latest hash value of the data obj
 8      while True:
 9          get_ack = s3.get_object(bucketName, key|h)  # keep fetching the data under key|h...
10          if get_ack.obj != NULL:  # ...until the fetched obj is not NULL
11              break
12      etcd.lock.release(bucketName+key)  # release the lock
13      if Hash(get_ack.obj) == h:  # double-check that the hash value of the fetched obj matches h
14          redis.set(key, get_ack.obj)  # Redis read 3: populate the Redis cache
15          return get_ack.obj  # if yes, return the fetched obj
16      else:
17          return NULL  # else, return NULL

Redis Cache Operations

You will embed the Redis cache logic within the Get/Put path; the corresponding Redis operations are marked by their line numbers in the pseudo code above. The Put path is straightforward: you just need to insert (with redis.set) the latest data object into Redis (Line 3 of the Put API). This is essentially a write-through cache. The Get path is a bit trickier, in that you first need to probe Redis by issuing a redis.get to check whether the requested object has already been cached (Line 3 of the Get API). If the probe returns a valid object, simply release the lock (Line 5 of Get, to avoid deadlock) and return the object (Line 6 of the Get API). If the probe fails, the requested object has not been inserted into the Redis cache yet; go through the regular S3 read routine, and populate the Redis cache (Line 14 of the Get API) only once you have fetched the object from S3.
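
For reference, here is a minimal sketch of those Redis calls using the redis-py client (the host/port are assumptions for a local Redis instance):

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379)

def cache_get(key):
    return r.get(key)   # returns None on a cache miss (the NULL case above)

def cache_set(key, obj):
    r.set(key, obj)     # write-through insert on the Put path / populate on the Get path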

Collaboration Policy

Please stick with your current team setup (same as in Homework Assignment 1).

You must write all the code you hand in, except for code that I give you as part of the assignment. You are not allowed to look at anyone else’s solution. You may discuss the assignments with other students, but you may not look at or copy each others’ code. Please do not publish your code on the Internet – for example, please do not make your code visible on GitHub.

Software Development

You will implement Homework Assignment 2 on top of the code framework that you wrote in Homework Assignment 1. I supply you with a partial client library implementation (just the framework and some boring bits) and some tests as a starting point.

Required Python (2.7) dependencies (redis and hashlib) can be installed using pip:

pip install redis
pip install hashlib

redis is the Python client library for Redis; see the instructions for detailed usage. hashlib is Python's standard hash library (bundled with Python 2.5 and later, so the pip install above may be unnecessary); see a detailed tutorial about hash functions and hashlib usage here.

You can build Redis from scratch by following the instructions on GitHub. Run Redis by simply typing:

./redis-server --port 6379

For more comprehensive configuration, refer to instructions of Redis configurations.

Feel free to use Dockerized Redis binaries if that suits your needs better; Docker can streamline this kind of setup. Again, for ease of development and testing, you can tear down a running etcd or Redis instance by simply killing the process, and re-run it for repeated testing.

Getting Started

Download the s3kv wrapper code (slightly adapted from Homework Assignment 1) here. Your job is to add your logic to finish the API implementation (Get/Put). For testing, you can modify driver.py to read from/write to your S3 buckets, or modify the little benchmarking tool s3kv_slap.py to perform single-threaded or multi-threaded benchmarking against your S3KV implementation.

For testing instructions, follow those of Homework Assignment 1.

Handin Procedure

Submit your code via Mason’s GitLab at

https://git.gmu.edu/users/sign_in.

Please create a separate GitLab repository for Homework Assignment 2. Your GitLab repository for Homework Assignment 2 should include:

s3kv_hw2.py,

and/or:

s3kv_slap_hw2.py,
driver_hw2.py,
-AND/OR- your own driver and test code.

When submitting, share your repository with my GitLab ID: yuecheng. I will use the timestamp of your last commit for the purpose of calculating late days.

Extra Bonus (25%)

1. If interested, you can measure the runtime of the provided workloads against both s3kv and s3kv_hw2, and report the observed workload completion times in the README file of your Homework Assignment 2 repository. Redis should significantly improve the performance of read-intensive workloads.

2. The current Get/Put pseudo code does not implement garbage collection (GC), and the current Put algorithm is costly in the sense that each update creates an extra copy under a unique key (key|h). To reduce the storage overhead, you can implement a GC algorithm that runs as a background thread and periodically removes stale data copies under the original key (e.g., one associated with a write hotspot), as in the sketch below.
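
One possible (purely illustrative) approach, assuming key|h denotes the key and hash joined by a '|' separator and reusing the etcd_client and s3 handles from the earlier sketches; a complete version would also hold the etcd lock while deleting, to avoid racing with concurrent Puts:

import threading
import time

def gc_loop(bucket_name, key, interval=60):
    while True:
        time.sleep(interval)
        live_h = etcd_client.read('/mds/' + bucket_name + key).value
        listing = s3.list_objects_v2(Bucket=bucket_name, Prefix=key + '|')
        for entry in listing.get('Contents', []):
            if entry['Key'] != key + '|' + live_h:  # keep only the live copy
                s3.delete_object(Bucket=bucket_name, Key=entry['Key'])

t = threading.Thread(target=gc_loop, args=('mybucket', 'mykey'))
t.daemon = True  # let the GC thread die with the main program
t.start()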

Note

No credit if your code does not compile/run. Late submissions of homework assignments will be penalized 15% per day, and will not be accepted more than 3 days after the due date. Students are responsible for keeping backups of their work while working on an assignment.


Homework Assignment 3

Important

This Homework Assignment is due: 11:59pm Sep 27.

Pick your project

In this assignment, you just have to pick which project to work on. Here is how: read the projects page and find at least three projects you'd be willing to work on, then sign up for a meeting slot with me.

At the meeting, we'll decide together which project you'll work on. Note that my goal differs from yours: mine is to get an interesting distribution of projects; yours is to do the project you like most. So my advice: don't get too attached to any one project; rather, have a few in mind that would be palatable.

Projects may be done in groups of up to two. You can fill out your team info here. See you soon!