
Spoke Scaling Plan

Shaka Lee edited this page Mar 20, 2018 · 1 revision

How to Scale Spoke: A Development Planning Doc

This document is aspirational and outlines a plan to improve Spoke's scaling. It should be read as an "RFC"-type proposal that, if accepted, should also serve as the development plan during intermediate stages of implementation and then, hopefully, morph into architecture documentation.

The Scaling Goal

Spoke should be able to scale to tens of thousands of synchronous/simultaneous texters and tens of millions of contacts.

Current known bottlenecks

The database and API requests currently (2/2018) scale to around 100 simultaneous texters and thousands of contacts. Uploading contacts is per-campaign and needs to scale to much larger files and campaigns. Should we consider organizing contacts at an 'organization' level rather than per-campaign? This would reduce database size across many campaigns. On the other hand, perhaps a system could clear out archived campaign data, and we should just facilitate that better.

Additional possible bottlenecks

Auth0 may not scale to 10s of thousands of contacts, and API validation requests on each call seem like a likely bottleneck just at the outside-request level.

Twilio is likely to scale much better than our system, but at millions of contacts there may be issues or API quotas that might be hit. We should consider better supporting Twilio's own suggestions on how to scale high volume.

Optimizing the Message-Response Cycle

The message-response cycle is the most important thing to optimize -- texters sending out messages initially, and then handling replies and knowing when there are new replies to handle.

Client-side optimization

Currently in containers/AssignmentTexterContact.jsx, each contact screen loads an individual contact's information and then calls the API again for the next screen. This is incredibly inefficient. Instead, an asynchronous process in the containers/TexterTodo.jsx component should gather contacts in batches of X, in sufficient quantity to feed the AssignmentTexterContact.jsx component as fast as it processes the data.
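As a rough illustration of the batching idea above, here is a minimal sketch of a client-side buffer that TexterTodo could own. The class name `ContactBuffer`, the `lowWaterMark` and `batchSize` options, and the `fetchBatch` callback are all hypothetical names chosen for this example; none of them exist in Spoke today.

```javascript
// Hypothetical helper (not part of Spoke): keeps a small local queue of
// contacts so the contact screen never waits on a per-contact API call.
class ContactBuffer {
  constructor({ lowWaterMark = 5, batchSize = 20 } = {}) {
    this.lowWaterMark = lowWaterMark; // refill when this few contacts remain
    this.batchSize = batchSize;       // the "X" contacts requested per call
    this.contacts = [];
    this.fetchInFlight = false;
  }

  // True when the buffer is running low and no request is already pending.
  needsRefill() {
    return !this.fetchInFlight && this.contacts.length <= this.lowWaterMark;
  }

  // Start a background fetch if needed. fetchBatch(n) is supplied by the
  // caller and must return a Promise resolving to an array of contacts.
  // Resolves to the number of contacts added (0 if no fetch was started).
  refill(fetchBatch) {
    if (!this.needsRefill()) return Promise.resolve(0);
    this.fetchInFlight = true;
    return fetchBatch(this.batchSize)
      .then((batch) => {
        this.contacts.push(...batch);
        return batch.length;
      })
      .finally(() => {
        this.fetchInFlight = false;
      });
  }

  // Hand the next buffered contact to the contact screen (undefined if empty).
  next() {
    return this.contacts.shift();
  }
}
```

In this sketch the component would call `refill` after each `next()`, so a new batch is requested while the texter is still working through the current one.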

Server-side optimization

To remove the database as a bottleneck we plan to introduce an optional Redis caching layer for all jobs that are part of the Message-Response Cycle. That includes:

accessRequired -- login, and more importantly access control for all the texter API calls
getNew -- all assigned contacts (w/ info) for a texter that have no message-thread yet and have status=needsMessage
getReplies -- all assigned contacts (w/ info) who have replies to be responded to -- status=needsResponse
incomingMessage -- the Twilio API call to update a contact's message thread and update status=needsResponse
sendMessage -- the API call from the texter to send a message to the contact (and update the status and message thread)
updateQuestionResponses -- the API call from texter to update the questionResponseValues of the contact
updateAssignments -- changes from the campaign admins to assign texters
dynamicAssignment -- for dynamic assignment-enabled campaigns, allowing a texter to 'take' a queue of contacts for assignment and begin sending

DB write queue

While the Redis cache layer will suffice for the real-time read/write interface during high traffic, we still want to ensure that the DB is eventually synchronized. This should be done with a message queue that synchronizes the database (eventually). Making this separate from Redis will allow for fault tolerance, but also adds an additional technology to be integrated and maintained. This work may align with other interest in the existing Spoke job queue (which handles, e.g., texter assignment and contact uploads), but it might not.
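The write path described here is a write-behind pattern: update the fast store right away, enqueue the same change, and let a worker apply it to the database later. The sketch below is only an illustration under simplifying assumptions. Plain in-memory Maps and an array stand in for Redis, the message queue, and the database, and the function names `recordMessage` and `drainWriteQueue` are hypothetical.

```javascript
// In-memory stand-ins so the sketch runs on its own. In the proposal these
// would be Redis, a separate message queue, and the Spoke database.
const cache = new Map();    // real-time store, keyed by contact cell
const writeQueue = [];      // pending database writes, oldest first
const database = new Map(); // system of record, synchronized eventually

// Real-time path: update the cache immediately and enqueue the DB write.
function recordMessage(contactCell, message) {
  const thread = cache.get(contactCell) || [];
  thread.push(message);
  cache.set(contactCell, thread);
  writeQueue.push({ contactCell, message });
}

// Background path: a worker applies queued writes to the database in order.
// Returns how many jobs were applied in this pass.
function drainWriteQueue(maxJobs = 100) {
  let applied = 0;
  while (applied < maxJobs && writeQueue.length > 0) {
    const job = writeQueue.shift();
    const rows = database.get(job.contactCell) || [];
    rows.push(job.message);
    database.set(job.contactCell, rows);
    applied += 1;
  }
  return applied;
}
```

Until the worker runs, reads served from the cache see the new message while the database does not, which is the "eventually synchronized" trade-off the paragraph above accepts.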

Some technologies that could support this:

Amazon SQS
Heroku AMQP: https://elements.heroku.com/addons/cloudamqp

Should we have a 'backup option' that is 'pure Redis' so folks can use that optionally?

Redis Data-structures

Besides satisfying the above data needs/write-workflows, keys should expire naturally when texters disengage (or go to bed :-).

Here is the (proposed) structure of data in Redis to support the above data needs (c_id is "contact id" and message_service_id is probably global, but maybe related to the organization in a multi-tenant configuration):

HASH: texterinfo-<texter_id> (access

Keys: {auth0_id, <org_id>=, is_superadmin}

HASH: replies-<texter_id>-<campaign_id>

Keys are the incoming message_id; values are serialized (JSON?) content: {contact_cell, c_id}

QUEUE: conversation-<contact_cell>-<message_service_id> -- the list of messages for a contact, both sent and received

HASH: contactinfo-<contact_cell>-<message_service_id>

Keys with values include {assigned texter_id, assignment_id, org_id, questionResponseValues, [contact info including, e.g. city/state]}

QUEUE: newassignments-<texter_id>-<campaign_id> -- full contact info for all new assignments (status=needsMessage)

HASH: unsentassigned-<texter_id>-<campaign_id>

Keys are <contact_cell>; values are full contact info and assignment id/status (same as values in newassignments-<texter_id>-<campaign_id> above)

QUEUE: dynamicassignments-<campaign_id> -- full contact info for all contacts ready for dynamic assignment in a campaign

KEY (regular SET call): campaign-<campaign_id> -- campaign data that is loaded in TexterTodo and TexterTodoList components
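To keep the key names above consistent across the codebase, one option is to build them in a single place. This is a minimal sketch only: the `keys` object and its function names are hypothetical, and the key formats simply mirror the proposed names listed above.

```javascript
// Hypothetical key builders mirroring the proposed Redis key names.
const keys = {
  texterInfo: (texterId) => `texterinfo-${texterId}`,
  replies: (texterId, campaignId) => `replies-${texterId}-${campaignId}`,
  conversation: (contactCell, messageServiceId) =>
    `conversation-${contactCell}-${messageServiceId}`,
  contactInfo: (contactCell, messageServiceId) =>
    `contactinfo-${contactCell}-${messageServiceId}`,
  newAssignments: (texterId, campaignId) =>
    `newassignments-${texterId}-${campaignId}`,
  unsentAssigned: (texterId, campaignId) =>
    `unsentassigned-${texterId}-${campaignId}`,
  dynamicAssignments: (campaignId) => `dynamicassignments-${campaignId}`,
  campaign: (campaignId) => `campaign-${campaignId}`,
};
```

For example, `keys.newAssignments(7, 3)` yields `newassignments-7-3`.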

Workflows using Data-structures

Access Control

Load from texterinfo-

getNew (needsMessage)

Load (and LPOP) from newassignments-<texter_id>-<campaign_id>
If newassignments is empty, we check unsentassigned-<texter_id>-<campaign_id> for content. If it has any, we ?resend that data with an additional setting in the getNew API response so the client knows that the 'original queue' is empty.
(if not empty) HSET unsentassigned-<texter_id>-<campaign_id> <contact_cell> [the assignment data]

getReplies (needsResponse)

Load from replies-<texter_id>
For each contact, load contactinfo-<contact_cell>-<message_service_id>

incomingMessage

Load contactinfo-<contact_cell>-<message_service_id> to look up the texter id assigned to the contact
LPUSH conversation-<contact_cell>-<message_service_id>
LPUSH message-write-queue
HSET replies-<texter_id> (using lookup)

sendMessage

If status is needsMessage, then confirm that it's the first item in conversation-<contact_cell>-<message_service_id>; otherwise, do not (re)send the message.
LPUSH conversation-<contact_cell>-<message_service_id>
LPUSH message-write-queue

updateQuestionResponses

HSET contactinfo-<contact_cell>-<message_service_id>

updateAssignments

Either LPUSH newassignments-<texter_id>-<campaign_id> OR dynamicassignments-<campaign_id>

dynamicAssignment (when the texter requests more assignments)

LPOP dynamicassignments-<campaign_id>
HSET contactinfo-<contact_cell>-<message_service_id>
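To make two of the workflows above concrete, here is a rough sketch of getNew and incomingMessage. It rests on several assumptions. `FakeRedis` is an in-memory stand-in that mimics only the semantics of the LPUSH, LPOP, HSET, HGET, and HVALS commands, so the example runs without a Redis server. The function signatures, the JSON serialization, the `queueEmpty` flag name, and the `texter_id`, `campaign_id`, and `c_id` fields in contactinfo are illustrative guesses rather than settled design.

```javascript
// In-memory stand-in for the handful of Redis commands the workflows use.
class FakeRedis {
  constructor() { this.lists = new Map(); this.hashes = new Map(); }
  lpush(key, value) {
    const list = this.lists.get(key) || [];
    list.unshift(value); // LPUSH adds at the head of the list
    this.lists.set(key, list);
    return list.length;
  }
  lpop(key) {
    const list = this.lists.get(key);
    return list && list.length > 0 ? list.shift() : null; // LPOP removes the head
  }
  hset(key, field, value) {
    const hash = this.hashes.get(key) || new Map();
    hash.set(field, value);
    this.hashes.set(key, hash);
  }
  hget(key, field) {
    const hash = this.hashes.get(key);
    return hash && hash.has(field) ? hash.get(field) : null;
  }
  hvals(key) {
    const hash = this.hashes.get(key);
    return hash ? [...hash.values()] : [];
  }
}

// getNew: pop one new assignment and remember it as "assigned but unsent".
// When the queue is empty, fall back to the unsent hash and flag that.
function getNew(redis, texterId, campaignId) {
  const newKey = `newassignments-${texterId}-${campaignId}`;
  const unsentKey = `unsentassigned-${texterId}-${campaignId}`;
  const raw = redis.lpop(newKey);
  if (raw === null) {
    const contacts = redis.hvals(unsentKey).map((v) => JSON.parse(v));
    return { queueEmpty: true, contacts };
  }
  const contact = JSON.parse(raw);
  redis.hset(unsentKey, contact.cell, raw);
  return { queueEmpty: false, contacts: [contact] };
}

// incomingMessage: look up the assigned texter, append to the conversation,
// enqueue the DB write, and record the reply for that texter.
function incomingMessage(redis, contactCell, messageServiceId, message) {
  const infoKey = `contactinfo-${contactCell}-${messageServiceId}`;
  const texterId = redis.hget(infoKey, "texter_id");
  if (texterId === null) return false; // unknown contact: nothing to route
  const campaignId = redis.hget(infoKey, "campaign_id"); // assumed field
  const cId = redis.hget(infoKey, "c_id");               // assumed field
  const payload = JSON.stringify(message);
  redis.lpush(`conversation-${contactCell}-${messageServiceId}`, payload);
  redis.lpush("message-write-queue", payload);
  redis.hset(
    `replies-${texterId}-${campaignId}`,
    message.id,
    JSON.stringify({ contact_cell: contactCell, c_id: cId })
  );
  return true;
}
```

One detail worth settling: in Redis, pairing LPUSH with LPOP hands back the most recently pushed item first. If assignments should be served oldest-first, pushing with RPUSH (or popping with RPOP) would give that ordering instead.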