# Building management frameworks for distributed systems using Python and SaltStack
## Who we are
- An Indian open-source software based storage company
- Based out of a non-urban centre for certain philosophical reasons
- Interests extend beyond storage/IT
## Our intentions in storage/IT
- Build high-performance storage stacks (software and hardware) that are affordable to all markets, not just developed ones, by leveraging open-source stacks.
- Parallel file system based product
- Single node unified storage product (Unified-NAS)
## About this presentation
- Our experiences in building out a product based on a distributed file system (glusterfs)
## What we will cover
- A brief description of our product
- Why we needed a distributed management framework
- A brief intro to SaltStack
- How we are using SaltStack to address our needs
- Our experiences (joys and pain points)
## Disclaimers
- No allegiance to SaltStack
- Purely our experiences
- We have not played around much with other stacks, so we have no grounds for in-depth comparisons
## Our PFS based product
- Hardware optimized for high performance and low power consumption; soon moving to support any underlying hardware.
- Based on open underlying software stacks: Linux (CentOS), Gluster.
- Use gluster to aggregate the storage on multiple nodes and present it as one namespace
- A significantly simplified, unified management framework to manage the whole system
## Why we need a distributed management framework
- We deal with potentially a large number of systems
- Need to configure all the systems from a single point
- Need to monitor all the systems from a single point
- Need to troubleshoot all the systems from a single point
## Our potential choices
- Write our own underlying communication/management infrastructure (reinventing the wheel)
- Choose an existing open-source communication/management infrastructure (a much more robust and sustainable choice)
## What we need from the infrastructure
- Strong developer community
- Python based (since we are python based)
- High performance (since we potentially will be communicating with a large number of systems)
- Open Source
## Why SaltStack
- It answered all our needs:
- Open source with an active developer community.
- Python based
- Based on ZeroMQ for high performance and parallel operations
## What is SaltStack
- A brief intro/history of SaltStack
## SaltStack model
- Uses a master/minion model
- Can target all or a subset of minions from the master (a small sketch follows this list)
- Salt keys (requested, accepted, deleted)
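
Targeting can be driven from Python through Salt's client API, the same machinery behind the `salt` CLI. A minimal sketch, with hypothetical minion names; on older Salt releases the `tgt_type` keyword is called `expr_form`:

```python
# Minimal targeting sketch using Salt's Python client API (runs on the master).
import salt.client

local = salt.client.LocalClient()

# Target every accepted minion (equivalent to: salt '*' test.ping)
all_nodes = local.cmd("*", "test.ping")

# Target a subset with an explicit list (minion ids here are hypothetical)
some_nodes = local.cmd(["node1", "node2"], "test.ping", tgt_type="list")

print(all_nodes)
print(some_nodes)
```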
## Some networking prerequisites
- All our nodes need initial IP/DNS configuration through their terminal
- All the nodes are salt minions
- Have 2 salt masters
- All minions are preconfigured to point to the salt masters using a preset DNS name (a config sketch follows)
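
For illustration only, the minion configuration each node is pre-seeded with could look along these lines; the DNS names are placeholders, not our actual values:

```python
# Sketch: write a minimal /etc/salt/minion pointing the node at both masters
# via preset DNS names (hostnames below are hypothetical placeholders).
import yaml

minion_config = {
    # Salt accepts a list of masters for multi-master setups
    "master": ["saltmaster1.storage.lan", "saltmaster2.storage.lan"],
}

with open("/etc/salt/minion", "w") as f:
    yaml.safe_dump(minion_config, f, default_flow_style=False)
```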
## So how do we use SaltStack
- When any node boots up after the initial configuration:
- It communicates with the salt master
- So we know the list of pending minions.
- Use this in the UI to add nodes into our monitored pool
- Once they are part of Salt's accepted minion list, they can be monitored remotely (a sketch of the key handling follows).
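
A minimal sketch of the key handling behind that flow, using Salt's wheel client; exact API names vary a little between Salt releases, and the minion id is hypothetical:

```python
# Sketch: list pending minion keys and accept one from the master.
import salt.config
import salt.wheel

opts = salt.config.master_config("/etc/salt/master")
wheel = salt.wheel.WheelClient(opts)

# 'minions_pre' holds keys that have been requested but not yet accepted
keys = wheel.cmd("key.list_all")
pending = keys.get("minions_pre", [])

# Accept a node that the admin selected in the UI
if "node3" in pending:
    wheel.cmd("key.accept", ["node3"])
```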
## Example - Creating a gluster distributed volume
- Requires an understanding of gluster concepts
- Using gluster manually requires the admin to choose the location of the data.
- Since we know about all the nodes through salt, we distribute the volumes as widely as possible to maximize performance.
- But volume creation has some prerequisites: the gluster bricks need to be placed in ZFS datasets
- We offer a choice of deduplicated, compressed or normal underlying storage into which the gluster bricks can be placed.
- Based on the admin's choice, we use salt to first create datasets for the bricks with the appropriate properties on all the nodes.
- What happens if this fails on any node?
- We need to do a selective rollback, again using salt (a simplified sketch of this flow follows)
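
A much simplified sketch of that flow, assuming `cmd.run`/`cmd.run_all` shell out to the `zfs` and `gluster` CLIs; pool, dataset, volume and node names are all hypothetical, and on older Salt releases `tgt_type` is `expr_form`:

```python
# Sketch: create brick datasets everywhere, roll back on partial failure,
# otherwise create the distributed gluster volume.
import salt.client

local = salt.client.LocalClient()
nodes = ["node1", "node2", "node3", "node4"]
dataset = "pool0/vol1_brick"

# Step 1: create the brick dataset (compressed, per the admin's choice) on all nodes
results = local.cmd(
    nodes, "cmd.run_all", ["zfs create -o compression=on {0}".format(dataset)],
    tgt_type="list",
)

failed = [n for n in nodes if results.get(n, {}).get("retcode", 1) != 0]
succeeded = [n for n in nodes if n not in failed]

if failed:
    # Step 2a: selective rollback - destroy the datasets that did get created
    local.cmd(succeeded, "cmd.run", ["zfs destroy {0}".format(dataset)], tgt_type="list")
else:
    # Step 2b: all bricks ready - create the volume from any one node
    bricks = " ".join("{0}:/{1}".format(n, dataset) for n in nodes)
    local.cmd(nodes[0], "cmd.run", ["gluster volume create vol1 {0} force".format(bricks)])
```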
## Example - Constant monitoring
- Requires monitoring of many resources on each node to be collected and reported centrally.
- We use custom SaltStack modules to pull information from various sources: IPMI, hard disks (SMART monitoring, etc.), ZFS status, network stats, CPU, memory, gluster, and more (a module sketch follows this list)
- These scripts are run via cron from the primary nodes to fetch JSON-formatted status data.
- Alerts are set up based on certain conditions triggered from this status data.
- Our web based monitoring framework then pulls this latest data whenever needed.
- Salt also allows us to do things like pull logs from any node on demand.
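
As an illustration, one of these custom execution modules might look roughly like this; the module, function and key names are made up for the example, and the real modules gather far more:

```python
# Sketch of a custom execution module, e.g. /srv/salt/_modules/node_status.py,
# synced to minions with saltutil.sync_modules and then callable like any
# built-in module.
import json


def status():
    """Collect a few node health figures and return them as a JSON string."""
    data = {
        # __salt__ is injected by the Salt loader and exposes other modules
        "zpool": __salt__["cmd.run"]("zpool status -x"),
        "load": __salt__["status.loadavg"](),
        "disk_usage": __salt__["disk.usage"](),
    }
    return json.dumps(data)
```

The cron job on the primary nodes then invokes it across the pool (e.g. `salt '*' node_status.status`), and the JSON feeds the alerting and the web UI.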
## Distributed Test Framework
- Automating FIO testing and result collection using SaltStack
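
A rough sketch of what that looks like, assuming fio is installed on every node and a test volume is mounted at a known path; the job parameters, path and names are placeholders:

```python
# Sketch: run the same fio job on every minion and collect its JSON output.
import json
import salt.client

local = salt.client.LocalClient()

fio_cmd = (
    "fio --name=seqwrite --rw=write --bs=1M --size=1G "
    "--directory=/mnt/testvol --output-format=json"
)

raw = local.cmd("*", "cmd.run", [fio_cmd], timeout=600)
results = {node: json.loads(out) for node, out in raw.items()}

for node, r in sorted(results.items()):
    bw = r["jobs"][0]["write"]["bw"]  # write bandwidth as reported by fio (KiB/s)
    print("{0}: {1} KiB/s".format(node, bw))
```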
## Many more examples
- We use salt extensively for many more functions: dynamic DNS updates, handling disk failures, node flagging, services control, distributed Samba control, and more.
- It would have taken us much, much longer to implement all this without a platform like SaltStack
## We are open
- We are just cleaning up our code
- Watch https://github.com/fractalio for more updates.
- Contributors welcome :-)