Sweep Away the Garbage
for scalable, fault-tolerant shared VM storage
Adam Litke - alitke@redhat.com
FOSDEM 2016 - 30 January 2016
The next 40 minutes
- oVirt shared storage architecture
- Preventing data corruption
- Recovering from failure
- Examples
Multi-host shared VM storage
Datapath operations
- A VM or host accessing volume contents
- These are the most common and most important
- Lots of IO
- Long running
- Narrow in scope
Example: VM volume access
Example: Host volume access
Metadata operations
- Adding / removing / rearranging storage objects
- Changing storage domain metadata
- Minimal IO
- Short running
- Can have broad scope
Preventing conflicts
- Requirement: data integrity
- Goal: maximize concurrency
- Interaction between storage objects is complex
- Orchestration required across several domains
Same VM on multiple hosts
Conflicting metadata updates
Management level locking
- Entities are locked while executing user-driven actions
  - Lock an image during creation
  - Lock a VM while taking a snapshot
  - Lock a host while it modifies storage
Shared storage locking
- Implemented using Sanlock
- Lockspace is on shared storage
- Leases grant hosts exclusive access to storage resources
  - Storage domain lease: needed for metadata changes
  - Volume lease: protects volume contents
More about Sanlock
- Host IDs
  - Every host has a unique ID
  - Uniqueness is enforced by Sanlock
  - IDs must be renewed periodically
  - A host that fails to renew its ID surrenders all of its resource leases
- Resource leases
  - Represent an arbitrary resource (storage or otherwise)
  - Misbehaving hosts will be fenced (rebooted)
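The renewal rule above can be illustrated with a toy in-memory model. This is a sketch only and does not use Sanlock's API or on-disk format: the `HostLease` class, its method names, and the timeout value are all hypothetical.

```python
import time


class HostLease:
    """Toy model of a host ID that must be renewed periodically."""

    def __init__(self, host_id, timeout, clock=time.monotonic):
        self.host_id = host_id
        self.timeout = timeout        # hypothetical expiry interval, in seconds
        self._clock = clock
        self._renewed_at = clock()
        self.resources = set()        # resource leases held under this host ID

    def renew(self):
        """Record a successful renewal of the host ID."""
        self._renewed_at = self._clock()

    def is_live(self):
        """True while the last renewal is younger than the timeout."""
        return self._clock() - self._renewed_at < self.timeout

    def acquire(self, resource):
        """Take a resource lease; only a live host ID may hold leases."""
        if not self.is_live():
            raise RuntimeError("host ID expired; cannot acquire leases")
        self.resources.add(resource)

    def expire_if_stale(self):
        """Surrender every resource lease if the ID was not renewed in time."""
        if self.is_live():
            return False
        self.resources.clear()
        return True
```

Passing a fake `clock` makes the expiry behaviour easy to exercise without sleeping.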
Process level locking
- Implemented with a local lock manager and RWLocks
- Locks grant threads either shared or exclusive access
  - Storage domain lock: protects metadata
  - Image lock: protects volume chain and metadata
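A minimal sketch of a readers-writer lock that grants threads shared or exclusive access, as described above. The `RWLock` class here is an illustration built on Python's standard `threading.Condition`; it is not the lock manager used by oVirt, and it makes no attempt to keep writers from starving.

```python
import threading
from contextlib import contextmanager


class RWLock:
    """Many concurrent readers, or exactly one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    @contextmanager
    def shared(self):
        # Readers wait only while a writer holds the lock.
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()

    @contextmanager
    def exclusive(self):
        # A writer waits until there is no writer and no reader.
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()
```

A thread reading a volume chain would use `with lock.shared():`, while a thread rearranging the chain would use `with lock.exclusive():`.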
Handling interruptions
- Some steps in a task are never completed
- Interruptions happen naturally or due to bugs
  - Power or network outage
  - Hardware failure
  - Software failure
- Must be carefully mitigated to keep storage coherent
- Approaches
  - Storage task manager with rollback capability
  - Storage transactions with garbage collection
Interrupted volume creation
Solution: Transactional Storage
- Storage transactions
- Garbage collection
- Monitoring and resolution
Storage transactions
- Storage commands must be a single transaction
- A transaction is opened with a marker operation
- Subsequent steps accumulate "garbage" on storage
- A transaction is committed by converting the start marker
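The open, accumulate, commit pattern can be sketched on a plain file system. This is an illustration under stated assumptions, not the actual oVirt implementation: the `.volatile` suffix, the `create_volume` function, and the `data` and `meta` file names are all hypothetical.

```python
import os
import tempfile

VOLATILE_SUFFIX = ".volatile"   # hypothetical marker; real naming may differ


def create_volume(domain_dir, name, size):
    """Create a sparse volume as a single transaction."""
    final_path = os.path.join(domain_dir, name)
    volatile_path = final_path + VOLATILE_SUFFIX

    # Open the transaction: the volatile name acts as the start marker.
    os.mkdir(volatile_path)

    # Intermediate steps: after a crash, everything in here is garbage.
    with open(os.path.join(volatile_path, "data"), "wb") as f:
        f.truncate(size)
    with open(os.path.join(volatile_path, "meta"), "w") as f:
        f.write("size=%d\n" % size)

    # Commit: convert the marker with a single rename.
    os.rename(volatile_path, final_path)
    return final_path


if __name__ == "__main__":
    domain = tempfile.mkdtemp()
    print(create_volume(domain, "vol1", 1024 * 1024))
```

If the process dies before the rename, the directory keeps its marker, so a later pass can recognize it as an incomplete transaction.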
Garbage collection
- Runs periodically on an arbitrary host
- Identifies candidates by finding markers
- Acquires necessary locks for the candidate
- Verifies the candidate should be collected
- Cleans garbage associated with the marker
- Removes the marker
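The collection steps above can be sketched as a single pass over a directory. The names here are hypothetical: `.volatile` stands in for the marker, a process-local `threading.Lock` stands in for the real leases and locks, and `is_abandoned` is a placeholder for whatever verification the real collector performs.

```python
import os
import tempfile
import threading

VOLATILE_SUFFIX = ".volatile"      # hypothetical marker
_domain_lock = threading.Lock()    # stand-in for the real leases and locks


def is_abandoned(path):
    """Placeholder check: the candidate still exists as a marked directory."""
    return os.path.isdir(path)


def collect_garbage(domain_dir):
    """Run one collection pass and return the names that were removed."""
    removed = []
    # Identify candidates by finding markers.
    candidates = [entry for entry in sorted(os.listdir(domain_dir))
                  if entry.endswith(VOLATILE_SUFFIX)]
    for entry in candidates:
        path = os.path.join(domain_dir, entry)
        # Acquire the necessary locks for the candidate.
        with _domain_lock:
            # Verify it should still be collected; it may have been
            # committed or cleaned since the scan.
            if not is_abandoned(path):
                continue
            # Clean the garbage associated with the marker
            # (this sketch assumes the directory holds only files).
            for child in os.listdir(path):
                os.remove(os.path.join(path, child))
            # Remove the marker last, so an interrupted pass can be retried.
            os.rmdir(path)
            removed.append(entry)
    return removed


if __name__ == "__main__":
    domain = tempfile.mkdtemp()
    os.mkdir(os.path.join(domain, "vol2" + VOLATILE_SUFFIX))
    print(collect_garbage(domain))
```

Removing the marker only after its contents are gone means that a collector interrupted mid-pass leaves a marker behind for the next pass to find.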
Monitoring and resolution
- Running commands raise events or can be polled
  - Progress
  - State changes
  - Error code and context
- Command results are not persistent
- Success or failure is evident from examining storage
Practical examples
- Create volume
- Remove volume
- Clone volume
Create volatile image directory
Create volatile metadata file
Invoke the garbage collector
Acquire source image lock
Acquire target image lock
Acquire source volume lease
Acquire target volume lease
Mark target volume illegal
Release target volume lease
Release source volume lease
Release target image lock
Release source image lock
Locking order
- Strict rules needed to prevent deadlock
- Local locks before storage leases
- Big containers before smaller containers
  - Storage Domain ➡ Image ➡ Volume
- Source volume before destination volume
- Release the newest locks first
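The acquire and release sequence from the clone volume steps can be sketched with `contextlib.ExitStack`, which releases in the reverse of the acquisition order. The `held` helper and the `clone_volume` function are hypothetical stand-ins that only record what would be locked; they do not take real locks or leases.

```python
from contextlib import ExitStack, contextmanager


@contextmanager
def held(name, log):
    """Hypothetical stand-in for an image lock or a volume lease."""
    log.append("acquire " + name)
    try:
        yield
    finally:
        log.append("release " + name)


def clone_volume(log):
    """Record the lock sequence for a clone; no data is actually copied."""
    order = [
        "source image lock",
        "target image lock",
        "source volume lease",
        "target volume lease",
    ]
    with ExitStack() as stack:
        for name in order:
            stack.enter_context(held(name, log))
        log.append("do work")   # placeholder for the actual clone steps
    # Leaving the ExitStack releases the newest acquisition first.
    return log
```

Because the release order is derived from the acquisition order, a failure partway through the acquisitions still unwinds only what was actually taken.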