OPIC (Object Persistence in C)
Felix Chern (dryman)
Created: 2016-09-03 Sat 10:49
1 About Me
- Felix Chern, Software Engineer at Google (2016)
- OpenX, Big Data Engineering tech lead (2014-2016)
- SupplyFrame, Big Data Engineer (2013-2014)
- NTU ME bachelor degree (2011)
2 What is OPIC?
- Object Persistance in C
- C objects that can be stored to disk, or transferred over network.
- Vision: redefine big data industry
- Strategy:
- Scalable, off-memory data structures
- Complexity of O(log(N)) to compete with O(Nlog(N)) platforms
- Applications: databases, search engines, all super scale data services (e.g. google map)
3.1 Data structures
- Computer Science fundamentals
- Reduces complexity of some predefined queries or operations
- index -> data (array)
- key -> value (map)
- key -> existance of key (set)
- enqueue, dequeue
- Usually implemented in memory
3.2 Big data
- Large scale data processing algorithm is based on sorting
- Map Reduce is Map O(N), shuffle (merge sort) O(Nlog(N)), and reduce O(N) total CPU time
- Hadoop and Spark are both based on merge sort.
- Sorting may not suit for
- Fast look up and search by index, join by index etc.
- OLAP application (slice and dice)
- Full text search, information retrieval, large scale machine learning
- Computing time: minutes to days
3.3 Off-memory data structures
- Database indexes
- btree
- hash table
- primary key, foreign key
- Search engine indexes
- suffix tree, suffix array
- Usually computes in sub-seconds
- Every data structure is application specific
4 Enter OPIC
- Write general purpose data structures
- RB tree
- B-tree
- Hash table
- Graph
- All data structures can be serialized to disk or network
- Compare to databases:
- Off-memory data structures are not application specific.
- Database, search engine, machine learning, etc. all share the same abstraction.
- Ideal case: dump a big chunk of memory to disk and restore it back
5 Challenges
- In memory data structures are presented by pointers
- How do we manage pointers off-memory?
6 Pointer categories
Imagine we have a managed memory that we can serialize.
There are three types of pointers that we need to handle:
- Internal references in managed memory
- In-bound: points into the managed memory
- Out-bound: points to external addresses
6.1 Internal ref & in-bound
- Turns out internal reference and in-bound pointers are similar
- serialize:
- pointer -> reference (valid off memory representation, discuss later)
- deserialize:
6.2 Out-bound
- Depends on what the pointer points to
- Example:
- pointing to in-memory object (X)
- function pointer (Hard)
- pointing to another managed memory
- pointing to class type (will discuss later)
- Generally hard, avoid as much as possible
7 Internal reference and in-bound pointers
7.1 Concept
- Objects with same type are stored in one (logical) array
- Reference defined as "type id + offset in array"
- Typed logical arrays are managed by a memory manager
- Each memory manager can bulk serialize to disk and restore back
- A program might maintain multiple memory manager
7.2 Memory manager (1)
- typed logical array
- type info: type name, size, etc.
- type slots
- Each slot is sizeof(object) * N
- When slot is full, create new slot of 2N
- free memory := put object address into a priority queue
- alloc :=
- find address in priority queue
- if not found, find new space in slot
7.3 Memory manager (2)
- types are stored in type-map
- also maintain a map of pointer -> type slot
- so that we can transfer a pointer to reference
8 Out-bound pointers
- Can we eliminate all out-bound pointers?
- No. We need it to represent the type of an object
8.1 Object type implementaions in different languages
- C++: vtable. number of pointers to vtable is implementation defined
- Rust: trait pointers. Hard to gather
- OCaml: JIT & compiled runtime has different form of vtable (?). Garbage collected memory is also hard to serialize.
9.1 Why?
- Build our own vtable runtime is easier than hacking other language's runtime
- Learn from other language, make it better
10 OO system design
- Generic type system (e.g. Map<String, Long>)
- Runtime compose instead of static compose (C++ template is static compose)
- Smaller binary and memory footprint
10.1 Generic type
class Map<WritableComparable> {
write(WritableComparable data, DataOutput out) {
data.write(out);
// write is defined in WritableComparable interface
}
}
- Container object can call methods defined in generic interface
- Container do not need to know the implementation detail of data.write()
10.2 Runtime compose vs Static compose
- C++ is statically compose (template)
- Template only lives in header
- Duplicates binary for each composed types
- Hard to create shared libraries for containers
- Hence we choose runtime composed objects
- more runtime danger
- May be solved by creating a new language like vala
11.1 Interface based OO system
- Define interfaces
- Define a class that implements one or more interfaces
- Each class has an global class object
- global class object direct the interface methods to real implementation
- Object of same type has a ISA pointer points to the same class object
12.1 Interfaces: Collection, List
bool coll_add(OPObject* collection, OPGeneric element);
bool coll_isEmpty(OPObject* collection);
OPGeneric li_get(OPObject* list, size_t index);
bool li_insert(OPObject* list, size_t index, OPGeneric element);
12.2 LinkedList Class
- Declare class object in header (extern)
- Define class object in C file:
// just pseudo code
struct OPLinkedList_KLASS
{
const char* const classname = "OPLinkedList";
const size_t size = 10;
TypeClass** traits;
// points to list of interface instances
}
bool LinkedList_coll_add(OPObject* collection, OPGeneric element);
bool LinkedList_coll_isEmpty(OPObject* collection);
OPGeneric LinkedList_li_get(OPObject* list, size_t index);
bool LinkedList_li_insert(OPObject* list, size_t index, OPGeneric element);
12.3 LinkedList Class (cont.)
- Before entering main:
- Create Collection typeclass(interface) object
- Add function pointer to these methods:
bool LinkedList_coll_add(OPObject* collection, OPGeneric element);
bool LinkedList_coll_isEmpty(OPObject* collection);
- Add the collection typeclass object to OPLinkedList_KLASS.traits
- Do the same for List typeclass.
- All the above are wrapped in C Macros.
13 Generic Function in C (1)
- Goal: a C function that behave differently by object passed in.
OPGeneric li_get(OPObject* obj, size_t index);
-
li_get on linked list takes O(N)
-
li_get on array list takes O(1)
- Problem: C does not has method override
13.1 Generic Function in C (2)
13.2 Generic Function in C (3)
- Implementation of
- list is a generic list, could be array list, linked list, etc.
- Get list->ISA, unique pointer for each class
- Use list->ISA as a cache key to lookup function pointer
- If the function pointer is found, execute fp(list, size_t index)
- If cache miss,
- traverse through ISA traits and find the function pointer,
- store ISA and function pointer to cache (C11 atomic store)
- execute the function pointer
14 Integrate OPIC OO system with serialization
14.1 Serializing:
- Each object has one out-bound pointer (ISA)
- Serializing steps
- write type header of the typed array
- type name
- total size of the typed array
- Merge multiple slot to one array
- Convert internal pointers to references
- Write the array to disk
14.2 deserializing
- Read the type info from type header
- Find the type ISA in global type map
- From the type ISA we know the size of the object
- For each object
- Fill the ISA pointer
- convert references to pointers
15.1 trade-offs in design (1)
- No inheritance, even though it could be implemented
- Enforce static type checking (no impl for interface => complie fail)
- Use gcc extensions extensively
-
__attribute__((constructor)): register class object to global map before main
-
__attribute__(aligned(256)): align address so we can store some extra flags in the pointers
- SIMD vector types: Way easier to use than assemblies
- All the features above is supported by gcc & clang
15.2 trade-offs in design (2)
- Only support 64bit machines, may limit to x86_64
- pointer size varies on different architectures
- too much work to support different size of pointers
- object size changes
- reference may look different
- 64bit->32bit may cause overflow
- Big data processing machines are all x86_64
16 Future challenges and roadmap
- concurrent memory manager (C11 atomic)
- cross memory manager reference
- LRU cache for memory managers
- compressed memory manager
- primitive types compression (may prioritize)
- distributed computing design (may use consistent hasing intensively)
- distributed data store design
16.1 Roadmap 2 (nice to have)
- Implement a mysql storage engine based on OPIC
- Distributed OLAP prototype.
- Software transactional memory + durability on data structures
- WAL for limited set of data structure operations is possible
- Prototype NoSQL DB like LevelDB/RocksDB
- Durable program state?
- Browser JS state serialized and off-load to memory or compressed
- Smaller memory footprint for browsers
1
OPIC (Object Persistence in C)Felix Chern (dryman)Created: 2016-09-03 Sat 10:49