OPIC (Object Persistence in C) – Felix Chern (dryman) – 1 About Me



OPIC (Object Persistence in C) – Felix Chern (dryman) – 1 About Me

0 0


opic-slide


On Github dryman / opic-slide

OPIC (Object Persistence in C)

Felix Chern (dryman)

Created: 2016-09-03 Sat 10:49

Table of Contents

1 About Me

  • Felix Chern, Software Engineer at Google (2016)
  • OpenX, Big Data Engineering tech lead (2014-2016)
  • SupplyFrame, Big Data Engineer (2013-2014)
  • NTU ME bachelor degree (2011)

2 What is OPIC?

  • Object Persistance in C
    • C objects that can be stored to disk, or transferred over network.
  • Vision: redefine big data industry
  • Strategy:
    • Scalable, off-memory data structures
    • Complexity of O(log(N)) to compete with O(Nlog(N)) platforms
  • Applications: databases, search engines, all super scale data services (e.g. google map)

3 Background

3.1 Data structures

  • Computer Science fundamentals
  • Reduces complexity of some predefined queries or operations
    • index -> data (array)
    • key -> value (map)
    • key -> existance of key (set)
    • enqueue, dequeue
  • Usually implemented in memory

3.2 Big data

  • Large scale data processing algorithm is based on sorting
    • Map Reduce is Map O(N), shuffle (merge sort) O(Nlog(N)), and reduce O(N) total CPU time
    • Hadoop and Spark are both based on merge sort.
  • Sorting may not suit for
    • Fast look up and search by index, join by index etc.
    • OLAP application (slice and dice)
    • Full text search, information retrieval, large scale machine learning
  • Computing time: minutes to days

3.3 Off-memory data structures

  • Database indexes
    • btree
    • hash table
    • primary key, foreign key
  • Search engine indexes
    • suffix tree, suffix array
  • Usually computes in sub-seconds
  • Every data structure is application specific

4 Enter OPIC

  • Write general purpose data structures
    • RB tree
    • B-tree
    • Hash table
    • Graph
  • All data structures can be serialized to disk or network
  • Compare to databases:
    • Off-memory data structures are not application specific.
    • Database, search engine, machine learning, etc. all share the same abstraction.
  • Ideal case: dump a big chunk of memory to disk and restore it back

5 Challenges

  • In memory data structures are presented by pointers
  • How do we manage pointers off-memory?

6 Pointer categories

Imagine we have a managed memory that we can serialize. There are three types of pointers that we need to handle:

  • Internal references in managed memory
  • In-bound: points into the managed memory
  • Out-bound: points to external addresses

6.1 Internal ref & in-bound

  • Turns out internal reference and in-bound pointers are similar
  • serialize:
    • pointer -> reference (valid off memory representation, discuss later)
  • deserialize:
    • reference -> pointer

6.2 Out-bound

  • Depends on what the pointer points to
  • Example:
    • pointing to in-memory object (X)
    • function pointer (Hard)
    • pointing to another managed memory
    • pointing to class type (will discuss later)
  • Generally hard, avoid as much as possible

7 Internal reference and in-bound pointers

7.1 Concept

  • Objects with same type are stored in one (logical) array
  • Reference defined as "type id + offset in array"
  • Typed logical arrays are managed by a memory manager
  • Each memory manager can bulk serialize to disk and restore back
  • A program might maintain multiple memory manager

7.2 Memory manager (1)

  • typed logical array
    • type info: type name, size, etc.
    • type slots
      • Each slot is sizeof(object) * N
      • When slot is full, create new slot of 2N
      • free memory := put object address into a priority queue
      • alloc :=
      • find address in priority queue
      • if not found, find new space in slot

7.3 Memory manager (2)

  • types are stored in type-map
  • also maintain a map of pointer -> type slot
    • so that we can transfer a pointer to reference

7.4 Memory manager (3)

8 Out-bound pointers

  • Can we eliminate all out-bound pointers?
  • No. We need it to represent the type of an object

8.1 Object type implementaions in different languages

  • C++: vtable. number of pointers to vtable is implementation defined
  • Rust: trait pointers. Hard to gather
  • OCaml: JIT & compiled runtime has different form of vtable (?). Garbage collected memory is also hard to serialize.

9 Build our own OO in C

9.1 Why?

  • Build our own vtable runtime is easier than hacking other language's runtime
  • Learn from other language, make it better

10 OO system design

  • Generic type system (e.g. Map<String, Long>)
  • Runtime compose instead of static compose (C++ template is static compose)
    • Smaller binary and memory footprint

10.1 Generic type

  • Example in java:
class Map<WritableComparable> {
  write(WritableComparable data, DataOutput out) {
    data.write(out);
    // write is defined in WritableComparable interface
  }
}
  • Container object can call methods defined in generic interface
  • Container do not need to know the implementation detail of data.write()

10.2 Runtime compose vs Static compose

  • C++ is statically compose (template)
    • Template only lives in header
    • Duplicates binary for each composed types
    • Hard to create shared libraries for containers
  • Hence we choose runtime composed objects
    • more runtime danger
    • May be solved by creating a new language like vala

11 OPIC OO system

11.1 Interface based OO system

  • Define interfaces
  • Define a class that implements one or more interfaces
  • Each class has an global class object
    • global class object direct the interface methods to real implementation
  • Object of same type has a ISA pointer points to the same class object

12 Example: Linked List

12.1 Interfaces: Collection, List

  • Collection:
bool coll_add(OPObject* collection, OPGeneric element);
bool coll_isEmpty(OPObject* collection);
  • List:
OPGeneric li_get(OPObject* list, size_t index);
bool li_insert(OPObject* list, size_t index, OPGeneric element);

12.2 LinkedList Class

  • Declare class object in header (extern)
  • Define class object in C file:
// just pseudo code
struct OPLinkedList_KLASS
{
  const char* const classname = "OPLinkedList";
  const size_t size = 10;
  TypeClass** traits;
  // points to list of interface instances
}
  • Implements
bool LinkedList_coll_add(OPObject* collection, OPGeneric element);
bool LinkedList_coll_isEmpty(OPObject* collection);
OPGeneric LinkedList_li_get(OPObject* list, size_t index);
bool LinkedList_li_insert(OPObject* list, size_t index, OPGeneric element);

12.3 LinkedList Class (cont.)

  • Before entering main:
  • Create Collection typeclass(interface) object
    • Add function pointer to these methods:
bool LinkedList_coll_add(OPObject* collection, OPGeneric element);
bool LinkedList_coll_isEmpty(OPObject* collection);
  • Add the collection typeclass object to OPLinkedList_KLASS.traits
  • Do the same for List typeclass.
  • All the above are wrapped in C Macros.

13 Generic Function in C (1)

  • Goal: a C function that behave differently by object passed in.
OPGeneric li_get(OPObject* obj, size_t index);
  • li_get on linked list takes O(N)
  • li_get on array list takes O(1)
  • Problem: C does not has method override

13.1 Generic Function in C (2)

13.2 Generic Function in C (3)

  • Implementation of
  • list is a generic list, could be array list, linked list, etc.
  • Get list->ISA, unique pointer for each class
  • Use list->ISA as a cache key to lookup function pointer
  • If the function pointer is found, execute fp(list, size_t index)
  • If cache miss,
    • traverse through ISA traits and find the function pointer,
    • store ISA and function pointer to cache (C11 atomic store)
    • execute the function pointer

14 Integrate OPIC OO system with serialization

14.1 Serializing:

  • Each object has one out-bound pointer (ISA)
  • Serializing steps
    • write type header of the typed array
      • type name
      • total size of the typed array
    • Merge multiple slot to one array
    • Convert internal pointers to references
    • Write the array to disk

14.2 deserializing

  • Read the type info from type header
  • Find the type ISA in global type map
  • From the type ISA we know the size of the object
  • For each object
    • Fill the ISA pointer
    • convert references to pointers

15 Other notes

15.1 trade-offs in design (1)

  • No inheritance, even though it could be implemented
    • Enforce static type checking (no impl for interface => complie fail)
  • Use gcc extensions extensively
    • __attribute__((constructor)): register class object to global map before main
    • __attribute__(aligned(256)): align address so we can store some extra flags in the pointers
    • SIMD vector types: Way easier to use than assemblies
    • All the features above is supported by gcc & clang

15.2 trade-offs in design (2)

  • Only support 64bit machines, may limit to x86_64
    • pointer size varies on different architectures
    • too much work to support different size of pointers
      • object size changes
      • reference may look different
      • 64bit->32bit may cause overflow
    • Big data processing machines are all x86_64

16 Future challenges and roadmap

  • concurrent memory manager (C11 atomic)
  • cross memory manager reference
  • LRU cache for memory managers
  • compressed memory manager
  • primitive types compression (may prioritize)
  • distributed computing design (may use consistent hasing intensively)
  • distributed data store design

16.1 Roadmap 2 (nice to have)

  • Implement a mysql storage engine based on OPIC
  • Distributed OLAP prototype.
  • Software transactional memory + durability on data structures
    • WAL for limited set of data structure operations is possible
  • Prototype NoSQL DB like LevelDB/RocksDB
  • Durable program state?
    • Browser JS state serialized and off-load to memory or compressed
    • Smaller memory footprint for browsers

17 Thank you

1
OPIC (Object Persistence in C)Felix Chern (dryman)Created: 2016-09-03 Sat 10:49