Graph DBs in the Wild – the case of Douglas E. Goldman Jewish Genealogy Center – The Source Format - GEDCOM





graphs_in_the_wild

A presentation about importing family trees from GEDCOM files into Neo4j

On GitHub: daonb/graphs_in_the_wild

Graph DBs in the Wild

the case of Douglas E. Goldman Jewish Genealogy Center

presented by Benny Daon / @daonb

The Collection

For the past 30 years people have contributed close to 11,000 family trees with almost 5 million individuals

The Source Format - GEDCOM

  • Open Format
  • By the LDS Church
  • Initial release - 1984
  • Current version 5.5
  • de-facto standard

GEDCOM Individuals

0 @I6680@ INDI
1 NAME Albert /Einstein/
2 GIVN Albert
2 SURN Einstein
1 SEX M
1 BIRT
2 DATE 14 MAR 1879
2 PLAC Ulm, Badden Wurttemberg, Germany
1 DEAT
2 DATE 18 APR 1955
2 PLAC Princeton New Jersey
1 OCCU Professor, Physicist and Author of \
the theory of relativity
1 FAMC @F2097@
...
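
Every line in the record above has the same shape: a level number, an optional @pointer@, a tag, and an optional value. A minimal sketch of splitting one line into those parts might look like this. `GedcomLine` and `parse_line` are hypothetical names used for illustration, not python-gedcom's API, and the backslash continuation shown for OCCU is not handled.

```python
import re
from typing import NamedTuple, Optional

# level, optional @pointer@, tag, optional value (assumed line shape)
LINE_RE = re.compile(r"^\s*(\d+)\s+(?:(@[^@\s]+@)\s+)?(\w+)(?: (.*))?$")


class GedcomLine(NamedTuple):
    level: int
    pointer: Optional[str]
    tag: str
    value: str


def parse_line(line: str) -> GedcomLine:
    """Split one GEDCOM line into level, pointer, tag and value."""
    match = LINE_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("not a GEDCOM line: %r" % line)
    level, pointer, tag, value = match.groups()
    return GedcomLine(int(level), pointer, tag, value or "")


header = parse_line("0 @I6680@ INDI")
name = parse_line("1 NAME Albert /Einstein/")
```

Here `header.pointer` is the record's own id, while a cross-reference such as `1 FAMC @F2097@` carries the pointer in `value`.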
            

GEDCOM Families

0 @F2097@ FAM
1 HUSB @I6676@
1 WIFE @I6679@
1 CHIL @I6680@
1 CHIL @I6681@
1 MARR
2 DATE 10 OCT 1878
2 PLAC Milan Italy
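
A FAM record links individuals by pointer, so it maps naturally onto graph edges. The sketch below assumes the record has already been reduced to a dict of pointers (a hypothetical shape) and borrows the FATHER_OF, MOTHER_OF and SPOUSE relationship names from the Cypher queries in the API slide. `family_edges` is an illustrative name, not part of the project.

```python
def family_edges(family):
    """Turn one FAM record into (source, relationship, target) triples."""
    husband = family.get("HUSB")
    wife = family.get("WIFE")
    edges = []
    if husband and wife:
        edges.append((husband, "SPOUSE", wife))
    for child in family.get("CHIL", []):
        if husband:
            edges.append((husband, "FATHER_OF", child))
        if wife:
            edges.append((wife, "MOTHER_OF", child))
    return edges


# The pointers from the FAM record above, as a plain dict (assumed shape).
f2097 = {"HUSB": "@I6676@", "WIFE": "@I6679@", "CHIL": ["@I6680@", "@I6681@"]}
edges = family_edges(f2097)
```

Events such as MARR would still need a home, for example as properties on the SPOUSE edge, if no data is to be lost.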
            

python-gedcom

  • Originally written by Daniel Zappala at Brigham Young University
  • GPLv2 hosted on github
  • No support for different encodings
  • No tests
  • CRASHES
  • Java style Python

We're (not) All Consenting Adults

class Element:
    ...
    def level(self):
        """ Return the level of this element """
        return self.__level

    def pointer(self):
        """ Return the pointer of this element """
        return self.__pointer
    
    def tag(self):
        """ Return the tag of this element """
        return self.__tag
    ...
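
For contrast, here is a sketch of the same element with plain public attributes and no getters. It is an illustration only, not the fork's actual code, and the `value` and `children` fields are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    """A GEDCOM element whose fields are read directly, e.g. element.tag."""
    level: int
    pointer: Optional[str]
    tag: str
    value: str = ""
    children: List["Element"] = field(default_factory=list)


individual = Element(0, "@I6680@", "INDI")
individual.children.append(Element(1, None, "SEX", "M"))
```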

            

Our Fork

  • pythonic
  • a wee-bit of testing
  • handles different encodings
  • exceptions, no crashes
  • undocumented
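
One way the "different encodings" and "exceptions, no crashes" points could fit together is sketched below: honour a byte order mark if there is one, otherwise try a short list of encodings and raise a dedicated exception when none fits. The function name, the exception class and the fallback list are all assumptions, not the fork's behaviour. ANSEL, which GEDCOM 5.5 also allows, is left out because Python ships no codec for it.

```python
import codecs


class GedcomDecodeError(ValueError):
    """Raised when no candidate encoding can decode the file (hypothetical)."""


def decode_gedcom(raw, fallbacks=("utf-8", "cp1252")):
    """Decode the raw bytes of a GEDCOM file into text."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")
    for encoding in fallbacks:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise GedcomDecodeError("none of %r could decode the file" % (fallbacks,))
```

The caller gets either text or one well-defined exception to catch, rather than a crash halfway through an import.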

Why Neo4j?

  • We clearly have a graph
  • Our data is research-worthy
  • Rik Van Bruggen posted about gedcom
  • It's open & free (at least for us)
  • Installation is easy
  • A good frontend
  • py2neo

Why not Neo4j?

It's an ancient Java-based project with murky licensing

Modeling Specs

  • lose no data
  • GEDCOM's model is worth keeping
  • keep trees separate
  • allow forest-wide queries
  • protect the privacy of the living
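
The privacy requirement can be sketched as a filter on flattened records. The rule used here is an assumption, not the center's actual policy: a person counts as presumed living if the record has no DEAT data and the birth year is either missing or less than 100 years back. Such records are reduced to ids and links. The key names follow the API example in this deck; `presumed_living`, `redact` and `STRUCTURAL_KEYS` are hypothetical.

```python
import re

YEAR_RE = re.compile(r"\b(\d{4})\b")

# Keys that describe graph structure rather than the person (assumed list).
STRUCTURAL_KEYS = {"id", "tree_id", "parents", "spouses", "siblings", "children"}


def presumed_living(record, current_year, max_age=100):
    """Guess whether a flattened individual record describes a living person."""
    if any(key.startswith("DEAT") for key in record):
        return False
    match = YEAR_RE.search(record.get("BIRT_DATE", ""))
    if match is None:
        return True  # no evidence either way: err on the side of privacy
    return current_year - int(match.group(1)) < max_age


def redact(record, current_year):
    """Return a copy of the record, stripped of personal data if presumed living."""
    if not presumed_living(record, current_year):
        return dict(record)
    return {key: value for key, value in record.items() if key in STRUCTURAL_KEYS}
```

Keeping the links of redacted people means forest-wide queries can still walk through them.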

GEDCOM's legacy

Converting the files

Let's Play

The UX

The API

  • Query based on id and radius
  • Represent the graph by ids of parents, spouses, siblings and children
{ "180280": {
  "NAME": "Albert /Einstein/",
  "parents": [ 186959, 183736 ],
  "spouses": [ 188038, 185975, 192031],
  "sibilings": [ 188428 ],
  "children": [ 177829, 187340, 191684 ],
  "id": "@I6680@",
  "SEX": "M",
  "BIRT_PLAC": "Ulm, Badden Wurttemberg, Germany",
  "NAME_GIVN": "Albert",
  "tree_id": "641C1040-12D3-48DA-ABFA-5E723EE6C011",
  "OCCU": "Professor, Physicist and Author of the theory of relativity",
  "DEAT_PLAC": "Princeton New Jersey",
  "FAMS": "@F2098@",
  "NAME_SURN": "Einstein",
  "FAMC": "@F2097@",
  "DEAT_DATE": "1955",
  "BIRT_DATE": "14 MAR 1879",
  ...
} ...}
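
Given records in this shape, "id and radius" can be read as a breadth-first walk over the relation lists. The sketch below runs on an in-memory dict keyed by integer node id, which is an assumption: the real API answers from Neo4j. The relation key names are taken from the bullet above, and `within_radius` is a hypothetical helper.

```python
from collections import deque

RELATION_KEYS = ("parents", "spouses", "siblings", "children")


def within_radius(records, start_id, radius):
    """Map every id reachable within `radius` hops to its distance from start_id."""
    seen = {start_id: 0}
    queue = deque([start_id])
    while queue:
        current = queue.popleft()
        distance = seen[current]
        if distance >= radius:
            continue  # do not expand past the requested radius
        for key in RELATION_KEYS:
            for neighbour in records.get(current, {}).get(key, []):
                if neighbour not in seen:
                    seen[neighbour] = distance + 1
                    queue.append(neighbour)
    return seen


# A tiny made-up tree: 1 has parents 2 and 3 and child 4; 5 is a grandparent.
records = {
    1: {"parents": [2, 3], "children": [4]},
    2: {"parents": [5], "children": [1]},
    3: {"children": [1]},
    4: {"parents": [1]},
    5: {"children": [2]},
}
```

With these records, a radius of 1 around id 1 reaches the parents and the child, and a radius of 2 also reaches the grandparent.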

            
tx = graph.cypher.begin()
# pass the id as a Cypher parameter rather than formatting it into the query
params = {"id": individual_id}
tx.append("MATCH (n) WHERE ID(n)={id} RETURN n", params)
# parents
tx.append("""
    MATCH (n)<-[:FATHER_OF|:MOTHER_OF]-(r)
    WHERE ID(n)={id}
    RETURN ID(r)
""", params)
# spouses
tx.append("""
    MATCH (n)-[:SPOUSE]-(r)
    WHERE ID(n)={id}
    RETURN ID(r)
""", params)
# siblings: other children of the same parent
tx.append("""
    MATCH (n)<-[:FATHER_OF|:MOTHER_OF]-(p)-[:FATHER_OF|:MOTHER_OF]->(r)
    WHERE ID(n)={id}
    RETURN DISTINCT ID(r)
""", params)
# children
tx.append("""
    MATCH (n)-[:FATHER_OF|:MOTHER_OF]->(r)
    WHERE ID(n)={id}
    RETURN ID(r)
""", params)
# send all five statements in one round trip; one result set per statement
results = tx.commit()
            

Q&A
