Cellular: – A proposal for better spreadsheets – James Geddes



Cellular: – A proposal for better spreadsheets – James Geddes

0 0


intro-to-cellular-talk

An introduction to the cellular project

On Github triangle-man / intro-to-cellular-talk

Cellular:

A proposal for better spreadsheets

James Geddes

May 2016

The West Coast Mainline débâcle

Photo: Aaron Roberts https://www.flickr.com/photos/aaronsrailwayphots/ (CC BY-NC 2.0)

In 2012, the franchise to run the train service on the West Coast intercity routes, at that time run by Virgin, was sent out to tender. Bidding companies submitted details of their bid, including detailed financial projections, to the Department for Transport. A company called FirstGroup won the bid and Virgin promptly launched a legal challenge.

As a result of the challenge, the award was scrapped. Three civil servants were suspended. Apparently at least £40m was paid to the bidders in compensation. A review found "significant technical flaws" in the way the bidding process was run.

At the heart of those technical flaws appears to have been a large spreadsheet.

It's not just government

Here's another example.

There were several problems with Reinhart and Rogoff's paper, some of which were simply debatable assumptions, and these had a greater effect on the result than the spreadsheet error. However, it was unarguable that there was a spreadsheet error. It was an illuminating one, and we'll take a look at it later.

Carmen M. Reinhart, Kenneth S. Rogoff (2010). "Growth in a Time of Debt". American Economic Review 100 (2): 573–78.

“George Osborne’s favourite ‘godfathers of austerity’ economists admit to making error in research” — the Mirror

“The Rogoff-Reinhart data scandal reminds us economists aren’t gods” — The Guardian

“The Excel Depression” — Paul Krugman

And some limitations are fundamental

Source: DECC, Performance and Impact of the Feed-In Tariff Scheme: Review of Evidence

Where is the uncertainty?

What should we do in year 2?
Cartoon by Randall Munroe, http://xkcd.com/1667/ (cc by-nc 2.5)

Hello. My name is James Geddes and I have some bad news for you: the world runs on Excel.

From 2002 to 2009 I was a strategy consultant. I helped some of the largest companies in the world make decisions. I helped an airline open a new transatlantic route; I helped a mining company improve the performance of their mines; I even had the privilege of helping a chain of pubs set the price of beer.

How did I know what to advise our clients? I built models. A model is a mathematical simplification of the world. It keeps you honest. It lets you test out your ideas. If I make such-and-such a change, what will happen to my profits?

I worked for one of the most prestigious stategy consulting firms. We pride ourselves on our quantitative focus. We must use some pretty amazing modelling tools, right? Well, we used Excel. Excel is pretty amazing in some ways -- it's amazing because everyone can use it. It's less amazing in other ways.

From 2009 to 2014 I worked at the Department of Energy and Climate Change. I worked with a team attempting to answer the question, how do we reduce the UK's greenhouse gas emissions by 80% over the next 40 years? To answer this question, we built a model: I was the architect of this model although you could say that the godfather of the model was David MacKay. The model simulated the entire energy system of the UK for 40 years. We must have used some pretty complex software, right? Well, we used Excel. It was a pretty sophisticated Excel model in some ways. Less so in others.

Directly as a result of the West Coast mainline debacle, the government produced a review -- the MacPherson review -- to explain how to build and quality assure models properly. That review cited the 2050 model as one of its examples of how to do it right.

Government runs on Excel. That's what government analysts use. Often, of course, analysis is outsourced to a technical consulting company. They might use linear programming or something. But in government: Excel.

Later on, we helped other governments build their own versions of the model. China build one, India built one, even Wallonia built one. Each time we helped a country build their model we would reconstruct our enormous Excel spreadsheet, painfully making hundreds of manual changes, introducing errors as we did so.

Well, I've had enough. I never want to build another Excel model again. Cellular is my plan.

Visicalc (1979) was the “killer app” for the Apple II. You bought an Apple II because you wanted to run Visicalc. Lotus 1-2-3 was the killer app for PCs. Microsoft Multiplan -- 1982 Lotus 1-2-3 -- 1983 Excel Mac -- 1985 Excel Windows -- 1987 Quattro Pro -- 1988 (Intended to compete with Lotus, still available from Corel(!)) Improv -- 1991 (Innovative -- separated data, view, and formulas) Analytica -- 1996 Google -- 2006

Background to spreadsheet modelling

The challenge of doing better

Cellular

Uncertainty

Background to spreadsheet modelling

VisiCalc: the first ‘killer app’

Screenshot of VisiCalc on the Apple II. Photo: apple2history.org

A quick word here about what I'm not going to discuss. VisiCalc did not have any sort of embedded programming environment. Excel does, however: it has Visual Basic for Applications, VBA. VBA is a complete programming language. Any of the restrictions I'm going to discuss could I suppose be worked around by sufficiently clever use of VBA. Indeed, I've built models with VBA -- particularly in order to automate some part of a workflow so that end-users did not have perform a series of manual steps.

I'm not going to make a strong argument here. Let me just say that I have never seen a model which includes VBA which is at all comprehensible. In terms of clarity and robustness, VBA has in my experience only ever made things worse.

Excel has been dominant for 20 years

A history of spreadsheets. Source: Google ngrams.

Excel has, perhaps, on the order of 750 million users.

The challenge of doing better

Reinhart and Rogoff’s spreadsheet. Source: Thomas Herndon, Michael Ash, and Robert Pollin via Mike Konczal
  • NB: Spreadsheet error was not the only issue taken with the paper. But it's a good example of the sort of errors that occur.

  • First, some structures that we see often in a spreadsheet:
  • A set of values down the left; a set of values across the top; and in the intersections values whose meaning is codified by the product of the row and column headers.

    There are many ways to lay out a spreadsheet; this is one of them. It has the virtue of being structured and of conforming to the two dimensions available.

  • A expression whose arguments are a set of related data.

  • There are also some problems common to spreadsheets:
  • Metadata (formatting) used to convey meaning -- where is the key? Is it important?

  • Inputs, intermediate values, and outputs mixed up. A well-designed spreadsheet has a linear ordering of data flow: eg, top to bottom. Here is not clear what the "maximum" etc fields at the top are.

  • Hidden columns. What's going on in there?

  • And of course the major problem with this spreadsheet, which is a simple formula error.

  • How can we fix this kind of error?

How do we make good software anyway?

The art of programming is the art of organising complexity [...] — Edsger W. Dijkstra (1930–2002)

Controlling complexity is the essence of computer programming. — Brian Kernighan (1942–)

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. — C. A. R. Hoare (1934–)

Okay, so how do we control complexity?

Modularity and abstraction

Complex Modular Abstract Decompose the system into separate, weakly-coupled parts, with well-defined interfaces between them.

“Abstraction principle: Each significant piece of functionality in a program should be implemented in just one place in the source code. Where similar functions are carried out by distinct pieces of code, it is generally beneficial to combine them into one by abstracting out the varying parts.”

— Benjamin C. Pierce, Types and Programming Languages

Let me outline for you some of the challenges faced by spreadsheet modellers. I'm going to spend some time on this slide and the next one.

Consider in particular three good practices in software development. There are many practices, of course, such as: unit testing, requirements specification, coding conventions, and what have you; but these two seem to ne to get to the essence of programming as opposed to an arbitrary technical design activity.

First of all, if you have a complicated problem, you should break it down into subproblems, each of which is simpler. Ideally, each subpart would do one, conceptually coherent, thing. Also, the connections between the subparts should be as simple as possible -- they should be "weakly coupled." All this means that one can worry about one subpart at a time.

Second, there's a principle that's often expressed as "Don't repeat yourself." Any time you find yourself doing thing same thing over and over: don't. Write it once, in one place, and re-use the code you've written where you need it. Again, this means fewer places to look and therefore fewer things to worry about at once.

This second step often requires some kind of abstraction or generalisation. It's not always obvious that one is doing the same thing. Often, in fact, one is doing "the same thing, mutatis mutandis, and then the goal is to write a more general model that can be reused in specific circumstances.

Spreadsheets kind of support modularity

  • Worksheets are often used to separate concerns

  • Columns are often used to group elements of a repeated calculation

  • But: it's hard to define the "interfaces"

    =IFERROR(INDEX(INDIRECT($C10 & ".Outputs[" & this.Year & "]"),
               MATCH(G$5, INDIRECT($C10 & ".Outputs[Vector]"), 0)), 0)

In some sense, Excel does quite well here. Of course, it's up to the modeller to correctly choose the subproblems, but once she does so, Excel provides some presentational techniques to help manage the separation. One can place different tasks on different worksheets, for example, or, for simple submodels, in different rows or columns.

Most good modellers will do precisely this. Indeed, organisations typically set up standards for the use of each worksheet: here's an example of one such.

Still, it's not great. It's not enough simply to separate different parts of the model: one must define the ways in which the parts interact. Most programming languages have some facility for constructing the interface between components. That facility will typically ensure that the components cannot be used except through the interface, a constraint known as "information hiding" -- not keeping secrets, but making sure that it's clear how and when one part of the model interacts with another. In Excel one must enforce such separation through convention and practice. For example, in the 2050 Calculator, we made a rule that inputs to a worksheet must exist only in a defined block at the top of the worksheet; when those values were needed somewhere in the worksheet, they were taken from that block at the top, not directly from source.

And perhaps this principle finds its most useful expression in the design of "libraries": pre-built software components that perform some task commonly used in other programs. Almost every Excel model is built de novo. There's no "import workbook". (Well, you can copy and paste but that never leads anywhere good.) Real models of the real world involve physical units and every spreadsheet that uses units has an ad hoc solution to the problem. It would be nice to solve this problem once and allow modellers to re-use that solution. It would be even nicer if, when we discovered bugs in or made improvements to our common solution, those changes could painlessly be incorporated in users' models.

You know -- just to say -- looking at the third example, isn't it absolutely clear that if you can do that -- surely, surely you could write some code!

Spreadsheets do not support abstraction

  • Operations are repeated

  • Large formulae are repeated

    =IFERROR(INDEX(INDIRECT($C10 & ".Outputs[" & this.Year & "]"),
                   MATCH(G$5, INDIRECT($C10 & ".Outputs[Vector]"), 0)), 0)
  • No structured data, higher-order functions, ...

One often wishes to abstract some calculation and the re-use it. It is very hard to do this in Excel. About the best we can do is copy the calculation from one cell to another and make sure that references to other cells are made relative to the copied cell. We can assume that this is what was done, for example, in the Reinhart and Rogoff model.

Why is this a problem? It's a problem because we are repeating ourselves. Once we repeat ourselves, inevitably one of those repetitions will be wrong. There's no way to tell Excel, "look, this is the same calculation, so make sure you really are computing the same calculation."

(Note to experienced modellers: array formulae are a thing that exists, of course. In some sense, array formulae are one of the most "Excel-like" things in Excel. But as you know, they have their own problems. And VBA also exists but -- at least in my view -- there is a huge impedance mismatch between the "Excel" way of doing things and the "VBA" way of doing things.)

Repeating larger blocks is much harder. Typically the problem is that you want to repeat something mutatis mutandis and so the question is, how do you explain which mutandis must be mutatis? Here's an example from the 2050 Calculator.

Why did I write this monstrosity? So that an entire worksheet could be reproduced from a copy of another one, whilst making sure that the copy "knew" it was a copy.

Look aActually, this example is a good illustration of the challenges of the third principle. What does all this palavar mean? In a spreadsheet, every expression is written at the lowest level -- the only level. To work out what something is trying to do it's necessary to, well, work it out. We get around this with conventions -- naming conventions, formatting conventions, structure conventions -- but it's a very weak solution.

This formula looks up a particular energy flow from a particular sector for a particular year, making sure the result is zero if that kind of energy isn't produced or consumed by that sector. But you wouldn't know it to look at this.

It's worse than you think. Because of a flaw in the way certain references are implemented in Excel, one has to encode them in character strings! That, right there, will be -- and was -- an enormous source of bugs when you want to create a similar model.

So our task is clear: figure out how to introduce these development practices into a spreadsheet paradigm. But we must be careful.

So why do people use spreadsheets?

Spreadsheets are incremental

  • Start small, gradually get better
  • Even the basic stuff is useful (eg, making tables)
  • No big jumps in learning

Spreadsheets are immediate

  • Changes are immediately reflected in the output
  • Faster learning, quicker debugging

... and, perhaps, because abstraction is not easy

Here's our main problem. There's a reason that we can't apply those three software development practices in Excel, which is that spreadsheets don't support them. And the reason for that is that spreadsheets do support something else: they support people.

Spreadsheets are incremental. You can start small and build up in easy steps. And even the small stuff is useful.

They are immediate. If you like, the edit-compile-test loop is very, very fast. That makes learning easier and it makes debugging quicker.

And of course, they are not abstract, and perhaps abstraction is hard. Certainly, since there are no abstractions there are, a fortiori, no leaky abstractions.

So if we are going to make something that people will use, we had better keep those features of spreadsheets that made them usable in the first place.

You know, there are two kinds of complexity in the world.

FIXME: Story of Ptolemaic epicycles (AD 100 ish) vs. Newton's laws (late 1600s) resulting in Keplerian motion (early 1610-1620s)

The plan

Observe that spreadsheets are already a lot like a simplistic programming language

Write this language explicitly, so that spreadsheets are written as programs

  • whilst keeping the benefit/cost tradeoff positive!

Generalise the language to include more powerful features

  • without losing the ease of use!

Cellular

Every spreadsheet is a term graph

   |       A         |      B     |
===+=================+============+
 1 | Salary          |      21000 |
 2 | Allowance       |      11000 |
 3 | Taxable income  |  = B1 - B2 |
 4 | Basic rate      |        0.2 |
 5 | Tax payable     |  = B3 * B4 |

Very simple term graph: only values and built-in functions. No user-defined types, no lambda abstractions, no first-class functions, no iteration or recursion.

Here is a simple model written as a spreadsheet. It's a typical spreadsheet model. Computation proceeds from inputs to outputs; in this case from top-to-bottom. It's considered good design practice to lay out a spreadsheet in this way to aid in comprehension. In some sense, it's a primitive form of information hiding: if you want to find the values that enter into a particular computation you know that must look above the cell containing the computation.

The cells of any spreadsheet can be arranged as the nodes in a directed graph. A node in this graph is either a value or an expression of the values in other nodes and the arrows of the graph show which cells depend on which others. I am going to leave out the comments for these examples but clearly part of the challenge will be to figure out how to incorporate them since they play a key role in real spreadsheets.

The graph is acyclic: Excel will complain if the spreadsheet contains what it calls "circular references". It is possible to force Excel to allow cycles, but models incorporating such features are thought of as something of a hack.

Some cells in this example have no dependencies nor are dependent on other cells: these are the labels, which I think are akin to comments in a program.

Well, now we have an expression graph. In fact, we have what looks very like an abstract symantic graph, such as built by a compiler. In which case, we ought to be able to write down one or more languages which "compile to" precisely this expression graph.

This language is extremely simple: it is a purely functional language with neither iteration nor recursion; nor are functions first-class values.

Term graphs are programs

Topological sort: Then write out each step:
B1 := 21000
B2 := 11000
B3 := B1 - B2
B4 := 0.20
B5 := B3 * B4

Poor choice of surface syntax for users: too many definitions. But perhaps useful as a target language which can then be used to generate any particular spreadsheet (Excel, Numbers, ...)

Here's one approach, which is almost a direct translation of the spreadsheet. We take the expression graph of the spreadsheet and linearise it. Note that

Then we simply bind the value of each node to a different symbol. Here I've chosen the symbols to be precisely the names of the cells that gave rise to this graph.

Execution proceeds in a single path: there are no branches or loops. (Conditionals are expressions.)

In this view, a program consists of a series of assignments, the left-hand side of which is a previously unused symbol and the right-hand side is an expression involving primitive operators and functions and previously-named symbols.

Any directed acyclic graph can be written in this way -- it doesn't have to be a tree as in this example.

So this could be our programming language. It has the nice feature that, as a "step by step" recipe, it appears to follow the spreadsheet modeller's thought process.

Of course, we'd still need some sensible way of assigning these variables to cells: a single column spreadsheet is not particularly user friendly.

A larger problem, in my view, is that this style imposes a cognitive burden on the modeller in that it requires the modeller to worry about things they don't want to worry about: specifically, naming things. Choosing names is hard, and the beauty of spreadsheets is that you don't have to choose names: you just use an empty cell. You don't even have to remember the name later, you can just point at the cell.

That barrier seems to me to be too high. My intuition is that spreadsheet modellers will find it too hard to create names. And why should they? This language is too low-level.

However, I think it is a good final target. It's like "machine code for spreadsheets", in that, with the exception of layout, we could convert this into any particular spreadsheet: Excel, LibreOffice, Numbers, Google docs, and so on.

I propose the name "grid" for this language (which is still somewhat undefined).

A possible language hierarchy

cell Full language, supporting abstraction and modularity. Ideally, a superset of nocell. nocell A language with the same expressive power as a spreadsheet (and no more) but a more convenient surface syntax. grid An intermediate representation of a spreadsheet, from which a number of backends could produce specific formats.

(No probabilistic elements yet.)

What's the surface syntax of cell/nocell?

  • Perhaps a conventional function syntax?

    times(0.20, minus(21000, 11000))
  • But spreadsheet modellers tend to think in terms of a "data-flow" model: Here's some data; here's the operation; here's the result; here's an operation on that ...

  • cf. the R package ‘magrittr’, which allows one to write, eg,

    data %>% filter() %>% summarise()
  • => Can we retain this style of interactive / incremental programming?

What about conventional syntax?

Well, it has the benefit over the previous attempt that we don't have to make up names for things. And it looks like a conventional (functional) programming language.

But in fact I think this is a problem. Whatever cellular is, it probably had better not look like a conventional programming language. Spreadsheet programmers don't get programming languages, otherwise they'd be programmers.

My hypothesis here is that -- if we go back to the original tax calculation -- spreadsheet programmers think in a dataflow model. They start with the data (because it's concrete!) then that data is transformed in some way, then it's transformed in some other way, until we get to the answer.

In contrast, the standard nested functional application is precisely the other way round: it says do X to the result of doing Y to the result of doing Z to A. Of course, we could unwrap the functions by naming the arguments -- but then we're back to the naming problem.

Alternative forms of surface syntax

Why not just write the spreadsheet as if you were writing a spreadsheet?

Postfix Spreadsheet-y
21000      ; Salary
11000      ; Personal allowance
-          ; = Taxable income
0.20       ; Basic rate
*          ; = Tax payable 
21000      ; Salary
11000 -    ; less, Personal allowance
=          ; Taxable income
 0.20 *    ; Basic rate
=          ; Tax payable
  • No names (‘;’ means ‘comment’)

  • Order parallels order of construction of spreadsheet

  • Lends itself to building “collections of data” in an interactive manner (see later)

  • But postfix languages haven't caught on. (Why not?)
  • Reminiscent of a printing calculator

  • ‘=’ here means “don’t do anything, but let me label this cell in the comments”

  • Other ideas for usable syntax:
    • ‘=’ means “function application” (like Excel?!)
    • ‘(...)’ means “... is in infix notation”

The thing I have to say here is, I think we just need to get something working and then iterate,

Open questions

Does cell “compile” to nocell?

  • Unwrap loops, expand functions, ...

Or is it a “nocell-constructing” language?

  • “Spreadsheets as values”

  • It would be ideal if, every time the user hits ‘return’, the spreadsheet-so-far is generated, and that spreadsheet is the same as the one obtained by compiling/running the program so far.

Layout: the other half of the challenge

It's clear that we could translate the operational part of programs written in a simple language like nocell into a spreadsheet simply by converting each node in the tree to a cell. It's reasonably plausible that we could make a more expressive language which is then evaluated to produce a nocell program by instantiating function bodies explicitly, unravelling loops, and so on.

But the resulting spreadsheet is likely to be unreadable and that would defeat the whole purpose of cellular.

So a huge challenge is as follows: how do we translate the semantic information inherent in the language cell to a well-structured spreadsheet?

Some things are clear from traditional principles of good spreadsheet design: the values of function application should be in cells below or to the right of their arguments, for example. The equivalent of vector datatypes should be in contiguous blocks of cells if possible.

Plausibly, most high-level operations that are used by spreadsheet models are map/reduce type operations. Those should be laid out in contiguous cells as well.

Some things are easy: we know what counts as sub-calculations, so we can provide appropriate formatting to each cell.

This is a challenge -- but I also think it's an enormous opportunity.

Making nice-looking, well-structured tables out of data is notoriously hard. A recent article in Significance called in "making sunbeams from cucumbers". If we could provide a mechanism that made it easier for people to make tables, that would be a reason to use the software, and could overcome analyst's natural reluctance to move beyond Excel.

I don't have an answer, but it feels like an answerable task. I think perhaps we should be thinking in terms of types, as in data types. It feels to me that tables are a visual representation of algebraic data types. To be explored.

Summary

cell nocell grid Semantics Functional, first-class functions, rich set of data types, user-defined data types, Purely functional, primitive types only (numeric, string, boolean, perhaps vector), built-in functions only, names. Like nocell, but every expression assigned to a cell. Names changed to named ranges. Surface syntax ?
100
200
300
sum
A1 := 100
A2 := 200
A3 := 300
A4 := SUM(A1, A2, A3)
Other Structures annotated with information to inform layout. Cells annotated with information to determine cell style.

Adding uncertainty to models

PPL = Probabilistic Programming Language
  • Use an existing language for stochastic simulation and inference, restricting class of models to those expressible in nocell

  • Add stochastic data with a few new VBA functions (eg, =NORMAL(0,1))

  • Output of PPL modelled as Monte Carlo runs on separate worksheet, with named ranges, eg, =MEDIAN(param1)

Challenges

  • Must make at least some tasks easier, eg,

    • Formatting tables (always a pain), "LaTeX for spreadsheets";

    • Make probabilistic calculations possible.

  • Must extend the "natural" language of spreadsheets

  • Must allow incremental transition from concrete to abstract

  • Must be interactive

    • eg, a "REPL" that generates a spreadsheet each time

Other proposals

  • User-centred functions (SPJ)

  • Gencel (templates for Excel)

  • ModelSheet Authoring (specific kinds of models)

  • Tabular

  • Improv

  • Scenarios (FW)

All of these use the spreadsheet as the GUI

One good question to ask is: what haven't these succeeded?

  • Maybe they will, just not yet?

  • They involve changing the spreadsheet model

  • They involve extensions to Excel (breaks collaboration)

UNUSED SLIDES

Other kinds of models

  • Maximisation

  • Dynamic simulation

  • Agent-based simulation

  • Equilibrium-finding

  • “Full” Bayesian inference

1
Cellular: A proposal for better spreadsheets James Geddes May 2016