Mendoza - The totally non-human readable diff format for structured JSON documents
Written by Magnus Holm
When we started work on the recently released feature Review Changes, we needed a way to keep a significant part of the edit history of a document in the browser memory to be able to respond quickly to different user interface states. As the user picked various document versions to compare we wanted to be able to quickly reconstruct a specific section of the history of a document.
For text diffs, we use the diff-match-patch format, and we just assumed someone would have implemented a similarly efficient and compact diff format for JSON documents, but no such luck. If we wanted a general JSON diff format that was super compact and fast to apply, we would have to invent it ourselves. And thus, Mendoza, the totally non-human readable diff format for structured JSON documents, was born.
Mendoza is:
- Lightweight JSON format
- A flexible format that can accommodate more advanced encodings in the future
- As a Go library for encoding and decoding
- A JavaScript library for decoding
- Efficient handling of the renaming of fields
- Efficient handling of reordering of arrays
- Not designed to be human-readable
Mendoza differs (hah!) from normal diffs as they are:
- Made for humans to read and understand and based on simple operations (like keep, insert, and delete text)
- Possible to apply even if the source has changed a bit by including some of the contexts around every part that has changed
- Designed for text, and not structured documents
Now, this is great when you are collaborating with humans on code development and use something like git to track your changes. What it isn’t great for, however, is expressing differences between structured documents (such as JSON) in a compact manner that can be efficiently transferred over the network and parsed in JavaScript inside of browsers.
Most diffs aren't meant for machines
Most diff formats are made to be human-readable. Take these two documents, where a key and the array have some changes between them:
{
"name": "Bob",
"EXPERIMENTAL_foobar": {"key": "123"},
"values": [
{"_key": "abc"},
{"_key": "def"},
{"_key": "ghi"}
]
}
{
"name": "Bob",
"foobar": {"key": "123"},
"values": [
{"_key": "ghi"},
{"_key": "abc"},
{"_key": "def"}
]
}
If these where two commits, the Git diff between them would be expressed like this:
--- left.json 2020-10-06 21:49:06.000000000 +0200
+++ right.json 2020-10-06 21:49:35.000000000 +0200
@@ -1,9 +1,9 @@
{
"name": "Bob",
- "EXPERIMENTAL_foobar": {"key": "123"},
+ "foobar": {"key": "123"},
"values": [
+ {"_key": "ghi"},
{"_key": "abc"},
- {"_key": "def"},
- {"_key": "ghi"}
+ {"_key": "def"}
]
-}
+}
This makes it somewhat practical for humans to understand what is going on when the latter change is applied. But as you can see, in terms of pure data, there is a lot of repetition going on. and expressing all changes with only plusses and minuses isn't very efficient.
The same diff with Mendoza is expressed like this:
[19,0,10,0,14,"foobar",11,2,21,2,3,21,0,2,15]
Mendoza constructs a minimal recipe for transforming a document into another. All it really does is to compare two JSON documents and figure out the most minimal way to express their difference as strings and integers in an array. You can use this difference to reconstruct the first document to the other.
How to read a Mendoza patch (even though you shouldn't)
A Mendoza patch consists of an array of integers and strings. The integers are opcodes (short for “operation codes”), 8-bit numbers that correspond to an operation. Opcodes take parameters as strings, positive numbers, or JSON values (that is: the actual data that is changing). The list of available opcodes is as follows, notice that 10-18 are composites of the preceding ones:
- 0
Value
- 1
Copy
- 2
Blank
- 3
ReturnIntoArray
- 4
ReturnIntoObject
- 5
ReturnIntoObjectSameKey
- 6
PushField
- 7
PushElement
- 8
PushParent
- 9
Pop
- 10
PushFieldCopy
- 11
PushFieldBlank
- 12
PushElementCopy
- 13
PushFieldBlank
- 14
ReturnIntoObjectPop
- 15
ReturnIntoObjectSameKeyPop
- 16
ReturnIntoArrayPop
- 17
ObjectSetFieldValue
- 18
ObjectCopyField
- 19
ObjectDeleteField
- 20
ArrayAppendValue
- 21
ArrayAppendSlice
- 22
StringAppendString
Mendoza reads these opcodes from the patch and produces the resulting document from them. Depending on the patch, Mendoza might choose not to strictly follow the opcodes but take a simpler path. If every field and value has changed, for example, it’s more efficient just to replace the whole document with the new data without going through all the operations. Or if you have a list of objects where one has moved to another position and changed a key-value, Mendoza will manage to go back to the original and represent the change in a cheap way.
Run your own Mendoza
Mendoza is implemented in Go and can be found in this GitHub repository. We have also made a parser for Mendoza patches in JavaScript, that you can use in your own application.
Of course, you can dive into the source code for the Sanity Studio and explore how Mendoza is used there. If you want a slightly simpler use-case, you can also check out how we’re using Mendoza to simulate a part of Sanity’s real-time datastore in the browser to power the real-time preview for Next.js.
Why “Mendoza”?
Naming a project is always difficult. Since this project is focused on representing changes between JSON documents I naturally started thinking about names like "JSON differ, JSON changes, JSON patch, …". However, most of these names have already been used by existing projects. While I was muttering "JSON, JSON, JSON" it suddenly turned into "JSON, JSON, Jason, Jason Mendoza".
Jason Mendoza is a character from the show The Good Place, and while this project has little in common with the stupidest DJ from Florida, at least it's short and catchy.
Some caveats
Since a Mendoza patch is just describing the effect of a change it is also limited to work for the documents it was based on. It doesn’t come with guarantees for consistency if the document you apply it on has changed from the original in meanwhile. This is one of the tradeoffs that we needed to do to make it really compressed.