Just want to give a brief update on my current status. In Oct 2019 I moved from the Bay Area, where I was working for Arista Networks, to [email protected], working for Amazon. Part of the reason I haven't been updating this blog is that the work is far busier than it was at Arista XD.
I'll have another post about this one-and-a-half-year journey at [email protected], across two organizations, two teams, and three managers.
This blog post talks about various data encoding methods and their advantages/limitations, along with protocols for transmitting them.
Efficiency is certainly one of the main concerns when comparing encoding methods. The other thing we need to care about is compatibility. Backward compatibility means that newer code can read data that was written by older code, and forward compatibility means that older code can read data that was written by newer code. Backward compatibility is normally not hard to achieve: as the author of the newer code, you know the format of the data written by older code, so you can handle it explicitly (if necessary, simply by keeping the old code around to read the old data). Forward compatibility is trickier, because it requires older code to ignore additions made by a newer version of the code.
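As a minimal sketch of forward compatibility, consider a JSON-encoded record. The reader and field names below are hypothetical, but the pattern is the standard one: an old reader only picks out the fields it knows about and silently ignores anything newer writers have added.

```python
import json

# Hypothetical "v1" reader: it only knows the fields that existed when it
# was written, and ignores anything else it finds in the payload.
def read_user_v1(raw: bytes) -> dict:
    record = json.loads(raw)
    # Keep only the fields v1 understands; unknown additions from newer
    # writers are silently dropped, which is exactly what makes this
    # reader forward compatible.
    return {"id": record["id"], "name": record.get("name", "")}

# A "v2" writer added a new optional field the v1 reader has never seen.
payload = json.dumps({"id": 7, "name": "alice", "nickname": "al"}).encode()
print(read_user_v1(payload))  # {'id': 7, 'name': 'alice'}
```

Note that this only works if the old reader is careful not to reject (or, worse, destroy on a read-modify-write cycle) fields it does not recognize; binary formats like Protocol Buffers bake this "preserve unknown fields" behavior into the encoding itself.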
Another book summary/review!
The book is arranged into these parts:
This blog post only talks about part 1.
I recently read through the book Streaming Systems, so I think it would be a good idea to write up a summary of my thoughts about it. The book was recommended by the Zhihu post 评:Streaming System(简直炸裂,强势安利) ("Review: Streaming Systems (absolutely mind-blowing, highly recommended)").
The book consists of 10 chapters. It starts off by stating that, traditionally, batch systems, streaming systems, and databases are three distinct concepts. Batch systems process a finite amount of data (bounded data) and produce accurate (strongly consistent, exactly-once) results; they typically have higher throughput and higher latency. Streaming systems deal with unbounded data and are typically not as accurate (consistent) as batch systems; indeed, streaming could be implemented by repeated batching. Databases are just persistent data stores on which one can do CRUD operations and queries. The first half of the book summarizes various techniques that modern streaming systems use to achieve goals like handling out-of-order updates based on event time and end-to-end exactly-once processing. The argument the author makes in the first half of the book is that, given recent improvements in various streaming systems (examples are MillWheel, Spark Streaming, and Flink), strong consistency can be achieved in streaming systems. As a result, streaming systems can reach parity with batch systems.
In the second half of the book, the author presents a way of thinking in which streaming systems and databases are two sides of the same coin, processors of two different forms of data, called the "Theory of Stream and Table Relativity", and shows how the techniques used in streaming systems, like triggers, watermarks, and windowing, play a role in this new unified world. The author also talks about how to extend current SQL syntax to support the unified theory (which, unfortunately, I am personally not interested in, so I just skimmed that part).
I think the core of this book, or at least where I personally found the most value, is the point of view that tables (databases) and streams are basically the same thing. This is a conclusion the author draws after many years of working on distributed systems and examining the different techniques used by modern streaming systems. This post will proceed in the reverse order of the book: the theory will be presented and explained first, and after that the various techniques will be presented to see how they fit into the big picture. I won't be going into the details of how those techniques work, though.
Cost semantics addresses the question: how long do programs run (abstractly)?
The idea of cost semantics for parallelism is that we have the concrete ability to compute simultaneously.
For sequential computation we have:
\[\frac{e_1\mapsto_{seq}e_1'}{(e_1,e_2)\mapsto_{seq}(e_1',e_2)}\] \[\frac{e_1\ val;e_2\mapsto_{seq}e_2'}{(e_1,e_2)\mapsto_{seq}(e_1,e_2')}\]For parallel computation we have:
\[\frac{e_1\mapsto_{par}e_1';e_2\mapsto_{par}e_2'}{(e_1,e_2)\mapsto_{par}(e_1',e_2')}\]Haskell is a dialect of Algol!
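The contrast between the sequential and parallel pair rules above can be captured in a toy cost model (a Python sketch, with hypothetical names): if each component of a pair takes a known number of steps, the sequential rules make the costs add up, while the parallel rule, which steps both components simultaneously, makes the cost the maximum of the two.

```python
# Toy cost model for evaluating a pair (e1, e2), where c1 and c2 are the
# step counts of the two subexpressions.

def seq_cost(c1: int, c2: int) -> int:
    # Sequential rules: e1 runs to a value first, then e2 runs.
    return c1 + c2

def par_cost(c1: int, c2: int) -> int:
    # Parallel rule: both components step at once, so the pair finishes
    # when the slower component does.
    return max(c1, c2)

print(seq_cost(3, 5))  # 8 steps sequentially
print(par_cost(3, 5))  # 5 steps in parallel
```

These two quantities are what the literature calls work (total steps) and span (critical-path length).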
MA (Modernized Algol) = PCF with a $modality$ that distinguishes expressions from commands.
\[\begin{aligned}\text{types}\ \tau&=\text{things in PCF}\mid cmd(\tau)\\ \text{expressions}\ e&=\text{things in PCF}\mid cmd(m)\\ \text{commands}\ m&=ret(e)\mid bnd(e,x.m)\mid dcl(e,a.m)\ (dcl\ a:=e\ in\ m)\mid set[a](e)\ (a:=e)\mid get[a]\ (get\ a)\end{aligned}\]The $a$’s are assignables, not variables! The $x$’s are variables! An assignable is not a form of expression of its type. An assignable is a location in memory whose contents have a type, where we write $a_1\sim\tau_1$ (not $a_1:\tau_1$). Assignables are really indices into a family of $get$ and $set$ operations; they are not values or arguments, and they are not evaluated. They are just indices, and $get$’s and $set$’s are just capabilities to get and set $a$. We can define references, i.e. &a in a real programming language, as the pair $\langle get_a,set_a\rangle$, which is just a thing that gives you access to the capabilities of getting and setting $a$.
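The "reference = pair of capabilities" idea can be sketched in Python, using closures over a hidden cell to stand in for the assignable (all names here are made up for illustration):

```python
# A reference &a modeled as the pair <get_a, set_a>: two capabilities
# closing over a single memory cell.

def new_assignable(initial):
    cell = {"contents": initial}   # the location in memory, not a value itself
    def get_a():
        return cell["contents"]    # the capability to read a
    def set_a(value):
        cell["contents"] = value   # the capability to write a
        return value
    return get_a, set_a            # the "reference" is just this pair

get_a, set_a = new_assignable(0)
set_a(42)
print(get_a())  # 42
```

Note that the cell itself never escapes: the only way to interact with it is through the two capabilities, mirroring how assignables are indices to $get$/$set$ rather than first-class values.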
Types and expressions are “pure” (they don’t depend on memory), whereas commands are “impure”.