about the project
In the spring of 2020, as part of the spring practice at the Computer Science Center, I was developing a new design for the OCaml programming language under the strict guidance of Dmitry Kosarev .
Why OCaml
OCaml is one of the most successful and developed implementations of industrial programming syncretism (hence multiparadigm, multiplatform, very fast compiler, high productivity of generated code) and mathematics (hence state-of-the-art type system with powerful implementation of type inference, expressiveness and extensibility language, closeness to mathematical notation and semantics).
At the same time, the language community is very selective and slowly adds to the language only highly demanded constructions, provided that they do not introduce restrictions into the existing language. Therefore, the core of the language is very small and intuitive, and OCaml is used with pleasure by both industrial developers and, say, mathematicians from the Department of Higher Algebra and Number Theory of St. Petersburg State University .
For a deeper dive into the topic, I suggest taking a look at the articles OCaml for the masses and Why OCaml .
Currently, work is underway to implement a multicore system for OCaml, coupled with algebraic effects, which will simultaneously increase the overall performance of the language and eliminate the existing limitations of the type system associated with the fact that the language allows impure calculations.
Pattern matching and active patterns
My work has focused mainly on the pattern matching construct, which is widely used in functional programming languages.
To illustrate, consider a simple example of rotating a node in a binary tree. In the most common imperative style, the code would probably look like this:
type 'a tree =
| Node of 'a tree * 'a * 'a tree
| Nil
let rotate_left' t =
if is_node t
then
let a = get_left t in
let p = get_value t in
let r = get_right t in
if is_node r
then
let b = get_left t in
let q = get_value t in
let c = get_right t in
Node(Node(a,p,b),q,c)
else t
else t
And here's the exact same code written using the pattern matching construct:
let rotate_left = function
| Node(a, p, Node(b, q, c)) -> Node(Node(a, p, b), q, c)
| Node _ | Nil as x -> x
When using this design, we have the following advantages:
- high expressiveness;
- completeness checking , which is a critical property for correctness checking and program refactoring;
- efficient compilation schemes.
Completeness check means that the compiler, knowing the type definition, can check for each match whether it is true that all possible alternatives have been parsed and that there are no unreachable branches, no matter how complex the samples and no matter how they intersect with each other. Thus, if you change the type definition (adding new alternatives, removing or changing existing ones), the compiler will kindly give you all the places in the code that were directly affected.
For example, if I added new constructs to the syntax tree, the compiler will show me the AST typing code up to the function body, where I need to write the typing code for new constructions:
This property makes OCaml very resistant to refactoring and other code changes.
Despite all the described advantages, there is one very serious limitation on applicability. Have you noticed it? The type definition must be public (so that the compiler can show the alternatives that make it up). And this, of course, immediately breaks down even the simplest abstractions. For example, if we want to define the simplest list interface, it is impossible to tell in advance which type definition to export:
module type List = sig
type 'a t (* = ? *)
val cons: 'a -> 'a t -> 'a t
val empty: 'a t
val head: 'a t -> ('a * 'a t) option
end
However, this problem is not fundamental, and, as noted back in 1987, it is possible to achieve pattern matching on abstract types.
Formulation of the problem
Since 1987, many different designs have been presented in the literature to solve this problem, here are just a few of them:
By the beginning of the project, work had already been done on a reasonable and objective choice of a specific solution for implementation , the most advantageous was the Active patterns extension implemented in the F # language.
The goal of the project was to start implementing Active Patterns for the OCaml compiler and get along as far as possible.
Active patterns
The very idea of ββActive patterns (as well as similar extensions) is very simple: since abstraction is achieved by hiding the implementation inside the function, it is necessary to allow a call to a function inside the pattern matching that would convert the unknown value of the abstract data type to a known list of alternatives. Active patterns encode this list of alternatives right inside the function name. So, the following function must be added to the interface of the list above
(|Cons|Nil|)
:
module type List = sig
type 'a t
val (|Cons|Nil|): 'a t -> ('a * 'a t, unit) choice2
val cons: 'a -> 'a t -> 'a t
val empty: 'a t
val head: 'a t -> ('a * 'a t) option
end
The result is an anonymous sum type
choice2
that has the following definition (there are similar generated types up to choice32
):
type ('a, 'b) choice2 =
| Choice2_1 of 'a
| Choice2_2 of 'b
Thus, the function
(|Cons|Nil|)
will convert the list to one of two alternatives: either to a pair of head and tail of the list, or to an empty alternative meaning that the list was empty.
Defining such a function for a standard list is trivial and would look like this:
let (|Cons|Nil|) = function
| [] -> Nil
| x :: xs -> Cons(x, xs)
As an example of use, consider a function that removes consecutive duplicates in a list:
(* destutter [1;1;4;3;3;2] = [1;4;3;2] *)
let rec destutter = function
| Nil -> nil
| Cons(x, Nil) -> cons x empty
| Cons(x, Cons(y, rest)) ->
if x = y
then destutter (cons y rest)
else cons x (destutter (cons y rest))
Note that all the benefits of pattern matching are retained: the matching syntax is the same and the completeness checks can work in full. Efficiently compiling such a solution is beyond the scope of this overview, but it is also possible.
Progress
As part of this work, it was possible to implement parsing and typing of the extension for the OCaml compiler version 4.09, the results are presented here .
The compiler parser is implemented using the advanced Menhir parser generator . Menhir has a fairly complete and detailed manual , but even with him it was not always clear how you can set the inference rule, and how not and why. The resulting parser patch is quite small and simple, but the path to it lay through 10-15 shift-reduce and reduce-reduce conflicts, the parsing and fixing of which took some time:
I would like to pay tribute to the Menhir developers and thank them for their work on detailing and clarifying the errors. Once the parser-generator failed to illustrate the conflict and had to disassemble it using the constructed automaton for 1500 states. This required, of course, an order of magnitude more time.
Extension typing was especially difficult. The typing code is about 37,000 lines long and is almost undocumented, making it difficult for a beginner to figure it out. I was saved by an article by Oleg Kiselev , which explains the key aspects of implementation.
Another conclusion that I made for myself is not to forget to use the old versions of the project. For example, here's a quantitative comparison of the same piece of typing in the 2019 and 2005 versions:
The 2019 version contains compiler analysis and warnings, as well as additional technical details that I was not interested in. To understand, I just had to look at the 2005 version, which contains only key actions.
Finally, the main conclusion I made during my work is confirmation that documentation for open-source projects is critically important. As expressive as the language is, the source code can only tell how a function behaves, not what it does or why it does it the way it does. It is very difficult to read the call chains
type_self_pattern
-> type_cases
-> t ype_pat
-> type_pat'
-> type_pat_aux
and figure out which function you need; or just one parameter nameconstr
guess which constructors and why should be written here.
The need to look at the use cases every time slows down the programmer and gets tired very quickly: let me remind you the rule "seven, plus or minus two" - exactly as many objects a person can, on average, simultaneously keep in mind. And when you finally understand the meaning of the parameter inside the sixth nested function and suddenly realize that you do not remember why you needed it, or it turns out that you did not need it, it becomes very sad because of the amount of time spent.
Conclusion
As part of a project, I was able to implement parsing and typing for Active patterns. The compiler I patched can parse and type the file with any examples of using Active patterns.
Next, you need to modify the compilation scheme (OCaml uses non-trivial optimizing compilation of pattern matching) and scheme completeness checks, which were almost completely rewritten by the OCaml compiler development team during the work on the project.
I hope this implementation will still be completed, with or without me. Despite all the difficulties, it was great to try yourself in developing an industrial compiler for your favorite language.
Author:
Alexander Bashkirov is a student of the Computer Science Center and a 4th year bachelor's degree program in Software Engineering at St. Petersburg State University, an employee of JetBrains.