Designing a multi-paradigm programming language. Part 1: What is it for?

Habr is a wonderful place where you can freely share your ideas, even if they look crazy. Habr has seen many home-made programming languages, and I will tell you about my own experiments in this area. My story differs from the rest in three ways. First, this is not just another programming language but a hybrid language that combines several programming paradigms. Second, one of the paradigms is rather unusual: it is intended for a declarative description of the domain model. Third, combining declarative modeling tools with traditional object-oriented or functional approaches in one language can give rise to a new style of programming, ontology-oriented programming. I plan to focus primarily on the theoretical problems and questions that I encountered, and to describe not only the result but also the process of designing such a language. There will be many reviews of technologies and scientific approaches, as well as philosophical digressions. There is a lot of material, so I will have to split it into a whole series of articles. If you are interested in such a large and complex task, get ready for a long read and an immersion in the world of computer logic and hybrid programming languages.



A brief description of the main task



The task is to create a programming language that is convenient both for describing a domain model and for working with it. The description of the model should be as natural as possible, understandable to humans and close to the software specification, yet at the same time it must be part of the code in a full-fledged programming language. To achieve this, the model will take the form of an ontology and consist of concrete facts, abstract concepts and the relations between them. Facts will describe direct knowledge about the subject area, while concepts and the logical relations between them will describe its structure.



In addition to modeling tools, the language will also need tools for preparing the initial data for the model, creating its elements dynamically, processing the results of queries to it, and creating those model elements that are more convenient to describe in algorithmic form. All of this is much easier to do by describing the sequence of computations explicitly, for example, using OOP or a functional approach.



And, of course, both parts of the language must interact closely and complement each other, so that they can be easily combined in one application and each type of problem can be solved with the most convenient tool.



I will begin my story with the question of why such a language should be created at all, why it should be a hybrid, and where it would be useful. In the next articles, I plan to give a brief overview of technologies and frameworks that combine the declarative style with the imperative or functional one. After that, I will review languages for describing ontologies, formulate the requirements and basic principles of the new hybrid language and, first of all, of its declarative component, and then describe its basic concepts and elements. Next, we will consider what problems arise when the declarative and imperative paradigms are used together and how they can be solved. We will also analyze some issues of language implementation, for example, the inference algorithm. Finally, we will look at an example of its application.



Choosing the right programming language style is an important condition for code quality



Many of us have had to maintain complex projects created by other people. It is good if the team has people who are familiar with the project code and can explain how it works, if there is documentation, and if the code is clean and understandable. But in reality it often turns out differently: the authors of the code left long before you joined the project, the documentation is missing or is fragmentary and long outdated, and the business analyst or project manager can describe the business logic of the component you need only in general terms. In this case, the cleanliness and comprehensibility of the code become critical.

Code quality has many aspects. One of them is the right choice of programming language, which should match the problem being solved. The more easily and naturally a developer can express ideas in code, the faster the problem is solved and the fewer mistakes are made. We now have a fairly large number of programming paradigms to choose from, each with its own area of application. For example, functional programming is preferable for computation-oriented applications because it provides more flexibility for structuring, combining and reusing functions that operate on data. Object-oriented programming simplifies building structures out of data and functions through encapsulation, inheritance and polymorphism, and is suitable for data-oriented applications. Logic programming is convenient for rule-based problems that work with complex, recursively defined data types such as trees and graphs, and is suitable for solving combinatorial problems. Reactive, event-driven and multi-agent programming also have their own areas of application.



Modern general-purpose programming languages can support multiple paradigms. The combination of functional and OOP paradigms has long been mainstream.



Hybrid functional logic programming also has a long history, but it has never gone beyond the academic world. The combination of logic programming with imperative OOP has received the least attention of all (I plan to talk about it in more detail in one of the next publications). Yet, in my opinion, a logical approach could be very useful in the traditional domain of OOP: server applications of corporate information systems. You just need to look at it from a slightly different angle.



Why I think the declarative programming style is underrated



I will try to substantiate my point of view.



To do this, consider what a software solution typically consists of. Its main components are the client side (desktop, mobile and web applications), the server side (a set of individual services, microservices or a monolithic application) and data management systems (relational, document-oriented, object-oriented and graph databases, caching services, search indexes). A software solution has to interact with more than just human users. Integration with external services that provide information via an API is a common task. Data sources can also include audio and video documents, natural language texts, web page content, event logs, medical data, sensor readings, etc.



On the one hand, a server application stores data in one or more databases. On the other hand, it responds to requests coming to its API endpoints, processes incoming messages and reacts to events. The structures of messages and queries almost never match the structures stored in databases. Input and output data formats are designed for external use: they are optimized for the consumer of the information and hide the complexity of the application. Stored data formats are optimized for their storage system, for example, for the relational data model. Therefore, we need an intermediate layer of concepts that connects application input and output with the data storage systems. This intermediate layer is usually called the business logic layer, and it implements the rules and principles of behavior of the domain objects.



Linking database content to application objects is not easy either. If the structure of the tables in the storage corresponds to the structure of the concepts at the application level, ORM technology can be used. But for anything more complex than access to records by primary key and CRUD operations, you have to allocate a separate layer of logic for working with the database. Typically, the database schema is made as general as possible so that different services can work with it, each of which maps this schema to its own object model. The structure of the application becomes even more confusing if it works not with one data store but with several stores of different types, or loads data from third-party sources, for example, through the APIs of other services. In this case, it is necessary to create a unified domain model and map data from the different sources to it.

In some cases, the domain model can have a complex multi-level structure. For example, in analytical reports some indicators can be built on the basis of others, which in turn serve as a source for a third group, and so on. The input data can also be semi-structured. Such data does not have a strict schema, as in the relational data model, but still contains some kind of markup that allows useful information to be extracted from it. Examples of such data are Semantic Web resources, web page parsing results, documents, event logs, sensor readings, and the results of preprocessing unstructured data such as texts, videos and images. The data schema of these sources will be built exclusively at the application level, and so will the code that converts the source data into business logic objects.



So, the application contains not only algorithms and calculations but also a large amount of information about the structure of the domain model: the structure of its concepts, their relations and hierarchy, the rules for constructing some concepts on the basis of others, the rules for transforming concepts between different layers of the application, etc. When we write documentation or a design, we describe this information declaratively, in the form of structures, diagrams, statements, definitions, rules and descriptions in natural language. It is convenient for us to think this way. Unfortunately, it is not always possible to express these descriptions in code in the same natural way.



Let's consider a small example and think about how its implementation would look in different programming paradigms



Let's say we have 2 CSV files. In the first file:



The first column contains the customer ID.

The second column contains the date.

The third column contains the invoiced amount.

The fourth column contains the payment amount.


In the second file:

The first column contains the customer ID.

The second column contains the name.

The third column contains the email address.


Let's introduce some definitions:

An invoice includes the customer ID, the date, the invoiced amount and the payment amount from the cells of one line of file 1, plus the debt amount.

The debt amount is the difference between the invoiced amount and the payment amount.

A customer is described by the customer ID, name and email address from the cells of one line of file 2.

An unpaid invoice is an invoice with a positive debt.

Invoices are linked to a customer by the customer ID value.

A debtor is a customer who has at least one unpaid invoice dated more than one month before the current date.

A malicious defaulter is a customer who has more than 3 unpaid invoices.


Further, using these definitions, it is possible to implement the logic of sending a reminder to all debtors, transmitting data about malicious defaulters to collectors, calculating a penalty on the amount of debt, compiling various reports, etc.



In functional programming languages, such business logic is implemented using a set of data structures and functions that transform them. Moreover, data structures are fundamentally separated from functions. As a result, the model, and especially such a component of it as the relations between entities, is hidden inside a set of functions and scattered across the program code. This creates a large gap between the declarative description of the model and its software implementation and makes the model harder to understand, especially if it is large.
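As a rough illustration of this point, here is a minimal sketch of the debtor example in a functional style. Python is used purely for illustration; the sample rows, the fixed "today" date, the reading of "one month" as 30 days and all function names are my assumptions and are not part of the original example.

```python
from datetime import date, timedelta
from typing import NamedTuple

# Plain immutable data structures, kept separate from the functions below.
class Invoice(NamedTuple):
    customer_id: str
    issued: date
    invoiced: float
    paid: float

class Customer(NamedTuple):
    customer_id: str
    name: str
    email: str

# Hypothetical rows standing in for the two CSV files.
INVOICES = [
    Invoice("c1", date(2024, 1, 10), 100.0, 40.0),
    Invoice("c1", date(2024, 3, 1), 50.0, 50.0),
    Invoice("c2", date(2024, 3, 20), 80.0, 0.0),
]
CUSTOMERS = [
    Customer("c1", "John", "john@example.com"),
    Customer("c2", "Mary", "mary@example.com"),
]

def debt(invoice):
    return invoice.invoiced - invoice.paid

def is_unpaid(invoice):
    return debt(invoice) > 0

def is_overdue(invoice, today):
    # Assumption: "one month" is approximated as 30 days.
    return invoice.issued < today - timedelta(days=30)

def debtors(customers, invoices, today):
    # The invoice-to-customer relation exists only inside this function body.
    overdue_ids = {i.customer_id for i in invoices
                   if is_unpaid(i) and is_overdue(i, today)}
    return [c for c in customers if c.customer_id in overdue_ids]

def malicious_defaulters(customers, invoices):
    # Count unpaid invoices per customer, then keep customers with more than 3.
    counts = {}
    for i in filter(is_unpaid, invoices):
        counts[i.customer_id] = counts.get(i.customer_id, 0) + 1
    return [c for c in customers if counts.get(c.customer_id, 0) > 3]

TODAY = date(2024, 4, 1)  # fixed so that the example is reproducible
debtor_names = [c.name for c in debtors(CUSTOMERS, INVOICES, TODAY)]
```

Each definition from the example turns into a small function, but the statement "invoices are linked to a customer by customer ID" is visible only as a set lookup inside `debtors`, which is the kind of scattering described above.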



Structuring a program in an object-oriented style helps mitigate this problem. Each domain entity is represented by an object whose data fields correspond to the entity's attributes. Relations between entities are implemented as relations between objects, partly based on the principles of OOP (inheritance, data abstraction and polymorphism) and partly using design patterns. But in most cases, relations have to be implemented by encoding them in object methods. In addition to the classes representing entities, you will also need data structures for organizing them and algorithms for filling these structures and searching for information in them.



In the example with debtors, we can write classes that describe the structure of the concepts "Invoice" and "Customer". But the logic of creating objects and linking invoice and customer objects to each other is often implemented separately, in factory classes or methods. For the concepts of debtors and unpaid invoices, separate classes are not needed at all: their objects can be obtained by filtering customers and invoices where they are needed. As a result, some concepts of the model will be implemented explicitly as classes and others implicitly, at the object level. Some of the relations between concepts will live in the methods of the corresponding classes, and others will live separately. The implementation of the model will be scattered across classes and methods and mixed with the auxiliary logic of storage, search, processing and format conversion. It will take some effort to find this model in the code and understand it.
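The following minimal sketch shows what such an object-oriented version might look like. It is written in Python for illustration only; the class and method names, the hypothetical `CustomerFactory`, the sample rows and the 30-day reading of "one month" are all my assumptions.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

@dataclass
class Invoice:
    customer_id: str
    issued: date
    invoiced: float
    paid: float

    @property
    def debt(self) -> float:
        return self.invoiced - self.paid

    def is_unpaid(self) -> bool:
        # "Unpaid invoice" is not a class; it exists only as this predicate.
        return self.debt > 0

@dataclass
class Customer:
    customer_id: str
    name: str
    email: str
    invoices: list = field(default_factory=list)

    def is_debtor(self, today: date) -> bool:
        # Assumption: "one month" is approximated as 30 days.
        limit = today - timedelta(days=30)
        return any(i.is_unpaid() and i.issued < limit for i in self.invoices)

class CustomerFactory:
    """Hypothetical factory: the invoice-customer link lives here, outside both classes."""

    @staticmethod
    def build(customer_rows, invoice_rows):
        customers = {cid: Customer(cid, name, email)
                     for cid, name, email in customer_rows}
        for cid, issued, invoiced, paid in invoice_rows:
            if cid in customers:
                customers[cid].invoices.append(Invoice(cid, issued, invoiced, paid))
        return list(customers.values())

customers = CustomerFactory.build(
    [("c1", "John", "john@example.com"), ("c2", "Mary", "mary@example.com")],
    [("c1", date(2024, 1, 10), 100.0, 40.0), ("c2", date(2024, 3, 20), 80.0, 80.0)],
)
# "Debtor" is obtained implicitly, by filtering objects where it is needed.
debtors = [c for c in customers if c.is_debtor(date(2024, 4, 1))]
```

Here "Invoice" and "Customer" are explicit classes, "unpaid invoice" and "debtor" exist only as predicates and a filter, and the link between invoices and customers is hidden in the factory.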



The implementation closest to the description is obtained in knowledge representation languages. Examples of such languages are Prolog, Datalog, OWL, Flora and others. I plan to talk about these languages in the third publication. They are based on first-order logic or its fragments, for example, description logic. These languages allow you to specify the solution to a problem declaratively, describing the structure of the modeled object or phenomenon and the expected result, and the built-in search mechanisms automatically find a solution that meets the specified conditions. The implementation of a domain model in such languages is extremely concise, understandable and close to a description in natural language.



For example, the implementation of the debtor problem in Prolog will be very close to the definitions from the example. The table cells need to be represented as facts, and the definitions from the example as rules. To match invoices with customers, it is enough to specify the relation between them in a rule, and their specific values will be inferred automatically.



First, we declare facts with the contents of the tables in the format: table ID, row, column, value:



cell("Table1", 1, 1, "John").


Then we give names to each of the columns:



clientId(Row, Value) :- cell("Table1", Row, 1, Value).


Then we can combine all the columns into one concept and define further concepts on top of it:



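% An invoice combines the named columns of one row (date/2, amountToPay/2 and amountPaid/2 are assumed to be defined like clientId/2).
bill(Row, ClientId, Date, AmountToPay, AmountPaid) :- clientId(Row, ClientId), date(Row, Date), amountToPay(Row, AmountToPay), amountPaid(Row, AmountPaid).
% An unpaid invoice is an invoice whose invoiced amount exceeds the payment.
unpaidBill(Row, ClientId, Date, AmountToPay, AmountPaid) :- bill(Row, ClientId, Date, AmountToPay, AmountPaid), AmountToPay > AmountPaid.
% A debtor is a customer with at least one unpaid invoice (client/3 is assumed to be built from the second table; the one-month condition is omitted here).
debtor(ClientId, Name, Email) :- client(ClientId, Name, Email), unpaidBill(_, ClientId, _, _, _).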


And so on for the remaining definitions.



Difficulties begin when we start working with the model: implementing the logic of sending messages, transferring data to other services, or performing complex algorithmic calculations. Prolog's weak point is describing sequences of actions. Their declarative implementation, even in simple cases, can look very unnatural and requires significant effort and skill. In addition, Prolog syntax is not very close to the object-oriented model, and descriptions of complex compound concepts with many attributes will be quite difficult to understand.



How do we reconcile the mainstream functional or object-oriented development language with the declarative nature of the domain model?



The best-known approach is Domain-Driven Design (DDD). This methodology facilitates the creation and implementation of complex domain models. It requires that all concepts of the model be expressed explicitly in the code of the business logic layer. The concepts of the model and the program elements that implement them should be as close to each other as possible and be named in a single language that both programmers and subject matter experts understand.



A rich domain model for the debtor example will additionally contain classes for the concepts "Unpaid invoice" and "Debtor", aggregate classes that combine invoices and customers, and factories for creating objects. Implementing and maintaining such a model is more time-consuming, and the code is cumbersome: what could previously be done in one line now requires several classes. As a result, in practice this approach only makes sense when large teams work on large, complex models.
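To give a feeling for the extra code involved, here is a minimal sketch of such a rich model, again in Python for illustration only. The class names follow the concepts from the example, while the hypothetical `DebtorFactory`, the `try_from` helper, the sample data and the 30-day reading of "one month" are my assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

@dataclass(frozen=True)
class Invoice:
    customer_id: str
    issued: date
    invoiced: float
    paid: float

@dataclass(frozen=True)
class UnpaidInvoice:
    """Explicit concept: an invoice with a positive debt."""
    invoice: Invoice

    @property
    def debt(self) -> float:
        return self.invoice.invoiced - self.invoice.paid

    @staticmethod
    def try_from(invoice: Invoice) -> Optional["UnpaidInvoice"]:
        return UnpaidInvoice(invoice) if invoice.invoiced > invoice.paid else None

@dataclass(frozen=True)
class Customer:
    customer_id: str
    name: str
    email: str

@dataclass(frozen=True)
class Debtor:
    """Explicit concept: a customer with at least one overdue unpaid invoice."""
    customer: Customer
    overdue: tuple

class DebtorFactory:
    def __init__(self, today: date):
        # Assumption: "one month" is approximated as 30 days.
        self.limit = today - timedelta(days=30)

    def create(self, customer: Customer, invoices: List[Invoice]) -> Optional[Debtor]:
        unpaid = (UnpaidInvoice.try_from(i) for i in invoices
                  if i.customer_id == customer.customer_id)
        overdue = tuple(u for u in unpaid
                        if u is not None and u.invoice.issued < self.limit)
        return Debtor(customer, overdue) if overdue else None

john = Customer("c1", "John", "john@example.com")
factory = DebtorFactory(date(2024, 4, 1))
debtor = factory.create(john, [Invoice("c1", date(2024, 1, 10), 100.0, 40.0)])
```

Every concept is now visible in the code under its own name, but a definition that takes one sentence in the specification has grown into a class plus a factory method.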



In some cases, the solution may be to combine a mainstream functional or object-oriented programming language with an external knowledge representation system. The domain model can be moved to an external knowledge base, for example, in Prolog or OWL, and the results of queries to it are processed at the application level. But this approach complicates the solution: the same entities have to be implemented in both languages, the interaction between them has to be set up through an API, the knowledge representation system has to be maintained as well, etc. Therefore, it is justified only if the model is large and complex and requires logical inference; for most tasks it will be overkill. In addition, the model cannot always be painlessly detached from the application.



Another option for combining knowledge bases and OOP applications is ontology-oriented programming. This approach is based on the similarities between ontology description tools and the object programming model. The classes, entities and attributes of an ontology written, for example, in OWL can be automatically mapped to the classes, objects and fields of the object model. The resulting classes can then be used together with the other classes of the application. Unfortunately, a basic implementation of this idea is rather limited in scope. Ontology languages are quite expressive, and not all ontology components can be converted into OOP classes in a simple and natural way. Also, a set of classes and objects alone is not enough to implement full-fledged inference: the inference engine needs information about the elements of the ontology in an explicit form, for example, in the form of meta-classes. I plan to talk about this approach in more detail in one of the next publications.
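The basic mapping idea can be sketched in a few lines. This is only an illustration in Python under strong assumptions: the ontology is a hypothetical, hand-written dictionary rather than real OWL, and `build_classes` is an invented helper, not the API of any existing tool.

```python
# Hypothetical, highly simplified ontology: class name -> parent and attributes.
# A real ontology would be written in a language such as OWL; a dict is used
# here only to keep the sketch self-contained.
ONTOLOGY = {
    "Customer": {"parent": None, "attributes": ["customer_id", "name", "email"]},
    "Debtor": {"parent": "Customer", "attributes": ["overdue_invoices"]},
}

def build_classes(ontology):
    """Map each ontology class to a Python class, building parents first."""
    classes = {}

    def build(name):
        if name in classes:
            return classes[name]
        spec = ontology[name]
        base = build(spec["parent"]) if spec["parent"] else object
        inherited = getattr(base, "_attributes", ())
        attributes = tuple(inherited) + tuple(spec["attributes"])

        def init(self, **values):
            # Every attribute known to the ontology becomes an object field.
            for attr in attributes:
                setattr(self, attr, values.get(attr))

        # The attribute list stays available on the class as simple metadata.
        classes[name] = type(name, (base,), {"__init__": init,
                                             "_attributes": attributes})
        return classes[name]

    for name in ontology:
        build(name)
    return classes

classes = build_classes(ONTOLOGY)
Debtor = classes["Debtor"]
d = Debtor(customer_id="c1", name="John", email="john@example.com",
           overdue_invoices=2)
```

Subclassing and attributes carry over naturally, but anything richer, such as the rule that decides when a customer is a debtor, has no obvious place in the generated classes.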



There is also a more extreme approach to software development, model-driven development. Under this approach, the main development task becomes the creation of domain models, from which the program code is then generated automatically. But in practice, such a radical solution is not always flexible enough, especially in terms of program performance. The creator of such models has to combine the roles of programmer and business analyst. As a result, this approach has not displaced the traditional way of implementing the model in general-purpose programming languages.



All these approaches are rather cumbersome and make sense for models of great complexity, which are often described separately from the logic of their use. I would like something lighter, more convenient and more natural, so that a single language could be used to describe both the model, in a declarative form, and the algorithms that use it. That is why I started thinking about how to combine the object-oriented or functional paradigm (let's call it the computation component) and the declarative paradigm (let's call it the modeling component) within a single hybrid programming language. At first glance, these paradigms look opposed to each other, which makes the attempt all the more interesting.



So, the goal is to create a language that is convenient for conceptual modeling based on semi-structured and disparate data. The form of the model should be close to an ontology and consist of a description of the domain entities and the relations between them. Both components of the language should be tightly integrated, including at the semantic level.



The elements of the ontology should be first-class entities of the language: it should be possible to pass them to functions as arguments, assign them to variables, etc. Since the model, that is, the ontology, will become one of the main elements of the program, this approach to programming can be called ontology-oriented. Combining the description of the model with the algorithms that use it would make the program code more understandable and natural for a person, bring it closer to the conceptual model of the domain, and simplify the development and maintenance of software.



That is enough for a start. In the next post, I want to talk about some modern technologies that combine the imperative and declarative styles: PL/SQL, Microsoft LINQ and GraphQL. For those who do not want to wait for all the publications to appear on Habr, the full text is available in English, in a scientific style, at the link:

Hybrid Ontology-Oriented Programming for Semi-Structured Data Processing.


