An introduction to compiler theory: lexical analysis of Pascal using C #

Introduction



Recently, most newcomers to programming start with high-level languages ​​such as Java, Python, C #, or any other language that contains a "gentleman's set" in the form of a garbage collector, ready-made data structures, and so on. Of course, this approach has its advantages, but, as a rule, a novice developer who uses the ready-made functionality of the language misses the most important thing - its structure and mechanisms of operation and implementation.



I will not go into the details of memory allocation and ways of interpreting the code, but on the contrary, I would like to talk about the compiler itself, namely the lexical analyzer, and try to implement it in C #. The overwhelming majority knows the language that we will analyze - it is Pascal.



The lexical analyzer is the first of the compiler's β€œlayers” responsible for highlighting tokens for further processing.



A lexeme is the smallest unit of a dictionary that represents our language. Service words, operators, identifiers and so on can serve as a token.



Implementation



Description of the structure



The formal description of the language will be stored in two arrays: in the first - service words, and in the second - delimiters and a list with found tokens



private string[] Words = { "program", "var", "integer", "real", "bool", "begin", "end", "if", "then", "else", "while", "do", "read", "write", "true", "false" };
private string[] Delimiter = { ".", ";", ",", "(", ")", "+", "-", "*", "/", "=", ">", "<" };
public List<Lex> Lexemes = new List<Lex>();




The lexeme itself will store the key, which will be used to determine the type membership (service words, operators, identifiers, numbers), the id of the token and the value itself.



class Lex
{
    public int id;
    public int lex;
    public string val;

    public Lex(int _id, int _lex, string _val)
    {
        id = _id;
        lex = _lex;
        val = _val;
    }
}


The best solution for processing tokens is a state machine. This will get rid of unnecessary if-s, and also make it easy to make changes to the loop. S is the initial state, NUM, DLM, ASGN, ID is the state of the corresponding types of tokens, ER will be used for error, and FIN for the final state.



private string buf = ""; //    
private char[] sm = new char[1];
private int dt = 0;
private enum States { S, NUM, DLM, FIN, ID, ER, ASGN, COM } //  state-
private States state; //   
private StringReader sr; //    


The main methods are SearchLex, which looks for a token in our array and returns its id and value in a tuple (yes, tuples can be useful too), and PushLex, which adds a new token to the dictionary.




private (int, string) SerchLex(string[] lexes)
{
    var srh = Array.FindIndex(lexes, s => s.Equals(buf)); 
    if (srh != -1)
        return (srh, buf);             
    else return (-1, "");
}

private (int, string) PushLex(string[] lexes, string buf)
{
    var srh = Array.FindIndex(lexes, s => s.Equals(buf));
    if (srh != -1)
        return (-1, "");
    else
    {
        Array.Resize(ref lexes, lexes.Length + 1);
        lexes[lexes.Length - 1] = buf;
        return (lexes.Length - 1, buf);
    }
}


Algorithm implementation



The first step is to determine the end of the cycle - the FIN state, and also to implement the initial state, which will be



sr = new StringReader(text); //    
while (state != States.FIN)
{
    switch (state)
    {
        case States.S:
            if (sm[0] == ' ' || sm[0] == '\n' || sm[0] == '\t' || sm[0] == '\0' || sm[0] == '\r' )
                GetNext();
            else if (Char.IsLetter(sm[0]))
            {
                ClearBuf();
                AddBuf(sm[0]);
                state = States.ID;
                GetNext();
            }
            else if (char.IsDigit(sm[0]))
            {
                dt = (int)(sm[0]-'0');
                GetNext();
                state = States.NUM;
                
            }
            else if (sm[0] == '{')
            {
                state = States.COM;
                GetNext();
            }
            else if (sm[0] == ':')
            {
                state = States.ASGN;
                ClearBuf();
                AddBuf(sm[0]);
                GetNext();
            }
            else if (sm[0] == '.')
            {
                AddLex(Lexemes, 2, 0, sm[0].ToString());
                state = States.FIN;
            }
            else
            {
                state = States.DLM;
            }
        break;
    }  
}


The GetNext method allows you to get the next character in the string, ClearBuf, respectively, clears the buffer that stores the token



private void GetNext()
{
    sr.Read(sm, 0, 1);
}


Particular attention should be paid to the ": =" assignment operator, which consists of two separate operators. The simplest way to define this operator is to add a condition and write an intermediate value to the buffer. For this, a separate ASGN state was implemented ( in translation, assign - assignment ). If the buffer is defined as ":", the algorithm will simply add a new token, and if the next sign is "=", then one assignment operator will be added.



case States.ASGN:
    if (sm[0] == '=')
    {
        AddBuf(sm[0]);
        AddLex(Lexemes, 2, 4, buf);
        ClearBuf();
        GetNext();
    }
    else
        AddLex(Lexemes, 2, 3, buf);
    state = States.S;
break;


The final state and the state with an error are implemented only by service messages. You can improve this option and check the error as well, but perhaps this functionality can be transferred to the parser.



case States.ER:
    MessageBox.Show("  ");
    state = States.FIN;
    break;
case States.FIN:
    MessageBox.Show("  ");
    break;


Testing



You can test the algorithm in different ways: specify the path of the .pas file directly , programmatically create a string, or any other convenient option. Since we are writing in C #, it will not be difficult to add a form to the application, which will contain 2 textBoxes, the first for entering the program code, the second - displays the result of the algorithm.



By pressing the button, we will start the text analysis, and the resulting result will be processed using the switch construction: in addition, we will display the type of the found token.



private void button1_Click(object sender, EventArgs e)
{
    textBox2.Clear();
    TplMain tpl = new TplMain();
    tpl.Analysis(textBox1.Text);
    
    foreach(var lex in tpl.Lexemes)
    {
        switch (lex.id)
        {
            case 1:
                textBox2.Text += "id: " + lex.id + " lex: " + lex.lex + " val: " + lex.val + " |" + "   "+ Environment.NewLine;
                break;
            case 2:
                textBox2.Text += "id: " + lex.id + " lex: " + lex.lex + " val: " + lex.val + " |" + "  " + Environment.NewLine;
                break;
            case 3:
                textBox2.Text += "id: " + lex.id + " lex: " + lex.lex + " val: " + lex.val + " |" + "  " + Environment.NewLine;
                break;
            case 4:
                textBox2.Text += "id: " + lex.id + " lex: " + lex.lex + " val: " + lex.val + " |" + "  " + Environment.NewLine;
                break;
                
        }     
    }       
}


Input data



program hellohabr;
	var a, b, c : integer;
begin
	c := a - b + 15;
end.


Output



id: 1 lex: 0 val: program |   
id: 4 lex: 1 val: hellohabr |  
id: 2 lex: 1 val: ; |  
id: 1 lex: 1 val: var |   
id: 4 lex: 1 val: a |  
id: 2 lex: 2 val: , |  
id: 4 lex: 1 val: b |  
id: 2 lex: 2 val: , |  
id: 4 lex: 1 val: c |  
id: 2 lex: 3 val: : |  
id: 1 lex: 2 val: integer |   
id: 2 lex: 1 val: ; |  
id: 1 lex: 5 val: begin |   
id: 4 lex: 1 val: c |  
id: 2 lex: 4 val: := |  
id: 4 lex: 1 val: a |  
id: 2 lex: 6 val: - |  
id: 4 lex: 1 val: b |  
id: 2 lex: 5 val: + |  
id: 3 lex: 1 val: 15 |  
id: 2 lex: 1 val: ; |  
id: 1 lex: 6 val: end |   
id: 2 lex: 0 val: . |  


Complete algorithm



public void Analysis(string text)
{
    sr = new StringReader(text);
    while (state != States.FIN)
    {
        switch (state)
        {

            case States.S:
                if (sm[0] == ' ' || sm[0] == '\n' || sm[0] == '\t' || sm[0] == '\0' || sm[0] == '\r')
                    GetNext();
                else if (Char.IsLetter(sm[0]))
                {
                    ClearBuf();
                    AddBuf(sm[0]);
                    state = States.ID;
                    GetNext();
                }
                else if (char.IsDigit(sm[0]))
                {
                    dt = (int)(sm[0] - '0');
                    GetNext();
                    state = States.NUM;

                }
                else if (sm[0] == '{')
                {
                    state = States.COM;
                    GetNext();
                }
                else if (sm[0] == ':')
                {
                    state = States.ASGN;
                    ClearBuf();
                    AddBuf(sm[0]);
                    GetNext();
                }
                else if (sm[0] == '.')
                {
                    AddLex(Lexemes, 2, 0, sm[0].ToString());
                    state = States.FIN;
                }
                else
                {
                    state = States.DLM;

                }

                break;
            case States.ID:
                if (Char.IsLetterOrDigit(sm[0]))
                {
                    AddBuf(sm[0]);
                    GetNext();
                }
                else
                {
                    var srch = SerchLex(Words);
                    if (srch.Item1 != -1)
                        AddLex(Lexemes, 1, srch.Item1, srch.Item2);
                    else
                    {
                        var j = PushLex(TID, buf);
                        AddLex(Lexemes, 4, j.Item1, j.Item2);
                    }
                    state = States.S;
                }
                break;

            case States.NUM:
                if (Char.IsDigit(sm[0]))
                {
                    dt = dt * 10 + (int)(sm[0] - '0');
                    GetNext();
                }
                else
                {

                    var j = PushLex(TNUM, dt.ToString());
                    AddLex(Lexemes, 3, j.Item1, j.Item2);
                    state = States.S;
                }
                break;
            case States.DLM:
                ClearBuf();
                AddBuf(sm[0]);

                var r = SerchLex(Delimiter);
                if (r.Item1 != -1)
                {
                    AddLex(Lexemes, 2, r.Item1, r.Item2);
                    state = States.S;
                    GetNext();
                }
                else
                    state = States.ER;
                break;
            case States.ASGN:
                if (sm[0] == '=')
                {
                    AddBuf(sm[0]);
                    AddLex(Lexemes, 2, 4, buf);
                    ClearBuf();
                    GetNext();
                }
                else
                    AddLex(Lexemes, 2, 3, buf);
                state = States.S;

                break;
            case States.ER:
                MessageBox.Show("  ");
                state = States.FIN;
                break;
            case States.FIN:
                MessageBox.Show("  ");
                break;
        }
    }
}


Conclusion



It may seem that the lexical analyzer is not a very clear thing, and actually not very important. Why can't you put all this in a parser? How to work with complex structures? Yes, the ways to implement the lexical analyzer differ from compiler to compiler, but when you analyze all these principles, you will not only understand how the X programming language works, but you will also have a foundation for developing your own programming language: a second Python, or a language for your domain - all this is possible to realize with understanding of all the specifics of the work and the device of the compiler in general.



The project can be found here



All Articles