Skip to main content

Building a Lexical Analyzer - Compiler Frontend I

Hey guys, today I'll be building a fully functional Lexical Analyzer that detects unrecognized tokens in C++!


The foundation of this program is derived from another program that counts different tokens in a C/C++ program. While I will try to explain all the basics of Lex here, if you find yourself stranded, you can refer the above link too. Once you understand the method of identifying tokens, the following program is just the converse of it!
Here's a basic overview of what happens in a Lex Program:

Some basics about a Lex Program

Lex (FLEX) basically iterates through your entire input string, trying to match everything it finds in the rules section. Once it finds the pattern, it executes the set of lines listed within the {} brackets.
So you can consider the "rules" an informal way of function declaration.

Rule Section in a Lex Program:                         Rule Section in a C/C++ Program:

It is important to note that Flex does not scan the input character-by-character. Remember, the rules section is just like declaring a regular expression. 
And if you have completed your Second Year, you might know the answer to What identifies a Regular Expression? 

A DFA.  

So the input string is validated as long as the character sequence that are coming in, satisfy the rule. That is why, the rule [0-9] is different from [0-9]+ 

For a more comprehensive information about the rules section, I would direct you to refer GFG's page on FLEX.
If you want to further understand the inner workings of a Lex Program, refer the 'Lex - Theory' Section of this PDF

The Code:

The Explanation:

So now that I've shown you the code, let's walk through the process of building a Lexical Analyzer. 

When should we output an error?
As soon as we see any non-keyword string. Like, 
fore(int i = 0....
Yes, that should give an output as an unrecognized token, while all the other keywords, operators, seperators should be accepted as perfectly valid tokens.

Yet, we are forgetting something incredibly important: identifiers.
Ofcourse! Right when you would write int a, it would return an unrecognized token. That shouldn't happen.

So now we need to make sure that as soon as the 'int', 'float', 'double', etc keywords are encountered, the subsequent strings are passed/accepted until the ';' is encountered.
int a, b, c;
To do that, I added an int variable in the Definitions Section which I'm using as a boolean value to toggle the acceptance of identifiers. You can see that in the rules section of "int" & "float", I'm setting the identify variable and in the ";" section, I'm resetting the variable again.

If the lex program encounters anything other than keywords, operators & seperators, it will now check if it is accepting identifiers through the identify variable.

Great, problem solved. Or is it?
We've solved the integer declaration problem, but what if someone tries to initialize the variable after declaring it? No answers for that!
a = 5;
So we need to save the declared variables, like a...(yes, you guessed it right!) Symbol Table. The main crux of the Compiler and we've finally realized its importance through our implementation failures!

Ideally, to construct the Symbol Table we should build a map. But since the code sections should be in C, I went ahead with using a 2D Character Array in the program, just to store the string values of the identifiers. This can be further improvised to build a Hash Table, but for now, let's just focus on accepting the right tokens.

So when we're deciding if the token is an identifier, all we need to do is just add the identifier if the identify variable is set, or else,
check the 2D Char Array for the occurence of the String in it. If it is present there, accept it as an identifier.
If all this fails, its an invalid token. You're free to output the most savage errors you can imagine.

If you look at the program, you might still not find all the keywords in it. You are free to add all the tokens in the "while"|"if".... section. It's a child's play now that you've understood the working of a Lexical Analyzer, the first part of the Compiler FrontEnd 


Popular posts from this blog

i3wm essentials - II

Welcome back! Let's continue this guide with other setup essentials for i3. Enabling Mousetap Chances are that if you're using a laptop, then tapping on the mousepad does not equal a click for you. You need to enable tapping in your config. Fortunately, there is one documentation available that works for majority of the setups. I don't need to explain this one in detail. Here you go: Enable tap to click in i3 . Volume Control This one is simple again. Do you remember the i3 config file I talked about in the previous blog ? All you need to do is go to that file and find the line: bindsym XF86AudioRaiseVolume Just below that line you will find lines with XF86AudioLowerVolume and XF86AudioMute too. Anyway, the truth is, there are 2 sets of lines with these keywords. Chances are that the line: bindsym XF86AudioRaiseVolume exec --no-startup-id pactl -- set-sink-volume 0 +5% Will be uncommented and the line: bindsym XF86AudioRaiseVolume exec --no-startup-id pactl -- set-sink vo

Namaste JavaScript Quick Notes

Note:  Akshay Saini's Namaste JavaScript is probably the best course for JavaScript developers out there. These are my personal notes that I made while watching the course; they serve more of as an online quick reference for my understanding and revision, and I hope it benefits anyone reading it too! Everything in JS happens inside an Execution Context. Before a JS code is run, memory is allocated and variables are set as undefined   , and functions are set as their exact code in the scope within the Execution Context. The global execution context hosts all the global variables and function definitions. An Execution Context has 2 components: Memory, that stores variables and functions; and Code, that reads and executes the code. Call Stack maintains the order of execution contexts. Since JS is single threaded and asynchronous, at one point of time, only one function is executed which is at the top of the call stack. For each function, an execution context is created before executi

i3wm essentials - I (Brightness)

So you have started using i3 and somehow managed to open your browser and almost resumed your normal work.  But wait, the brightness is too much isn't it? Or is it too low? The mousepad used to work fine, but now all of a sudden tapping does not equal click?!  Don't worry.  This blog series will tell you all about the essential setup commands and common shortcuts that I use to navigate my work in i3, and how you can too. Changing the brightness So you just started i3 and you just can't take this brightness setting. You go for your function keys, and damn! They aren't working. Quick fix: Run the following command if you need to change the brightness ASAP. xrandr -q | grep ' connected' | head -n 1 | cut -d ' ' -f1 This will give an ouput that's the name of your monitor.  Use that monitor name here and change the values of brightness to suit your needs. xrandr --output <monitor-name> --brightness 0.7 Now that your eyes are comfortable, let me show