How to write a lexer in C - part 1

A lexer translates text into tokens that you can use in your application. With these tokens you could build your own programming language or a JSON parser, for example. Since I’m a fan of OOP, you’ll see that I applied an OO approach. The code quality is good according to ChatGPT, and according to valgrind there are no memory leaks. In this part we’re going to create:

  • token struct
  • token tests
  • Makefile

Requirements:

  • gcc
  • make

Writing token.h
Create a new file called “token.h”.
Add an include guard:

#ifndef TOKEN_H_INCLUDED
#define TOKEN_H_INCLUDED

// Where the code goes

#endif

This prevents the contents of the file from being included twice in the same source file.
Include the headers we need:

#include <string.h>
#include <stdlib.h>

Define config:

#define TOKEN_LEXEME_SIZE 256

This means a lexeme is limited to 255 characters plus the terminating null byte. Longer strings are not supported; dynamic memory allocation is beyond the scope of this tutorial.
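Because the buffer has a fixed size, every copy into it has to be bounded. Here is a minimal sketch of one way to do that, assuming a hypothetical helper named lexeme_copy (it is not part of this tutorial's token.h) that truncates over-long input and reports whether the text fit:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOKEN_LEXEME_SIZE 256

/* Hypothetical helper, not part of token.h.
 * Copies src into a buffer of TOKEN_LEXEME_SIZE bytes, truncating if
 * necessary. Returns 1 if the whole string fit, 0 if it was cut short. */
static int lexeme_copy(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);

    size_t len = strlen(src);
    int fits = len < TOKEN_LEXEME_SIZE; /* one byte is reserved for '\0' */

    if (!fits)
        len = TOKEN_LEXEME_SIZE - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
    return fits;
}
```

With this sketch, a 255-character lexeme fits exactly, while a 256-character one is cut to 255 characters and the function returns 0.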
Define the token struct:

typedef struct Token {
    char lexeme[TOKEN_LEXEME_SIZE];
    int line;
    int col;
    struct Token * next;
    struct Token * prev;
} Token;

Implement the new function. This instantiates a Token with default values.

Token * token_new(void){
    Token * token = malloc(sizeof(Token));
    if(token == NULL)
        return NULL; // Out of memory
    memset(token->lexeme, 0, TOKEN_LEXEME_SIZE);
    token->line = 0;
    token->col = 0;
    token->next = NULL;
    token->prev = NULL;
    return token;
}
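As an aside, calloc can zero the whole struct in one call, which would stand in for the memset and the two integer assignments. This is a sketch under that assumption, using the hypothetical name token_new_zeroed and repeating the struct so the snippet compiles on its own:

```c
#include <assert.h>
#include <stdlib.h>

#define TOKEN_LEXEME_SIZE 256

typedef struct Token {
    char lexeme[TOKEN_LEXEME_SIZE];
    int line;
    int col;
    struct Token *next;
    struct Token *prev;
} Token;

/* Hypothetical variant of token_new(), not the tutorial's version. */
static Token *token_new_zeroed(void)
{
    /* calloc sets every byte to zero: empty lexeme, line 0, col 0. */
    Token *token = calloc(1, sizeof(Token));
    if (token == NULL)
        return NULL; /* out of memory */

    /* All-zero bytes are a null pointer on common platforms, but the C
     * standard does not guarantee it, so set the links explicitly. */
    token->next = NULL;
    token->prev = NULL;
    return token;
}

/* Self-check in the style of this tutorial's tests. */
static void token_new_zeroed_check(void)
{
    Token *token = token_new_zeroed();
    assert(token != NULL);
    assert(token->lexeme[0] == '\0');
    assert(token->line == 0 && token->col == 0);
    assert(token->next == NULL && token->prev == NULL);
    free(token);
}
```

Calling token_new_zeroed_check() from a main function runs the same default-value checks as the tests later in this part.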

Implement the init function. This instantiates a Token with the given parameters as values.

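Token * token_init(Token * prev, const char * lexeme, int line, int col){
    Token * token = token_new();
    if(token == NULL)
        return NULL;
    token->line = line;
    token->col = col;
    if(prev != NULL){
        token->prev = prev;
        prev->next = token;
    }
    // Copy at most TOKEN_LEXEME_SIZE - 1 chars so a long lexeme
    // cannot overflow the buffer, then terminate explicitly.
    strncpy(token->lexeme, lexeme, TOKEN_LEXEME_SIZE - 1);
    token->lexeme[TOKEN_LEXEME_SIZE - 1] = '\0';
    return token;
}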
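The prev and next links that token_init sets up form a doubly linked list, so any single token gives access to the whole chain. Here is a minimal sketch of walking it, assuming two hypothetical helpers (token_first and token_count, neither of which is defined in token.h) and a repeated struct definition so the snippet stands alone:

```c
#include <assert.h>
#include <stddef.h>

#define TOKEN_LEXEME_SIZE 256

typedef struct Token {
    char lexeme[TOKEN_LEXEME_SIZE];
    int line;
    int col;
    struct Token *next;
    struct Token *prev;
} Token;

/* Hypothetical helper: rewind to the first token of the chain. */
static Token *token_first(Token *token)
{
    if (token == NULL)
        return NULL;
    while (token->prev != NULL)
        token = token->prev;
    return token;
}

/* Hypothetical helper: count every token in the chain. */
static size_t token_count(Token *token)
{
    size_t n = 0;
    for (Token *t = token_first(token); t != NULL; t = t->next)
        n++;
    return n;
}

/* Self-check in the style of this tutorial's tests. */
static void token_count_check(void)
{
    Token a = { "let", 1, 1, NULL, NULL };
    Token b = { "x",   1, 5, NULL, NULL };
    Token c = { "=",   1, 7, NULL, NULL };

    /* Link a <-> b <-> c by hand, the way token_init would. */
    a.next = &b; b.prev = &a;
    b.next = &c; c.prev = &b;

    assert(token_first(&c) == &a);
    assert(token_count(&b) == 3);
    assert(token_count(NULL) == 0);
}
```

The tokens in the self-check live on the stack only to keep the example short; tokens created with token_init are heap-allocated and linked in the same way.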

Implement the free function. This is our destructor.
It will:

  • find the first token, starting from the given token
  • call itself for the next token until the whole list is freed

void token_free(Token * token){
    if(token == NULL)
        return; // Nothing to free

    // Find first token
    while(token->prev != NULL)
        token = token->prev;

    // Detach the next token so the recursive call treats it as the first
    Token * next = token->next;
    if(next){
        next->prev = NULL;
        token_free(next);
    }
    free(token);
}
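The recursive destructor above goes one level deeper for every token in the list, which could become a concern for very long token streams. One possible alternative is a loop. This is only a sketch, assuming a hypothetical function named token_free_all that also returns how many tokens it released (the tutorial's token_free returns nothing):

```c
#include <assert.h>
#include <stdlib.h>

#define TOKEN_LEXEME_SIZE 256

typedef struct Token {
    char lexeme[TOKEN_LEXEME_SIZE];
    int line;
    int col;
    struct Token *next;
    struct Token *prev;
} Token;

/* Hypothetical iterative alternative to token_free().
 * Frees the whole chain that token belongs to and returns the number
 * of tokens freed. Passing NULL is allowed and frees nothing. */
static size_t token_free_all(Token *token)
{
    size_t freed = 0;

    if (token == NULL)
        return 0;

    /* Rewind to the first token. */
    while (token->prev != NULL)
        token = token->prev;

    /* Free front to back, saving next before the node is released. */
    while (token != NULL) {
        Token *next = token->next;
        free(token);
        token = next;
        freed++;
    }
    return freed;
}

/* Self-check in the style of this tutorial's tests. */
static void token_free_all_check(void)
{
    Token *a = calloc(1, sizeof(Token));
    Token *b = calloc(1, sizeof(Token));
    assert(a != NULL && b != NULL);

    a->prev = NULL; a->next = b;
    b->prev = a;    b->next = NULL;

    size_t n = token_free_all(b); /* start from the last token */
    assert(n == 2);
}
```

Both versions free the full list no matter which token is passed in; the loop simply trades the recursion for a saved next pointer.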

Testing
Now it’s time to build some tests using assert.
Create a new file called “token_test.c”.
Add this:

#include <assert.h>
#include <stdio.h>
#include "token.h"

int main(void)
{
    // Check default values
    Token *token = token_new();
    assert(token->next == NULL);
    assert(token->prev == NULL);
    assert(token->line == 0);
    assert(token->col == 0);
    assert(strlen(token->lexeme) == 0);

    // Test init function
    Token *token2 = token_init(token, "print", 1, 3);
    assert(token->next == token2);
    assert(token2->prev == token);
    assert(token2->line == 1);
    assert(token2->col == 3);
    assert(!strcmp(token2->lexeme, "print"));
    token_free(token2);

    printf("Tests successful\n");
    return 0;
}

Now we have a token implementation and a test program. Let’s make compiling and running them easy using a Makefile.

Makefile
Create a file named “Makefile”. The indented command lines must start with a tab character, not spaces.

.PHONY: all tests

all: tests

tests:
	gcc token_test.c -o token_test
	./token_test

Run make. If every assertion passes, the test program prints its success message.
That’s it!

So, we created the Token struct, which is required for the lexer in the next part of this tutorial.

If something is not working or you need help, send a message.