
Gramur

Introduction

Gramur is a minimal Context-Free Grammar (CFG) lexer-parser generator written in Kotlin. Given a Grammar declaration, this library will transform raw character input into an Abstract Syntax Tree (AST) according to any production rules you declare. A straightforward DSL is provided for quickly defining grammars, to get you from raw input to well-structured contextual data with as little work as possible.

Gramur is intended for use in Kotlin applications. Other JVM-based languages may be able to utilize Gramur, but no support is explicitly provided or guaranteed.

Key Goals

Gramur is intended to bring you from raw text input to a fully-constructed AST according to your grammar specifications as quickly as possible. The scope of Gramur is narrowly limited to this, and does not extend to post-processing, machine translation, or code generation beyond the construction of the AST.

Gramur supports Regular (Type-3) and Context-Free (Type-2) Grammars according to the Chomsky Hierarchy. Gramur supports both left- and right-recursive grammars, as well as complex recursive nonterminal configurations, by maintaining an internal pushdown stack of nonterminal frames which guards against infinite recursion.

Obtaining Gramur

Building from Source

Gramur is built using Gradle, which requires a Java Runtime. JRE 24+ is needed to build and use Gramur; verify that a compatible JRE is installed and available on the system path by running java -version.

  1. Clone the repository using Git:

    git clone https://forgejo.archandle.net/vivi/gramur.git
    cd gramur
    
  2. Build the project using the included Gradle wrapper:

    ./gradlew build
    # or `./gradlew.bat build` on Windows
    

The resulting JAR will be located in build/libs/.

Downloading a Prebuilt JAR

See Releases for prebuilt JARs.

Adding as a Dependency

Gramur is available as a Maven dependency on the AS Forgejo Repository.

To add Gramur as a dependency to your project using Gradle, add the following to your build.gradle.kts (replace <version> with your desired Gramur version):

repositories {
	maven {
		url = uri("https://forgejo.archandle.net/api/packages/vivi/maven")
	}
}

dependencies {
	implementation("net.archandle.gramur:Gramur:<version>")
}
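For Maven-based builds, only the Gradle form is documented; assuming the same repository URL and coordinates, an equivalent pom.xml configuration would look roughly like this (replace the version placeholder as above):

```xml
<repositories>
	<repository>
		<id>as-forgejo</id>
		<url>https://forgejo.archandle.net/api/packages/vivi/maven</url>
	</repository>
</repositories>

<dependencies>
	<dependency>
		<groupId>net.archandle.gramur</groupId>
		<artifactId>Gramur</artifactId>
		<!-- Replace with your desired Gramur version -->
		<version>VERSION</version>
	</dependency>
</dependencies>
```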

Usage Summary

Gramur is conceptually divided into two phases: Lexing and Parsing.

Lexing: Lexer, Lexemes, Tokens

The Lexer transforms character input into a stream of tokens based on defined Lexemes. You can create a Lexer instance directly, or build one using lexeme(...) DSL statements inside a Grammar declaration. Call Lexer.lex(input) to begin lexing.

The Lexer repeatedly attempts to match Lexemes, in the order they are defined, against the input at the current position. When a Lexeme matches, the corresponding Token is emitted and the Lexer advances by one or more characters. If no Lexeme matches at the current position, the Lexer raises an exception. lex() succeeds if and only if the entire input sequence is consumed.

A set of predefined common Lexemes is provided in BasicLexemes, which produce Token.Generic tokens. If you require metadata-bearing tokens or complex Lexeme matching, you can also define your own Lexemes by implementing Lexeme or extending AbstractLexeme.

  • Lexeme: Defines a pattern for recognizing tokens (e.g., whitespace, identifiers, numbers)
  • Token: The result of matching a lexeme against input
  • Lexer: Processes character sequences and produces tokens

Parsing: Rules, Productions, PatternElements, Nodes

Unlike the Lexer, there is no Parser instance; parsing is performed directly on a Grammar definition, utilizing the named rules defined in the Grammar.

Rules are patterns of token-matching elements similar to Lexemes, but operate on Tokens rather than raw input, and produce AST Nodes. Within a Grammar definition, use rule(...) DSL statements to define production rules recognized by your grammar. Call Grammar.parse(tokens) to begin parsing. You can also call Grammar.parse(input: CharSequence) to implicitly invoke the Lexer defined within the Grammar and parse the resulting tokens in one step.

Rules consist of PatternElements which outline the sequence of tokens and nonterminal expansions expected from the input Token stream in order to match the rule. Rules can include more than one Production; each Production is a sequence of PatternElements that, when matched, produce a Node according to that Production's Node factory function.

  • Rule: A nonterminal symbol which includes one or more Productions to match against the input Token stream
  • Production: One of the possible ways a nonterminal Rule can match against input, producing a Node
  • PatternElement: A syntactic element of a Production which defines the constraints on how input can be matched
  • Node: Represents a hierarchical AST node providing structural & semantic context for the input Token stream

Building a Grammar using the included DSL

Gramur provides an easy-to-use DSL for defining grammars. Within a Grammar { } scope, use rule(name) { ... } to begin defining a new Rule. For each Production in the Rule, use the available DSL functions to define the sequence of PatternElements for this Production, then call produces { ... } to define how the Tokens captured by this pattern should be transformed into a Node.

Below is an example of a simple grammar which parses basic arithmetic expressions, with support for signed integers, multiplicative operator precedence, basic binary operators (+ - * /), and parentheses for grouping:

object ExpressionGrammar {
	sealed class ExpressionNode: Node()

	data class IntNode(val value: Int): ExpressionNode()
	data class ParenNode(val expr: ExpressionNode): ExpressionNode()
	data class BinOpNode(
		val left: ExpressionNode,
		val op: Operator,
		val right: ExpressionNode
	): ExpressionNode()

	enum class Operator {
		ADD, SUB, MUL, DIV
	}

	val grammar = Grammar("sum") {
		lexeme(BasicLexemes.Whitespace)
		// IntBearer is a custom metadata-bearing Lexeme (not part of BasicLexemes);
		// see "define your own Lexemes" in the Lexing section above
		val intLexeme = lexeme(TestLexeme.IntBearer())
		val openParen = lexeme(BasicLexemes.Char('('))
		val closeParen = lexeme(BasicLexemes.Char(')'))
		val addOpLexeme = lexeme(BasicLexemes.CharChoice('+', '-'))
		val mulOpLexeme = lexeme(BasicLexemes.CharChoice('*', '/'))

		lateinit var sum: Rule<ExpressionNode>

		val factor: Rule<ExpressionNode> = rule("factor") {
			token(intLexeme)
				.produces {
					IntNode(it.tokenAtAs<TestToken.IntBearer>(0).intValue)
				}

			token(openParen)
				.reference { sum }
				.token(closeParen)
				.produces { ParenNode(it.nodeAtAs<ExpressionNode>(1)) }
		}

		val term: Rule<ExpressionNode> = rule("term") {
			leftRecursion()
				.token(mulOpLexeme)
				.reference { factor }
				.produces {
					BinOpNode(
						it.nodeAtAs<ExpressionNode>(0),
						if (it.tokenAt(1).value == "*") Operator.MUL else Operator.DIV,
						it.nodeAtAs<ExpressionNode>(2)
					)
				}

			reference { factor }
				.produces { it.nodeAtAs<ExpressionNode>(0) }
		}

		sum = rule("sum") {
			leftRecursion()
				.token(addOpLexeme)
				.reference { term }
				.produces {
					BinOpNode(
						it.nodeAtAs<ExpressionNode>(0),
						if (it.tokenAt(1).value == "+") Operator.ADD else Operator.SUB,
						it.nodeAtAs<ExpressionNode>(2)
					)
				}

			reference { term }
				.produces { it.nodeAtAs<ExpressionNode>(0) }
		}
	}
}

Matching against an input is achieved with the following sample call:

val input = "(1 + 2) * 3"
val ast = ExpressionGrammar.grammar.parse(input, rule = "sum")

/*
 * ast ==
 * BinOpNode(
 *    left = ParenNode(
 *       BinOpNode(
 *          IntNode(1),
 *          Operator.ADD,
 *          IntNode(2)
 *       )
 *    ),
 *    op = Operator.MUL,
 *    right = IntNode(3)
 * )
 */

ast is the root node of the produced AST, ready for further processing by your application logic.
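As a sketch of such post-processing, the AST from the example above can be folded into a value with a recursive walk. This uses only the node classes defined in ExpressionGrammar itself; no additional Gramur APIs are required:

```kotlin
// Evaluates an ExpressionNode tree produced by the grammar above.
// Pure AST traversal: the `when` is exhaustive over the sealed hierarchy.
fun eval(node: ExpressionGrammar.ExpressionNode): Int = when (node) {
	is ExpressionGrammar.IntNode -> node.value
	is ExpressionGrammar.ParenNode -> eval(node.expr)
	is ExpressionGrammar.BinOpNode -> {
		val left = eval(node.left)
		val right = eval(node.right)
		when (node.op) {
			ExpressionGrammar.Operator.ADD -> left + right
			ExpressionGrammar.Operator.SUB -> left - right
			ExpressionGrammar.Operator.MUL -> left * right
			ExpressionGrammar.Operator.DIV -> left / right
		}
	}
}

// eval(ast) == 9 for the input "(1 + 2) * 3"
```

Because ExpressionNode is a sealed class, the compiler verifies that every node type is handled, which makes this pattern a natural fit for AST consumers.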

License

Gramur is licensed under the Apache License, Version 2.0.

AI/LLM Usage Disclosure

This software is written with LLM tooling assistance. Please do not download, install, or depend on Gramur in any environment or toolchain whose policy forbids AI usage.

Usage Notes

  • A local Kotlin-focused model is used for inline code completion.
  • Beyond simple inline code completion, all final production source code is written only by a human.
  • Online chat models (Claude Sonnet 4.5, ChatGPT-5.2) are consulted post-hoc for code review, test suggestions (test logic written by a human), and trivial bug-hunting.
  • All documentation and KDoc are initially drafted using the Junie Agent running a Claude Sonnet 4.5 Code model. The documentation is then manually reviewed and touched up by a human.