Creating a Contract Document Generator using Falcon-7b

Kartik Madan
5 min read · Sep 24, 2023


OpenAI launched ChatGPT in November 2022, and AI content generation has taken off since. While hundreds debate whether it can replace humans, innovative software is proving that it only makes us more powerful.

I recently worked on the generation of legal contracts, using a similar large language model to generate business contracts and analyze them. The idea was simple: a user would create such legal contractual documents in an editor like Microsoft Word, while a quantized generative model hosted on AWS was consumed through a plugin directly inside MS Word.

The objectives were manifold: training a model on existing contracts and extracting parameters, or rules, from them; using those rules to generate documents; analyzing and comparing two contracts; and providing the user an easy UI/UX for interacting with the document by highlighting these rules.

Contractual documents generally contain paragraphs and multiple conditions which can be extracted as the aforementioned rules. A single contract can have hundreds of such rules, and the model can only identify them if it has been trained on that specific type of document (trade, simple business, licensing, etc.).

For instance, delivery details are part of almost all contract agreements. The model should identify this from the text and generate trainable parameters like:

delivery_date = 24/09/2025
delivery_country = India
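As a toy illustration of the idea (the regex patterns and field names here are assumptions for this sketch; the real system used a trained NLP model, not regular expressions):

```python
# Toy illustration: pulling delivery parameters out of contract text
# with regular expressions. The production system used a trained model.
import re

def extract_delivery_params(text):
    params = {}
    # Hypothetical patterns for a date like 24/09/2025 and a country name.
    date = re.search(r"delivered\s+(?:on|by)\s+(\d{2}/\d{2}/\d{4})", text)
    country = re.search(r"delivered\s+.*?\bto\s+([A-Z][a-z]+)", text)
    if date:
        params["delivery_date"] = date.group(1)
    if country:
        params["delivery_country"] = country.group(1)
    return params
```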

Word Add-In

The entire ML pipeline was set up in AWS, but the UI was a Word add-in that handled the different operations. With Word add-ins, we could use familiar web technologies such as HTML, CSS, and JavaScript to build a solution that runs in Word across multiple platforms, including the web, Windows, Mac, and iPad.

The EasyContracts Word add-in
await Word.run(async (context) => {
  const paragraphs = context.document.getSelection().paragraphs;
  paragraphs.load();
  await context.sync();
  paragraphs.items[0].insertText(
    ' New sentence in the paragraph.',
    Word.InsertLocation.end
  );
  await context.sync();
});

Extracting the rules

Extracting the rules involved a lot of natural language processing, and for that we had to preprocess the initial trade-related contract data. A plot of the distribution of words in the contract text showed that most words had a very low frequency; these are also the words that carry the most context.

Distribution of words in the Contract Text

The words with high frequency are generally stop words, which can be filtered out as they don't add meaning. The following steps were followed to prepare the dataset.

1. Tokenizing the words in Content Text
2. Multi-label binarizing the business rules
3. Removing stop words
4. Splitting Data into Train & Test with 80:20 split
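The steps above can be sketched in plain Python (the stop-word list and helper names are illustrative assumptions, not the production pipeline):

```python
# Minimal sketch of the dataset-preparation steps: tokenization,
# stop-word removal, multi-label binarization, and an 80:20 split.
import random

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "shall", "be"}

def tokenize(text):
    """Lowercase and split contract text into word tokens."""
    return [w.strip(".,;:()").lower() for w in text.split()]

def remove_stop_words(tokens):
    return [t for t in tokens if t and t not in STOP_WORDS]

def binarize_rules(rule_sets, all_rules):
    """Multi-label binarize: one 0/1 indicator per known business rule."""
    return [[1 if rule in rules else 0 for rule in all_rules]
            for rules in rule_sets]

def train_test_split(samples, labels, test_ratio=0.2, seed=42):
    """Shuffle deterministically, then cut off the last 20% as test data."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    train, test = idx[:cut], idx[cut:]
    return ([samples[i] for i in train], [labels[i] for i in train],
            [samples[i] for i in test], [labels[i] for i in test])
```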

The following Keras model was used to train and test the data.
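The exact architecture is not reproduced here; a minimal sketch of such a multi-label text classifier in Keras might look like the following (vocabulary size, sequence length, rule count, and layer widths are all illustrative assumptions):

```python
# Hypothetical sketch of a multi-label rule classifier in Keras.
# All sizes below are assumptions, not the real model's configuration.
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 10_000   # assumed tokenizer vocabulary size
MAX_LEN = 500         # assumed maximum contract length in tokens
NUM_RULES = 120       # assumed number of distinct business rules

model = keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 128),
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    # Sigmoid rather than softmax: each rule is an independent label,
    # and a contract can trigger many rules at once.
    layers.Dense(NUM_RULES, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
```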

After training, the model was able to extract rules from a given contract document with high accuracy.

Generating Contracts

After building a model from scratch to extract significant business rules, we proceeded to employ the Falcon-7B large language model available through Hugging Face.

Falcon-7B is a 7B parameters causal decoder-only model built by TII and trained on 1,500B tokens of RefinedWeb enhanced with curated corpora. It is made available under the Apache 2.0 license.

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = "tiiuae/falcon-7b"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
    "Generate a Cost-plus contract between Stark Industries and Hydra Inc.",
    max_length=2000,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

To run the Falcon-7b LLM, the following specifications were utilized:

  1. CPU memory: 32Gi
  2. Storage volume: 50Gi
  3. GPU: NVIDIA A100 (Multi-Instance GPU)

The model was paired with LangChain, a framework designed to simplify the creation of applications using LLMs.

Architecture of the Contract generation and analysis

The rules were generated from the prompts the user provided, and these rules were then fed into the LangChain framework along with input parameters like:

  1. Stakeholders — The parties involved in the agreement. This data is static and hence not provided to the Falcon model.
  2. Commodity — The product on which the contract is being agreed upon.
  3. Clause Language — The language in which the contract is drawn up [English/Korean]
  4. Contract Type — 4 Types of Contracts are supported:
    Fixed-price contracts
    Cost-plus contracts
    Time & materials contracts
    Unit pricing contracts
  5. Rule Set — This is derived from the prompt given by the user and the NLP model trained earlier. The specificity of the user’s agreement details is used to derive the business rules.
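As a sketch, the parameters above can be assembled into a prompt before being handed to the model (the template wording, function name, and parameter names are illustrative assumptions, not the production prompt):

```python
# Hypothetical prompt assembly from the input parameters listed above.
PROMPT_TEMPLATE = (
    "Generate a {contract_type} contract for the commodity '{commodity}' "
    "in {clause_language}, honouring the following business rules:\n{rules}"
)

def build_prompt(contract_type, commodity, clause_language, rule_set):
    # Stakeholders are static data and injected outside the model,
    # as noted above, so they are not part of this prompt.
    rules = "\n".join(f"- {name} = {value}"
                      for name, value in rule_set.items())
    return PROMPT_TEMPLATE.format(
        contract_type=contract_type,
        commodity=commodity,
        clause_language=clause_language,
        rules=rules,
    )
```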

pipeline = transformers.pipeline(
    "text-generation",  # task
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_length=5000,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

Analyzing and Comparing Documents

After generating the document from the model, the user might want to inject more 'rules' or highlight the existing ones in the document's text. Using the Word add-in, we highlighted the text and gave the user the ability to insert more prompts to generate more precise outputs.

The ability to compare two documents was also made possible by feeding the same rules as input to the Hugging Face model and then displaying the differences. This produced outputs similar to a JSON diff, which were used to highlight major differences.
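The rule-level comparison can be sketched as follows (the rule names and the exact diff shape are illustrative assumptions):

```python
# Hypothetical rule-level diff between two extracted rule sets,
# similar in spirit to a JSON diff.
def diff_rules(old, new):
    """Return the added, removed, and changed rules between two rule dicts."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {k: (old[k], new[k])
               for k in old.keys() & new.keys()
               if old[k] != new[k]}
    return {"added": added, "removed": removed, "changed": changed}
```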

Conclusion

Generative AI large language models can go a long way toward removing the tedious parts of writing. However, like any other automated tool, they require manual intervention to be used efficiently. Skilled lawyers and writers harnessing the power of such software can open doors to what we think is possible. It's not about which professionals will lose their jobs in this emerging era; the future will be shaped by the divide between professionals who utilize artificial intelligence and those who don't.
