This project is no longer accepting applications. Subscribe to our newsletter to be notified of new projects!
Generate synthetic training data using an LLM API to fine-tune an SBERT-based embedding model to cluster similar job titles, evaluating the improvement in the embeddings' performance for retrieval.
Sentence embedding models have a wide variety of use cases across the Natural Language Processing (NLP) landscape, such as powering vector search over unstructured text for semantically similar entries, or serving as components within larger models for tasks such as regression or classification on free text.
During this project, you will devise a prompt to generate a synthetic dataset for fine-tuning a pretrained sentence embedding model, addressing deficiencies in the domain-specific use case of embedding job titles. You will then use this dataset to fine-tune your pretrained model and evaluate the improvement in performance on the vector search task. Finally, you will deploy your fine-tuned model within a Streamlit web app, allowing you to easily show your model to anyone.
This project will help build your familiarity with basic prompting techniques, querying LLMs and parsing their outputs through an API, and popular Machine Learning and NLP frameworks (such as Hugging Face and Sentence-Transformers). It also provides a compact end-to-end experience, from initial prototyping to dataset gathering, training and evaluation, and finally deployment, covering the core parts of common workflows for a Data Scientist or ML Engineer.
Introductions between the Build Fellow and students. Outline of the project's structure and motivation, with an opportunity to ask questions. Initial setup of the Python environment and tooling.
Learn the basics of how to prompt LLMs to generate useful data that models can learn from: which platform(s)/models to use; how to interact with their API(s); how to reliably parse the output; and how to batch multiple data points per prompt.
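As an illustration of the batching and parsing steps above, here is a minimal sketch. The prompt wording, the pair schema (`title_a`/`title_b`), and the commented API call are all assumptions for illustration, not a prescribed approach; only the parsing logic is runnable as-is.

```python
import json

# Hypothetical prompt that asks for several examples per call ("batching"),
# with a strict JSON output format so the response can be parsed reliably.
PROMPT = (
    "Generate 5 pairs of different job titles that refer to the same role. "
    'Respond ONLY with a JSON list: [{"title_a": "...", "title_b": "..."}, ...]'
)

def parse_pairs(raw: str) -> list[tuple[str, str]]:
    """Parse the model's raw text into (title_a, title_b) pairs.

    Tolerates extra prose around the JSON by slicing from the first '['
    to the last ']'; raises ValueError if no JSON array is found.
    """
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end == -1:
        raise ValueError("no JSON array found in model output")
    data = json.loads(raw[start : end + 1])
    return [(item["title_a"], item["title_b"]) for item in data]

# The API call itself depends on your chosen provider; with the openai client
# it might look like (assumed, not tested here):
# raw = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": PROMPT}],
# ).choices[0].message.content

# Example with a mock response, including prose the parser must skip:
mock = 'Sure! Here you go:\n[{"title_a": "Software Engineer", "title_b": "Software Developer"}]'
print(parse_pairs(mock))  # [('Software Engineer', 'Software Developer')]
```

Asking the model for strict JSON and then slicing out the array is a simple way to make parsing robust to the conversational preamble many models add.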
Addressing any questions/issues that arise from generating a useful synthetic dataset. Opportunity to explore and compare different models/prompts for those who have finished.
Cover the basics of sentence-embedding transformer models (how they work and how they're typically used). Interact with the sentence-transformers and Hugging Face frameworks to download and instantiate a pretrained embedding model. Understand the basics of fine-tuning and typical 'foundation model' based workflows for model development. Discuss and decide upon fine-tuning approaches and loss functions for the project use case, then implement a training script using the generated dataset from previous sessions.
Continue with ongoing fine-tuning work. Address any questions/issues that you are facing with the training process. Opportunity to explore and contrast other training methodologies against a validation set.
Evaluate the performance of the fine-tuned model vs the baseline/pretrained model in the vector search/retrieval context. Understand and implement the appropriate metrics for this use case. Organise code and results in a Jupyter/Colab notebook.
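Two standard retrieval metrics you might implement for this comparison are recall@k and mean reciprocal rank (MRR); the implementations below are a plain-Python sketch (sentence-transformers also ships an `InformationRetrievalEvaluator` that reports similar metrics). Each query is assumed to have one relevant item.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of queries whose relevant item appears in the top-k results."""
    hits = sum(1 for ranked, rel in zip(ranked_ids, relevant_ids) if rel in ranked[:k])
    return hits / len(relevant_ids)

def mean_reciprocal_rank(ranked_ids, relevant_ids):
    """Average of 1/rank of the relevant result (0 if it was not retrieved)."""
    total = 0.0
    for ranked, rel in zip(ranked_ids, relevant_ids):
        if rel in ranked:
            total += 1.0 / (ranked.index(rel) + 1)
    return total / len(relevant_ids)

# Toy example: two queries, each with ranked result IDs and one relevant ID.
ranked = [["a", "b", "c"], ["x", "y", "z"]]
relevant = ["b", "z"]
print(recall_at_k(ranked, relevant, k=2))      # 0.5  (only query 1 hits in top-2)
print(mean_reciprocal_rank(ranked, relevant))  # (1/2 + 1/3) / 2 ≈ 0.4167
```

Computing these for both the baseline and fine-tuned models on a held-out set of titles gives the before/after comparison the notebook should present.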
Build and deploy a Streamlit web app that hosts an instance of your fine-tuned model, along with a vector database of test titles to demonstrate model inference and search.
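A minimal sketch of such an app, assuming `streamlit` and `sentence-transformers` are installed and that a fine-tuned model was saved locally (the path `job-title-embedder` and the in-memory "database" of test titles are illustrative placeholders; a real vector database could stand in for the plain array). Run with `streamlit run app.py`.

```python
# app.py
def top_matches(scores, titles, k=5):
    """Return the k (title, score) pairs with the highest similarity scores."""
    ranked = sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

if __name__ == "__main__":  # executed when launched via `streamlit run app.py`
    import streamlit as st
    from sentence_transformers import SentenceTransformer

    @st.cache_resource
    def load_model():
        return SentenceTransformer("job-title-embedder")  # path from training step

    # Stand-in vector "database" of test titles, embedded once at startup.
    TITLES = ["Software Engineer", "Data Scientist", "HR Manager", "Product Manager"]
    model = load_model()
    index = model.encode(TITLES, normalize_embeddings=True)

    st.title("Job Title Vector Search")
    query = st.text_input("Enter a job title")
    if query:
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = index @ q  # cosine similarity, since embeddings are normalized
        for title, score in top_matches(list(scores), TITLES):
            st.write(f"{title} ({score:.3f})")
```

Normalizing embeddings at encode time lets a plain dot product serve as cosine similarity, which keeps the search step to a single matrix-vector multiply.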
Present and receive feedback on your approach, evaluation results and final web app.
By the end of the 8-week project, you will have fine-tuned an embedding model and deployed it to a functional web app, allowing you to showcase the model and make live inferences for vector search. You will also have a GitHub repo containing your code for both generating the dataset and training the model, along with a Jupyter/Colab notebook giving an evaluation of the model's performance (with appropriate plots/metrics), which you will be able to present alongside the deployed web app.
James is a Data Science Build Fellow at Open Avenues, where he works with students leading projects in Data Science. James is a Data Scientist at AdeptID, where he focuses on designing, implementing and evaluating prototype models, analyzing and interpreting datasets and building data pipelines. He holds a Bachelor’s degree in Computer Science with Mathematics from the University of Cambridge.