
Fine-Tune a Job Title Embedding Model using Synthetic Training Data
James Alner

Generate synthetic training data using an LLM API to fine-tune an SBERT-based embedding model that clusters similar job titles, and evaluate the improvement in the embedding's retrieval performance.

Thursdays at 2:00 P.M. ET / 11:00 A.M. PT
8 weeks, 2-3 hours per week
No experience required

Description

Sentence embedding models have a wide variety of use cases across the Natural Language Processing (NLP) landscape, such as facilitating vector search of unstructured text for semantically similar entries, or sitting inside a larger model to enable tasks such as regression or classification on free text.
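
As a small illustration of the vector-search use case (not part of the project materials), the sketch below embeds a handful of job titles with a pretrained Sentence-Transformers model and ranks them against a query by cosine similarity; the model name and example titles are placeholder choices.

```python
# Sketch: semantic search over job titles with a pretrained sentence-embedding model.
# The model name and example titles are illustrative placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any pretrained SBERT-style model

job_titles = [
    "Senior Software Engineer",
    "Registered Nurse",
    "Machine Learning Engineer",
    "Staff Nurse",
]
query = "ML Engineer"

# Encode the corpus and the query into dense vectors.
corpus_emb = model.encode(job_titles, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

# Rank the job titles by cosine similarity to the query.
scores = util.cos_sim(query_emb, corpus_emb)[0]
for title, score in sorted(zip(job_titles, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:.3f}  {title}")
```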

During this project, you will devise a prompt for generating a synthetic dataset with which to fine-tune a pretrained sentence embedding model, addressing its deficiencies in the domain-specific use case of embedding job titles. You will then use this dataset to fine-tune the pretrained model and evaluate the improvement in performance on the vector search task. Finally, you will deploy your fine-tuned model within a Streamlit web app, allowing you to easily show your model to anyone.
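
One possible shape for the synthetic-data step is sketched below, assuming the OpenAI Python client; the provider, model name, prompt wording, and JSON format are illustrative assumptions rather than the project's prescribed setup.

```python
# Sketch: generating synthetic job-title pairs with an LLM API.
# Assumes the OpenAI Python client and an OPENAI_API_KEY in the environment;
# the prompt, model name, and JSON schema are illustrative choices only.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Generate 5 groups of job titles. Each group should contain 3 different titles "
    "that refer to essentially the same role (e.g. 'ML Engineer' and "
    "'Machine Learning Engineer'). Respond as a JSON object with a key 'groups' "
    "whose value is a list of lists of strings."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": PROMPT}],
    response_format={"type": "json_object"},
)

groups = json.loads(response.choices[0].message.content)["groups"]

# Turn each group into (anchor, positive) training pairs for fine-tuning.
pairs = [(a, b) for group in groups for a in group for b in group if a != b]
print(f"{len(pairs)} synthetic pairs, e.g. {pairs[:2]}")
```

Whichever provider you use, the key pieces are the same: a prompt that specifies the structure of the output, and a parsing step that turns the response into training pairs.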

This project will help build your familiarity with basic prompting techniques, with querying LLMs and parsing their outputs through an API, and with popular Machine Learning and NLP frameworks (such as Hugging Face and Sentence-Transformers). It also provides a compact end-to-end experience, from initial prototyping through dataset gathering, training and evaluation, and finally deployment, covering the core parts of common workflows for a Data Scientist or ML Engineer.

Session timeline

  • Applications open: December 1, 2024
  • Application deadline: January 15, 2025
  • Project start date: Week of February 3, 2025
  • Project end date: Week of

What you will learn

  • Develop an appropriate prompt and script to call an LLM API and parse its output to produce a synthetic dataset.
  • Fine-tune a pretrained sentence-embedding transformer model using an appropriate loss function (a minimal fine-tuning and evaluation sketch follows this list).
  • Evaluate the performance of a baseline embedding model versus a fine-tuned model in a retrieval context.
  • Deploy a Streamlit web app to showcase a model and allow inference.
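
To make the fine-tuning and evaluation items above concrete, here is a minimal sketch using the Sentence-Transformers library; the base model, the handful of example pairs, and the choice of MultipleNegativesRankingLoss are assumptions for illustration, not the project's required configuration.

```python
# Sketch: fine-tuning a pretrained SBERT model on synthetic (anchor, positive) pairs
# and checking retrieval quality afterwards. Model names, the example pairs, and the
# loss choice are illustrative assumptions.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Synthetic pairs of job titles that should embed close together.
pairs = [
    ("ML Engineer", "Machine Learning Engineer"),
    ("Staff Nurse", "Registered Nurse"),
    # ... many more generated pairs
]
train_examples = [InputExample(texts=[a, b]) for a, b in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

model = SentenceTransformer("all-MiniLM-L6-v2")

# MultipleNegativesRankingLoss treats the other in-batch positives as negatives,
# which suits (anchor, positive) pair data.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
    output_path="job-title-embedder",
)

# Retrieval-style evaluation: does each query's true match rank highly?
queries = {"q1": "ML Engineer"}
corpus = {"d1": "Machine Learning Engineer", "d2": "Registered Nurse"}
relevant_docs = {"q1": {"d1"}}
evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)
print(evaluator(model))
```

In the actual project you would train on the full synthetic dataset and compare standard retrieval metrics (for example recall or MAP at k) between the baseline and the fine-tuned model.
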
Build Projects are 8-week experiences that operate on a rolling basis. Selected participants engage in weekly live workshops with a Build Fellow and 2-15 other students.

Project workshops

1. Introduction and Environment Setup
2. LLMs for Synthetic Data Generation
3. LLMs for Synthetic Data Generation (continued)
4. Fine-Tuning Sentence Embedding Models
5. Fine-Tuning Sentence Embedding Models (continued)
6. Evaluating Fine-Tuned Model Performance
7. Deploying a Streamlit Web App
8. Presentation and Wrap-up

Project deliverables

By the end of the 8-week project, you will have fine-tuned an embedding model and deployed it to a functional web app, allowing you to showcase the model and make live inferences for vector search. You will also have a GitHub repo containing your code for both generating the dataset and training the model, along with a Jupyter/Colab notebook evaluating the model's performance (with appropriate plots/metrics), which you will be able to present alongside the deployed web app.
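
To give a sense of what the deployed deliverable might look like, below is a minimal Streamlit sketch that loads a fine-tuned model and runs vector search over a small list of job titles; the model path and the corpus are placeholders.

```python
# Sketch: a minimal Streamlit app for vector search with a fine-tuned embedding model.
# The model path and the job-title corpus are illustrative placeholders.
import streamlit as st
from sentence_transformers import SentenceTransformer, util

@st.cache_resource
def load_model():
    # Path to the fine-tuned model saved during training (placeholder).
    return SentenceTransformer("job-title-embedder")

st.title("Job Title Search")
model = load_model()

job_titles = ["Machine Learning Engineer", "Registered Nurse", "Data Analyst"]
corpus_emb = model.encode(job_titles, convert_to_tensor=True, normalize_embeddings=True)

query = st.text_input("Enter a job title")
if query:
    query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(query_emb, corpus_emb)[0]
    for title, score in sorted(zip(job_titles, scores.tolist()), key=lambda p: -p[1]):
        st.write(f"{score:.3f}  {title}")
```

Saved as app.py, the script runs locally with: streamlit run app.py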

About the expert

James is a Data Science Build Fellow at Open Avenues, where he works with students leading projects in Data Science. James is a Data Scientist at AdeptID, where he focuses on designing, implementing and evaluating prototype models, analyzing and interpreting datasets and building data pipelines. He holds a Bachelor’s degree in Computer Science with Mathematics from the University of Cambridge.

Visit James's LinkedIn