Introduction
Hi! My name’s James and I’m a Data Scientist from the UK. I’m originally from Dorset, a rural county on the south coast of England. I started studying Computer Science at secondary school when I was 14, alongside a dozen other subjects, but only really found a true passion for the subject during my A levels (UK qualifications taken from 16-18 prior to university). Prior to that point, I had imagined my future specializing in Maths, a subject I loved for its problem solving and logic. I discovered that Computer Science not only had these same elements but also enabled practical application to rapidly build and create things, in a way that is accessible to people with a wide variety of expertise. Still today, I find myself amazed by how simple it can be to do something like implement and deploy an advanced Machine Learning model with very little prior experience.
I would say my passion for Computer Science was sparked not only by the subject itself, but also by the shared enjoyment and collaboration with my peers on both personal and school projects. At that time, those projects were primarily in the realm of game design, which presents a huge variety of interesting problems and challenges, such as developing AI agents or procedurally generating levels. Those experiences inspired me to apply to study Computer Science at university, and I went on to study Computer Science with Mathematics at Cambridge.
I currently work as a Data Scientist at AdeptID – a small Boston-based startup building predictive talent-matching models and surrounding tooling – where I’ve been for about 9 months, and which is my first ‘proper’ job out of university. I’ve learnt a huge amount in my relatively short time so far at AdeptID, both in terms of technical and non-technical skills. My day-to-day work can vary quite a bit depending on the stage of the project I’m working on, but is generally covered by: feature engineering, implementing and testing prototype models; analyzing and interpreting datasets; building data pipelines; designing and implementing approaches to evaluate model performance; maintenance, bug-fixing and optimization of existing models or pipelines; documenting, presenting or communicating findings; keeping up-to-date with research and industry trends in NLP.
Data Science Fellow career options
Expertise in key Data Science skills, including Machine Learning and data processing, opens up several primary career options. These vary in how much engineering work they involve, and range from purely real-world applications and use cases to almost entirely theoretical, research-based work. Here’s an overview of some of the primary options:
Data Scientists use a variety of tools such as statistical models and techniques, data visualization tools and machine learning approaches to analyze datasets. They then communicate the insights gleaned to stakeholders to inform business decisions. (It’s worth noting that the lines here are somewhat blurry and Data Scientist roles, such as mine, may sometimes more closely resemble a typical job description for an ML Engineer.)
Machine Learning Engineers focus on the design, implementation, deployment and scaling of machine learning models, algorithms and pipelines for real-world applications.
Data Engineers focus on the collection, management and pre-processing/cleaning of data, building data pipelines or managing data warehouses, often at a large scale. This data may then be used/analyzed downstream by Data Scientists or Data Analysts.
AI Research Scientists work on the bleeding edge of AI and ML, developing new techniques or approaches. They may either work in an academic institution, or in R&D in large tech companies, working on theoretical advancements that might later be deployed for real-world use.
Data Science Fellow skills
What are the main hard skills you use on a daily basis in your current job?
Python is the language of choice for most data scientists, and NumPy and Pandas are critical libraries within Python for working with datasets in an efficient and organized way. I use these tools every day in my regular work, whether I’m working on a script for a production model or performing an EDA in a notebook. My first experience with all 3 of these tools was during a ‘Scientific Computing Practical Course’, set as holiday work in my first term at university.
The Hugging Face Transformers library is an extremely valuable tool for modern-day Natural Language Processing. Transformer-based models use an ‘attention mechanism’, allowing the context of surrounding words (both before and after) to influence the semantic meaning of any given word. This is an important advancement over previous recurrent architectures, which had issues with ‘long-range dependencies’ (meanings of words being altered by context that is far away in the text) and processed text sequentially and in a single direction, meaning a sequence would have to be processed twice (once in each direction) by a bi-directional RNN to try to capture the same contextual relationships between words. Transformer models do this in a more generalized (and generally more efficient/optimized) way. Many of our production models make use of this framework, which helps abstract and streamline necessary pre- and post-processing steps, in addition to easily allowing you to instantiate and test different pre-trained models from the HF Hub. I first became familiar with this framework after joining AdeptID, and would highly recommend the introductory Hugging Face NLP Course.
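As a rough illustration of the attention mechanism described above, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. It is a toy under stated assumptions, not how the Transformers library implements it: the token vectors and projection matrices are random placeholders rather than trained weights, and the names (`self_attention`, `W_q`, `W_k`, `W_v`) are purely illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over one sequence.

    X has shape (seq_len, d_model): one row per token.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Every token is scored against every other token, before and after it,
    # so context in both directions is handled in a single pass.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted mix of the value vectors of all tokens.
    return weights @ V, weights


# Random placeholder inputs (assumed sizes, not from any real model).
rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(weights.shape)  # (5, 5): one attention weight per pair of tokens
```

Each row of `weights` sums to 1 and says how much every other position contributes to that token’s new representation, regardless of how far apart the two positions are. That is the property that helps with the long-range dependencies mentioned above.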
Although it may not be a staple library for most Data Scientists/ML Engineers, DVC is an incredibly powerful tool, not only for version control of large files, but also for automating data and ML pipelines and managing ML experiments. I use DVC regularly to manage files and pipelines within our repository, and although I was familiar with its use for basic version control of files that are too large for Git, I’ve become much more familiar with its other capabilities within the last few months.
Data analysis skills are crucial in biotechnology and biomedical engineering, allowing for interpreting and extracting insights from complex datasets. Proficiency in statistical analysis, knowledge of data analysis techniques, and familiarity with software and statistical packages commonly used in these fields are essential for effective data analysis.
What are the main soft skills you use on a daily basis in your current job?
An aptitude for problem solving and critical thinking is vital for tackling the many technical challenges (both large and small) that arise throughout the day, whether that be in debugging a problematic bit of code or architecting a model for a complex ML problem. Effective problem solving is a hard skill to learn, especially in the general sense, but it grows with experience and exposure to different problems.
Being able to work effectively with others is essential for tackling larger projects, enabling coordination of different areas of the work and pooling of ideas when tackling a problem.
The ability to concisely and effectively communicate thoughts and findings to both technical and non-technical stakeholders is incredibly valuable. I’ve developed this skill in the technical context by contributing or presenting in DS team meetings, as well as in the non-technical context when presenting to the wider team.
James’s personal path
Tell us about your personal journey as a Data Science Fellow:
At university, I found a lot of my peers to be quite a bit more career-driven than I was; during my second year, I was barely finding the time to keep up with my schoolwork, whilst some other students were applying to 150+ summer internships! The general sentiment among students then was that it was an unwritten requirement to do a summer internship to secure a good job after graduating. During that period, I thought it was more likely that I would end up staying in academia and doing a PhD, instead of venturing straight into the world of work.
My experiences with those companies that I did apply to for internships (of which there were not too many) gave me valuable insights into which careers in Computer Science really interested me, and which ones did not. I came to realize that more engineering-leaning roles (building tools/applications, working on infrastructure etc.) were not the right fit for me. In fact, the position that most captured my interest throughout the interview process was for a trading role! This company had 3 separate streams for interns (Quant Trading, Quant Research and Software Engineering); I had applied to all 3 and I was surprisingly funneled into the trading stream, despite my Computer Science-leaning background. Unfortunately, I didn’t end up receiving an offer, but I think it served as a valuable lesson to not put yourself in a box and to explore different career options within or related to your field of study.
The Covid era was a difficult period for me, and although I ended up receiving a First in my third year, qualifying me to stay on for the integrated Master’s as I had originally planned, I ended up deciding that wasn’t the right choice for me at the time and graduated with my Bachelor’s to take some time out and focus on personal projects. I spent some time focusing on an edge computer vision start-up idea with some school friends of mine. Although the prototype we developed was promising, the project effectively ended up on pause due to people becoming too busy with school or work. However, the experience I gained learning TensorFlow, TensorRT and how to work with the Jetson SDK, as well as problem solving in the realm of deep learning and computer vision, was invaluable for future projects.
During that time, I was referred to AdeptID by a school friend of mine for a Data Scientist position. The interview process was very substantial, consisting of a take-home exercise, a presentation and discussion of my take-home with the DS team, individual technical interviews with each member of the DS team, and then behavioural interviews with other leads in the company. This resulted in the process dragging out over several months (also partly due to them seeking legal counsel on the process for hiring someone from the UK), and unfortunately resulted in me falling just outside the 1-year window from graduation to qualify for a J-1 visa once I finally received my offer. This was bitterly disappointing news, especially after such a long process, but I made sure to keep in contact in case the situation changed in the future.
Fast forward a year, and they reached back out to me with the news that they were now able to hire me to work remotely from the UK with the possibility of moving over to the US in the future, where I find myself today!
What would you tell your younger you regarding building your current career?
If I were able to give some advice to my younger self, I would emphasize the value of experimenting with projects or opportunities that jump out to you. Even if an opportunity doesn’t end up working out for you, or a project doesn’t end up quite how you wanted it to, the value is in the experience you had and the lessons you learned; if you’re continuing to expand your horizons, build bridges and grow, eventually you’ll end up where you need to be.
Final thoughts & tips
In conclusion, I would say the most valuable ability to hone as a budding Data Scientist is the ability to think critically and clearly; this skill is honed with every problem you tackle, and, with the speed the industry moves at, the ability to be adaptable to new technologies and frameworks becomes more valuable than sinking a ton of hours into becoming an expert in one specific tool. Exposing yourself to different projects and new areas of Computer Science, Data Science and Machine Learning will help build your general toolbox for problem solving, allowing you to quickly pick up new things. In this way, you’re teaching yourself how to think.
Resources to dig in more
Hugging Face NLP Course (Chapters 1-3)
Hugging Face’s Natural Language Processing course, giving an excellent introductory overview of transformer models.
How to build a ML Pipeline with DVC
A short YouTube tutorial from the DVC team on creating ML pipelines; excellent for reproducibility and automation.
PyTorch Image Classifier Tutorial
This is a tutorial for building an image classifier in PyTorch, which is a great introduction if you’re new to PyTorch or image classifiers (or both!).