This project is no longer accepting applications. Subscribe to our newsletter to be notified of new projects!
Build a data analytics pipeline that surfaces trending open-source projects on GitHub, guiding venture capital (VC) analysts toward smarter investment decisions.
As companies increasingly depend on vast volumes of data for strategic decisions, analysts face an overwhelming array of tables and data sources that were never tailored to their needs, leading to repetitive work and inconsistent analyses built on ad-hoc methodologies. In this Build Project, you'll take on the role of a Data Engineer on a mission to streamline this process for VC analysts: leveraging GitHub data to spotlight emerging open-source technology trends that could signal lucrative investment opportunities. Under the supervision of an experienced industry expert, you'll develop, configure, and maintain a system that periodically fetches, stores, and refines GitHub data, making it readily accessible and useful for data analysts. All of this unfolds in a setting that mirrors the real-world operations of a data-driven organization, giving you practical experience with the challenges of managing and analyzing large datasets for strategic decision-making.
Get to know all participants · Introduce the Build Project's goals and expectations · Set up your local dev environment and GitHub repo
Understand user requirements · Get a basic understanding of data pipeline challenges · Ingest data into Postgres · Open your first Pull Request (PR)
Data Ingestion. What challenges do we face when ingesting and staging data? Begin loading GitHub data into Postgres so that it is ready to use.
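To make the ingestion step concrete, here is a minimal sketch of a raw landing table and a bulk load in Postgres. The schema, table, columns, and file path are illustrative assumptions, not the project's prescribed design:

```sql
-- A hypothetical raw landing table for periodically fetched GitHub
-- repository metadata; all names here are illustrative.
CREATE SCHEMA IF NOT EXISTS raw;

CREATE TABLE IF NOT EXISTS raw.github_repos (
    repo_id     BIGINT,
    full_name   TEXT,
    language    TEXT,
    stargazers  INTEGER,
    forks       INTEGER,
    created_at  TIMESTAMPTZ,
    fetched_at  TIMESTAMPTZ DEFAULT now()   -- when this batch was loaded
);

-- Bulk-load one fetched batch from a CSV export (path is hypothetical).
-- From psql on a client machine, \copy does the same thing without
-- requiring server-side file access.
COPY raw.github_repos (repo_id, full_name, language, stargazers, forks, created_at)
FROM '/data/github_repos_batch.csv'
WITH (FORMAT csv, HEADER true);
```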
Create a Staging Layer. What common data cleaning needs to be done to our source data? Address initial quality issues, and in the process create your very first dbt models!
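As a first taste of the staging layer, a dbt staging model over the raw table might rename, cast, and lightly clean columns. This sketch assumes a source named raw.github_repos declared in a dbt sources file; the cleanups shown are examples, not an exhaustive list:

```sql
-- models/staging/stg_github_repos.sql (hypothetical)
with source as (

    select * from {{ source('raw', 'github_repos') }}

)

select
    repo_id,
    lower(full_name)            as repo_full_name,   -- normalize casing
    nullif(trim(language), '')  as language,          -- empty strings -> null
    stargazers                  as stargazer_count,   -- clearer naming
    forks                       as fork_count,
    created_at,
    fetched_at
from source
where repo_id is not null                             -- drop keyless rows
```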
Build Core dbt Models. Utilize SQL transformation logic to refine data effectively. Create core dbt models using the staged data, aligning with requirements identified in Workshop 2.
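A core model could then consolidate the staged data to a useful grain, such as one row per repository per day. The model name and grain below are assumptions for illustration:

```sql
-- models/core/repo_daily_stats.sql (hypothetical)
with repos as (

    select * from {{ ref('stg_github_repos') }}

)

select
    repo_id,
    fetched_at::date      as snapshot_date,
    max(stargazer_count)  as stargazer_count,   -- latest count that day
    max(fork_count)       as fork_count
from repos
group by repo_id, fetched_at::date
```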
Apply Dimensional Modeling and Create Aggregated Models. Use Kimball's dimensional modeling techniques to restructure your core models. Simultaneously, develop aggregated models to support various analytical needs. Explore the benefits of fact and dimension tables, as well as precomputed aggregations.
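For instance, a Kimball-style split might pair a slim repository dimension with a daily-grain fact table that precomputes day-over-day star growth, exactly the kind of trend signal an analyst would scan for. All names here are hypothetical:

```sql
-- models/marts/fct_repo_stars_daily.sql (hypothetical)
-- One row per repository per day, with a precomputed growth measure;
-- descriptive attributes would live in a companion dim_repo dimension.
select
    repo_id,              -- foreign key to dim_repo
    snapshot_date,
    stargazer_count,
    stargazer_count
        - lag(stargazer_count) over (
              partition by repo_id
              order by snapshot_date
          )               as stars_gained       -- day-over-day growth
from {{ ref('repo_daily_stats') }}
```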
Refactor and Optimize Transformations. Improve pipeline performance and robustness. Explore the use of intermediate models and macros for better efficiency and maintainability. Refine your dimensional and aggregated models based on best practices.
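As one example of the kind of refactor this workshop covers, repeated window logic like the day-over-day calculation above can be pulled into a dbt macro. The macro name and signature here are assumptions:

```sql
-- macros/day_over_day.sql (hypothetical)
{% macro day_over_day(measure, partition_col, order_col) %}
    {{ measure }} - lag({{ measure }}) over (
        partition by {{ partition_col }}
        order by {{ order_col }}
    )
{% endmacro %}
```

A model would then call {{ day_over_day('stargazer_count', 'repo_id', 'snapshot_date') }} instead of repeating the window expression, keeping the transformation logic in one place.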
Visualize and Present Your Work. Reflect on the project journey and showcase achievements. Create charts to visualize your models, add visual elements to your GitHub repository, and prepare for the final presentation. Share insights and feedback on the project experience.
Edouard Sioufi is a Software Development Fellow at Open Avenues Foundation based in San Francisco, California.
He currently works as CTO/CPO at Big Little Robots and has been writing code since he was 13 years old. Originally from Lebanon, Ed holds advanced degrees in Computer Engineering. In his free time, he enjoys jazz music and reading philosophy.