About Us:
Our client is a VC-backed, high-growth startup building software that makes access to knowledge and support easy and instant for the franchising industry. The software searches all information stored across operating manuals, file management systems, training videos, chat groups, etc. and provides summarized answers to franchisee operational questions instantly. Our end goal, however, is to become the universal interface into all elements of our customers’ tech stack: enabling our customers to query their data, launch workflows, and run their operations using simple natural language from one place.
A critical element of our success is the ability both to bring data from customers onto our system (scraping) and to organize this data and convert it into useful information for downstream processes (ingestion). As a data engineer, you would own the ingestion process and the challenging problem of reliably converting a wide range of data formats into a canonical, well-organized standard. Moreover, you would own our orchestration pipeline for the entire data scraping/ingestion process, with an eye towards building a reliable and robust system that can scale effectively to hundreds or thousands of customer organizations.
Overview of Responsibilities:
Own and lead the implementation of ingesting scraped customer data effectively and reliably into a form suitable for use in a Retrieval-Augmented Generation (RAG) pipeline.
Own and upgrade our orchestration pipeline with an eye towards rapid and reliable refreshes of customer data and clear visibility into the setup for business purposes.
Collaborate with ML engineers to design and implement novel techniques to surface valuable information from user documentation using LLMs, multimodal models, etc.
Play an influential role in engineering discussions by lending technical expertise and contributing to planning on other projects.
Required skills and experience:
3+ years working as a data engineer or backend engineer with a heavy focus on data engineering/orchestration.
Strong experience with Python programming.
Experience working with SQL-based databases such as PostgreSQL.
Experience setting up and overseeing an orchestration pipeline (preferably Airflow).
Ability to make informed decisions trading off speed versus stability depending on business needs.
Desirable skills and experience:
1+ years of experience working in early-stage startups.
Experience working with messy customer data including PDFs, Excel sheets, scraped website information, etc.
Prior experience working with AWS and setting up Lambdas or Elastic Container Service clusters.
History of working on other DevOps tasks such as setting up CI/CD pipelines, setting up infrastructure as code, security, etc.
Experience working in conjunction with information retrieval/machine learning teams.