Job Summary
A company is looking for a Data Infrastructure Engineer to design and maintain distributed data systems for AI model training.
Key Responsibilities
- Design and maintain distributed ingestion pipelines for structured and unstructured data
- Support preprocessing of unstructured assets for training pipelines and implement validation checks
- Architect pipelines across cloud storage and optimize large-scale processing with distributed frameworks
Required Qualifications
- 5+ years of experience in data engineering or distributed systems
- Strong programming skills in Python; familiarity with Scala/Java/C++ is a plus
- Proficiency with distributed frameworks such as Spark, Dask, or Ray
- Experience with cloud platforms (AWS/GCP/Azure) and storage systems
- Familiarity with workflow orchestration tools like Airflow or Prefect