
Data Science Intern

We are seeking a highly motivated Data Science Intern who is eager to work on real, production‑oriented data solutions, not sandbox projects. In this role, you’ll help ship data pipelines, analytical features, and models that support live business use cases. You’ll work end‑to‑end across a modern data stack—including Databricks/Spark, Snowflake, Python, PySpark, SQL, and Azure/AWS/GCP—from ingestion and transformation through modeling, visualization, and deployment.

Throughout the internship, you’ll collaborate closely with experienced data engineers and data scientists, receiving hands‑on mentorship, regular feedback, and access to senior technical leaders through talks and Q&A sessions. You’ll also participate in GenAI and LLM‑focused POCs (including embeddings, RAG, and vector databases), engage with a strong intern community through tech talks and demo days, and finish the program with a portfolio‑ready demo and readout tied to measurable KPIs. Strong performance during the internship may lead to consideration for a full‑time role after graduation, subject to business needs.

About the Role

This internship is ideal for someone passionate about building data-driven products and exploring end-to-end data pipelines, and eager to work with modern data engineering, analytics, and AI/ML technologies while contributing to real business use cases. You will collaborate with our data engineering and data science teams to build scalable data solutions, clean and transform datasets, create analytical models, and experiment with the latest technologies in cloud, big data, and generative AI.

Key Responsibilities

  • Work with the team to design, build, and optimize data pipelines on cloud platforms.
  • Perform data cleaning, transformation, and feature engineering on large datasets (see the PySpark sketch after this list).
  • Support development of ML models, experimentation, and evaluation metrics.
  • Assist in building dashboards, reporting, and data visualization assets.
  • Collaborate on ETL/ELT pipelines using modern data engineering tools.
  • Participate in POCs involving GenAI, LLMs, and MLOps frameworks.
  • Write clean, reusable, and efficient code using Python, PySpark, and SQL.
  • Document datasets, models, pipelines, and workflow processes.
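
For a flavor of the day-to-day work, here is a minimal PySpark sketch of the cleaning and feature-engineering step, assuming a local SparkSession and a small hypothetical orders dataset (all column names and values are illustrative, not from a real pipeline):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cleaning-example").getOrCreate()

    # Hypothetical raw data; in practice this would be read from cloud
    # storage or a Delta table rather than created inline.
    orders = spark.createDataFrame(
        [(1, 120.0, "2024-01-05"), (2, None, "2024-01-06"), (3, 87.5, "2024-01-07")],
        ["order_id", "amount", "order_ts"],
    )

    cleaned = (
        orders
        .dropDuplicates(["order_id"])                                   # remove duplicate records
        .withColumn("amount", F.coalesce(F.col("amount"), F.lit(0.0)))  # fill missing amounts
        .withColumn("order_ts", F.to_date("order_ts"))                  # normalize date strings
        .withColumn("order_dow", F.dayofweek("order_ts"))               # simple calendar feature
    )
    cleaned.show()

The same pattern scales from this toy example to large datasets, since Spark distributes each transformation across the cluster.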

Required Skills (Must‑Have)

  • Programming & Data – Python (Pandas, NumPy, scikit‑learn), SQL for data querying and optimization, PySpark for distributed data processing, understanding of relational databases (MySQL, PostgreSQL, SQL Server, etc.).
  • Data Engineering Tools – Databricks (Spark‑based data pipelines, notebooks, ML runtime), Snowflake (data warehousing; Snowpark exposure is a plus).
  • Cloud Platforms (any one or more) – Microsoft Azure (ADF, Synapse, Delta Lake, ML Studio), AWS (Glue, S3, Redshift, EMR, SageMaker), GCP (BigQuery, Dataflow, Dataproc, Vertex AI).
  • AI & ML Knowledge – Understanding of ML algorithms and model development workflow; basic knowledge of Large Language Models (LLMs) and GenAI concepts; experience with Jupyter Notebooks and ML experimentation tools.

Preferred / Good‑to‑Have Skills

  • Experience with Delta Lake and Lakehouse architecture.
  • Exposure to MLOps tools such as MLflow, Kubeflow, or Vertex AI Pipelines.
  • Knowledge of streaming platforms such as Kafka.
  • Basic understanding of data modeling, star schema, and ETL design patterns.
  • Exposure to vector databases (Pinecone, ChromaDB, FAISS) for GenAI projects.
  • Familiarity with prompt engineering, embeddings, or RAG architecture (a brief retrieval sketch follows this list).
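
For candidates new to the GenAI items above, here is a minimal sketch of the embedding-plus-retrieval pattern behind RAG, using FAISS with random vectors as stand-ins for real text embeddings (a production setup would generate embeddings with an actual model; everything here is illustrative):

    import faiss  # pip install faiss-cpu
    import numpy as np

    dim = 128  # embedding dimensionality (illustrative)
    rng = np.random.default_rng(0)

    # Random vectors stand in for document embeddings from a real model.
    doc_embeddings = rng.standard_normal((1000, dim)).astype("float32")

    index = faiss.IndexFlatL2(dim)  # exact L2 nearest-neighbor index
    index.add(doc_embeddings)

    # A query embedding would come from the same model; in RAG, the top-k
    # documents retrieved here are passed to an LLM as grounding context.
    query = rng.standard_normal((1, dim)).astype("float32")
    distances, ids = index.search(query, 5)
    print(ids[0])  # row indices of the 5 nearest documents

Managed vector databases such as Pinecone or ChromaDB expose the same add-then-search idea behind a service API.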

Education

  • Master’s candidates who have recently completed a degree in Computer Science, Data Science, AI/ML, or a related field.
  • Relevant certifications in cloud, data engineering, or AI/ML are a plus.

Work Authorization

Only U.S. citizens or U.S. permanent residents (Green Card holders) will be considered for this role.