Data Science Intern
Responsibilities / Core Projects / Tasks:
· Content & Data Quality for AI Search
- Audit and improve chunk quality across OpenSearch indexes
- Identify and fix missing or low‑quality metadata via scripts
· Embedding & Search Evaluation
- Evaluate and compare embedding models for retrieval quality
- Build small evaluation datasets (Q&A pairs) to benchmark results
· LLM‑Powered Enrichment Pipelines
- Assist in tuning pipelines that extract summaries, keywords, and tags
- Help monitor enrichment coverage and quality
· Observability & Reporting
- Create scripts or notebooks to report on index health and enrichment status
What the Intern Will Learn:
- How production RAG (Retrieval-Augmented Generation) systems are assembled, evaluated, and iterated
- Practical vector search fundamentals (indexing, chunking, metadata, and relevance tuning) in OpenSearch
- LLM-powered data enrichment pipelines at scale
- Exposure to graph concepts and applied data modeling (reviewing existing Cypher/Memgraph queries and contributing small improvements)
- ETL pipeline development for scientific datasets
- AI search evaluation methodology and benchmarking
- Working with large technical document collections (handbooks, magazines, products, videos)
Requirements:
- Senior undergraduate (final-year) or graduate student in Data Science
- Python proficiency (production-level scripting)
- Familiarity with data manipulation and access (pandas, JSON, REST APIs)
- Genuine interest in AI and language models
- Curiosity about materials science or scientific data – no domain expertise required
- Comfort working in a Linux/Docker environment
Nice to Have:
- Experience with vector databases (OpenSearch, Pinecone, Weaviate, Chroma, etc.)
- Exposure to LangChain, LLM APIs, or RAG patterns
- Experience with graph databases (Neo4j, Memgraph, Cypher)
- Familiarity with NLP libraries (Hugging Face transformers, spaCy)