
Ph.D. Research Intern (Vision and Physical AI)

Internship: Vision AI / VLM / Physical AI (Ph.D. Research Intern) 

 

Company: Centific 

Location: Seattle, WA (or Remote) 

Type: Full-time Internship 

 

Build the Future of Perception & Embodied Intelligence 

 

Are you pushing the frontier of computer vision, multimodal large models, and embodied/physical AI—and have the publications to show it? Join us to translate cutting-edge research into production systems that perceive, reason, and act in the real world. 

 

The Mission 

 

We are building state-of-the-art Vision AI across 2D/3D perception, egocentric/360° understanding, and multimodal reasoning. As a Ph.D. Research Intern, you will own high-leverage experiments from paper → prototype → deployable module in our platform. 

 

What You’ll Do 

Advance Visual Perception: Build and fine-tune models for detection, tracking, segmentation (2D/3D), pose & activity recognition, and scene understanding (incl. 360° and multi-view). 

Multimodal Reasoning with VLMs: Train/evaluate vision–language models (VLMs) for grounding, dense captioning, temporal QA, and tool use; design retrieval-augmented and agentic loops for perception-action tasks. 

Physical AI & Embodiment: Prototype perception-in-the-loop policies that close the gap from pixels to actions (simulation + real data). Integrate with planners and task graphs for manipulation, navigation, or safety workflows. 

Data & Evaluation at Scale: Curate datasets, author high-signal evaluation protocols/KPIs, and run ablations that make irreproducible results impossible. 

Systems & Deployment: Package research into reliable services on a modern stack (Kubernetes, Docker, Ray, FastAPI), with profiling, telemetry, and CI for reproducible science. 

Agentic Workflows: Orchestrate multi-agent pipelines (e.g., LangGraph-style graphs) that combine perception, reasoning, simulation, and code generation to self-check and self-correct. 

 

Example Problems You Might Tackle 

Long-horizon video understanding (events, activities, causality) from egocentric or 360° video. 

3D scene grounding: linking language queries to objects, affordances, and trajectories. 

Fast, privacy-preserving perception for on-device or edge inference. 

Robust multimodal evaluation: temporal consistency, open-set detection, uncertainty. 

Vision-conditioned policy evaluation in sim (Isaac/MuJoCo) with sim2real stress tests. 

 

Minimum Qualifications 

Ph.D. student in CS/EE/Robotics (or related), actively publishing in CV/ML/Robotics (e.g., CVPR/ICCV/ECCV, NeurIPS/ICML/ICLR, CoRL/RSS). 

Strong PyTorch (or JAX) and Python; comfort with CUDA profiling and mixed-precision training. 

Demonstrated research in computer vision and at least one of: VLMs (e.g., LLaVA-style, video-language models), embodied/physical AI, 3D perception. 

Proven ability to move from paper → code → ablation → result with rigorous experiment tracking. 

 

Preferred Qualifications 

Experience with video models (e.g., TimeSformer/MViT/VideoMAE), diffusion or 3D GS/NeRF pipelines, or SLAM/scene reconstruction. 

Prior work on multimodal grounding (referring expressions, spatial language, affordances) or temporal reasoning. 

Familiarity with ROS2, DeepStream/TAO, or edge inference optimizations (TensorRT, ONNX). 

Scalable training: Ray, distributed data loaders, sharded checkpoints. 

Strong software craft: testing, linting, profiling, containers, and reproducibility. 

Public code artifacts (GitHub) and first-author publications or strong open-source impact. 

 

Our Stack (you’ll touch a subset) 

Modeling: PyTorch, torchvision/lightning, Hugging Face, OpenMMLab, xFormers 

Perception: YOLO/Detectron/MMDet, SAM/Mask2Former, CLIP-style backbones, optical flow 

VLM / LMM: Vision encoders + LLMs, RAG for video, toolformer/agent loops 

3D / Sim: Open3D, PyTorch3D, Isaac/MuJoCo, COLMAP/SLAM, NeRF/3DGS 

Systems: Python, FastAPI, Ray, Kubernetes, Docker, Triton/TensorRT, Weights & Biases 

Pipelines: LangGraph-like orchestration, data versioning, artifact stores 

 

What Success Looks Like 

A publishable or open-sourced outcome (with company approval) or a production-ready module that measurably moves a product KPI (latency, accuracy, robustness). 

Clean, reproducible code with documented ablations and an evaluation report that a teammate can rerun end-to-end. 

A demo that clearly communicates capabilities, limits, and next steps. 

 

Why Centific 

Real impact: Your research ships—powering core features in our MVPs and products. 

Mentorship: Work closely with our Principal Architect and senior engineers/researchers. 

Velocity + Rigor: We balance top-tier research practices with pragmatic product focus. 

 

How to Apply 

 

Email your CV, publication list/Google Scholar, and GitHub (or artifacts/videos) to <Centific email> with the subject line: 

“Vision AI / VLM / Physical AI – Ph.D. Research Intern”

Optionally include a 1-page Problem Statement describing a research idea you’d like to execute in 12–16 weeks (scope, data, evals, milestones). 

 

Short Version (for social/quick post) 

 

Centific is hiring a Ph.D. Research Intern in Vision AI / VLM / Physical AI (Seattle or Remote). Work on 2D/3D perception, multimodal grounding, and embodied intelligence. Ship research to production with PyTorch, Ray, and Kubernetes. Strong publication record required. Email CV + pubs + GitHub to <Centific email> with subject: Vision AI / VLM / Physical AI – Ph.D. Research Intern