Artificial Intelligence (AI) and Large Language Models (LLMs) are transforming how software is tested and validated. As organizations adopt AI-driven applications, the role of a specialized AI QA Engineer is emerging as one of the most in-demand careers in the QA domain.
This 40-hour course provides a comprehensive and hands-on introduction to testing AI, ML, and LLM-based systems. Designed for freshers and QA professionals, it bridges traditional software testing with the new world of probabilistic AI systems.
This course is ideal for:
Freshers or QA professionals looking to upskill into AI QA.
Manual and automation testers who want to learn AI and LLM testing.
Professionals involved in testing AI, ML, or data-driven products.
No prior experience in AI or ML is required.
A basic understanding of software testing concepts is an advantage.
All necessary AI/ML testing foundations will be taught from scratch during the course.
AI, ML, and Deep Learning distinctions • Supervised, Unsupervised, and Reinforcement Learning • Common algorithms and their testing implications • Data lifecycle, model training/validation, overfitting/underfitting • MLOps basics from a QA lens • Traditional QA vs. AI QA mindset • Hands-on: build a simple classification model in Scikit-learn and pre-process data in Pandas
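To make the hands-on concrete, here is a minimal sketch of a Scikit-learn classification exercise with Pandas pre-processing. It uses the built-in Iris dataset as a stand-in; the dataset and model used in class may differ.

```python
# Minimal sketch: pre-process data in Pandas and train a simple classifier.
# The Iris dataset is only a stand-in for whatever dataset the course uses.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load data into a DataFrame and do basic pre-processing in Pandas.
iris = load_iris(as_frame=True)
df = iris.frame.dropna()          # drop missing rows, if any
X = df.drop(columns=["target"])
y = df["target"]

# Hold out a test set so evaluation never happens on training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```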
Testing ingestion, transformation, and output stages • Schema validation: nulls, outliers, duplicates • Record-level vs. aggregate validation • Testing joins, aggregations, and transformations • Tools: Great Expectations, Pandera, SQL-based checks • Synthetic test data and augmentation • Hands-on: validate sample datasets using Great Expectations and write SQL-based quality checks
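For a first taste of declarative data-quality checks, here is a minimal sketch using Pandera (one of the tools listed above). The columns and rules are hypothetical examples, not course material; Great Expectations, whose API varies across versions, is covered in the hands-on session itself.

```python
# Minimal sketch: declarative schema and quality checks with Pandera.
# Column names and rules are hypothetical examples.
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "order_id": pa.Column(int, unique=True),                        # no duplicate keys
    "amount":   pa.Column(float, pa.Check.ge(0)),                   # no negative amounts
    "country":  pa.Column(str, pa.Check.isin(["IN", "US", "UK"])),  # allowed values only
    "email":    pa.Column(str, nullable=False),                     # no nulls allowed
})

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount":   [10.5, 0.0, 99.9],
    "country":  ["IN", "US", "UK"],
    "email":    ["a@x.com", "b@x.com", "c@x.com"],
})

# Raises a SchemaError with a readable report if any check fails.
validated = schema.validate(df)
print("Validated rows:", len(validated))
```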
Key evaluation metrics: accuracy, precision, recall, F1 score, confusion matrix • Drift detection: data drift and concept drift • Bias, fairness, and ethical evaluation • Black-box vs. white-box testing for models • Hands-on: compute metrics manually in Python, analyze a confusion matrix, and run a bias/fairness test on a sample model
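The sketch below shows the "compute metrics manually" part of this hands-on: derive precision, recall, and F1 from a confusion matrix by hand and cross-check against scikit-learn. The labels are made up purely for illustration.

```python
# Minimal sketch: derive precision/recall/F1 from a confusion matrix by hand
# and cross-check against scikit-learn. Labels are made up for illustration.
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"Manual -> precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"sklearn -> precision={precision_score(y_true, y_pred):.2f} "
      f"recall={recall_score(y_true, y_pred):.2f} f1={f1_score(y_true, y_pred):.2f}")
```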
Challenges in LLM/RAG evaluation: hallucinations, grounding, relevance, factuality • What to test: prompts, outputs, grounding, consistency • Metrics: BLEU, ROUGE, BERTScore, faithfulness, toxicity, helpfulness • Local LLM setup with Ollama: installation, performance, prompt tuning • DeepEval framework: writing evaluators (StringMatchEvaluator, ContextualEval, ToxicityEval, etc.) • RAGAS evaluation: context precision/recall, faithfulness, answer correctness; integration with LangChain/LlamaIndex • Hands-on: run a local LLM via Ollama, write DeepEval tests, and conceptually explore RAGAS on a sample RAG pipeline
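Before layering DeepEval or RAGAS metrics on top, the smallest useful check is a smoke test against a locally served model. The sketch below assumes Ollama is running on its default port with the llama3 model pulled, and uses a naive keyword assertion as a toy stand-in for the DeepEval evaluators covered in the course.

```python
# Minimal sketch: smoke-test a local LLM served by Ollama.
# Assumes `ollama serve` is running on the default port (11434) and that
# `ollama pull llama3` has been run. The keyword assertion is a toy
# stand-in for DeepEval/RAGAS metrics, not a real evaluation.
import requests

def ask_ollama(prompt: str, model: str = "llama3") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

def test_capital_of_france():
    answer = ask_ollama("Answer in one word: what is the capital of France?")
    assert "paris" in answer.lower()   # naive grounding check

if __name__ == "__main__":
    test_capital_of_france()
    print("Local LLM smoke test passed.")
```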
Automating data-driven tests (Python + PyTest/Robot Framework + Pandas) • Testing AI/ML APIs (REST, GraphQL) with Postman and similar tools • CI/CD integration: GitHub Actions, Jenkins, model versioning with MLflow • Cloud / MLOps: AWS SageMaker, GCP Vertex AI, Azure ML workflows • Monitoring and logging: CloudWatch, ELK, Prometheus, Grafana; drift detection in production • Responsible AI: explainability (SHAP, LIME), adversarial robustness, bias audits • Hands-on: build a CI pipeline to run model tests, explore SHAP/LIME explainability, and discuss a real-world bias detection case study
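To make the CI/CD idea concrete, here is a hedged sketch of a PyTest "model quality gate" that a GitHub Actions or Jenkins job could run on every commit. The dataset, model, and 0.90 threshold are illustrative assumptions, not course specifics.

```python
# Minimal sketch: a PyTest model-quality gate for a CI pipeline
# (e.g. a GitHub Actions or Jenkins job running `pytest`).
# Dataset, model, and threshold are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

ACCURACY_THRESHOLD = 0.90   # fail the build below this score

def train_model():
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    return model, X_test, y_test

def test_model_accuracy_gate():
    model, X_test, y_test = train_model()
    acc = accuracy_score(y_test, model.predict(X_test))
    assert acc >= ACCURACY_THRESHOLD, (
        f"Accuracy {acc:.3f} is below the quality gate of {ACCURACY_THRESHOLD}"
    )
```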