I am a machine learning researcher and founder passionate about advancing generative AI, synthetic data generation, and ethical AI systems. As Co-Founder of tabularis.ai, I work on AI solutions in the following areas:

  • Highly realistic synthetic data for training efficient AI models, in low-resource or large-scale settings
  • Specialized AI models (e.g., a 23-language Multilingual Sentiment Analysis Model with 500,000+ monthly downloads)
  • Tabular data, Safety AI, DPO/GRPO, AI agents

I hold a PhD in Machine Learning & Computer Science from the University of Tuebingen, where my research focused on explainability, inference, and generative models for tabular and textual data. My interdisciplinary work spans:

  • Generative AI / Large Language Models
  • Synthetic Data (published a package that is used by Google (Kaggle), AWS, and many more; it is in the top 10% of all pip packages)

I’m actively seeking collaborators, interns, and thesis students passionate about pushing boundaries in LLMs, synthetic data, and tabular machine learning.

🎓 Internships & Thesis Projects

Work on cutting-edge problems like:

  • Synthetic Data Engineering: Build tools for synthetic data generation and build the best synthetic datasets.
  • Specialized AI models: Currently, we are looking for specialized LLMs and embedding models.

📬 Reach out via email or connect on LinkedIn.

News

  • [05/06/2026] We published tuetoken, a fast tokenizer backend for LLMs, up to 30× faster than tiktoken or Hugging Face tokenizers.
  • [06/02/2026] We released Faust-1, a 1.6B-parameter German language model trained from scratch, achieving competitive performance while remaining efficient enough to run on consumer hardware.
  • [02/01/2026] Our paper “Do Chatbot LLMs Talk Too Much? The YapBench Benchmark” was published on arXiv, introducing a benchmark for measuring verbosity and over-generation in chatbot LLMs. arXiv
  • [01/08/2024] I am co-organizing a NeurIPS 2024 workshop on tabular data learning. For more details, please visit: TBL workshop website
  • [14/06/2024] Our new paper on large-scale synthetic data generation using open-source LLMs was accepted to the Data-centric Machine Learning Research workshop at ICML 2024. arXiv

📅 Book a Free Consultation

Ready to elevate your project with AI/ML expertise? Schedule a 30-minute consultation to:

  • Discuss your project’s vision, objectives, and challenges.
  • Explore high-level deep-learning opportunities and roadmaps.
  • Determine if my services align with your needs.

Please note that this session provides an overview, not in-depth technical advice or solutions. If we decide to work together, we can create a tailored plan that addresses your project’s unique challenges and goals. To schedule a consultation or for any inquiries, please email me at vadim@tabularis.ai 📩.