
I Connected 3 MCP Servers to Claude & Built a No-Code Research Agent That Actually Cites Sources

Step 1: Check Memory Before Doing Anything

Here’s how we’ll start.

I want to research [TOPIC]. Execute this workflow:

### Step 1 - Check Existing Knowledge
Use the Knowledge Graph MCP (`search_nodes`) to check if there are already [TOPIC]-related entities or sources in memory.
- Query with key terms related to your research topic
- Return any existing nodes and relations (papers, books, articles, experts, concepts, institutions)

// more to come later

This seems obvious, but it’s solving two problems at once.

  • I’m not burning tool calls re-fetching data I already have.
  • I’m building on previous research instead of treating every query like a blank slate.

💡 Pro Tip: Use phrasing like “[TOPIC]-related” instead of exact matches. Academic papers use different terminology for the same concepts. Letting the model use a fuzzy search will catch related work that might use “machine learning” vs “artificial intelligence” vs “deep learning” for overlapping research.

That `search_nodes` tool is part of the Knowledge Graph MCP server. We’ll look at how it works (and how it helps us) in a bit. For now, think of this first part of the prompt as a cache check: on a hit we reuse what’s already stored, and only on a miss do we continue with the actual workflow.
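
If you ever want to script this cache check outside of Claude, the logic boils down to a few lines. Here’s a minimal sketch; `call_tool` stands in for whatever MCP client you use, and the exact `search_nodes` argument shape is an assumption, so check your server’s tool schema:

```python
# Minimal sketch of the "check memory first" step. `call_tool` is a placeholder
# for your MCP client; the argument shape for `search_nodes` is an assumption,
# so verify it against your server's tool schema.

def check_memory_first(topic: str, call_tool) -> dict | None:
    """Return cached graph results for the topic, or None on a cache miss."""
    # Fuzzy "[TOPIC]-related" query so near-synonyms still match
    result = call_tool("search_nodes", {"query": topic})
    if result.get("entities") or result.get("relations"):
        return result  # cache hit: build on existing research
    return None        # cache miss: continue with Steps 2-3

# Trivial stand-in client so the sketch runs without a real MCP server:
empty_graph = {"entities": [], "relations": []}
print(check_memory_first("CRISPR safety", lambda tool, args: empty_graph))
# -> None (cache miss, so we'd go on to the live-search tracks)
```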

Step 2 & 3: Domain-Aware Source Routing

Instead of searching everything (which is wasteful and prone to poisoning), I built three distinct research pipelines that route queries to specialized endpoints based on the topic being searched and what it needs:

  • Track A: Biomedical Gold Standard (biomcp) → PubMed, FDA data, clinical trials, pharma data, and more.
  • Track B: STEM Cutting Edge (simple-arxiv) → preprints 6–18 months ahead of journal publication.
  • Track C: General-Purpose Research (Bright Data) → anything else, plus broader context as a follow-up.

This conditional is a good demonstration of how we use the LLM itself as the glue between our three MCP servers. We first let the LLM (Sonnet 4, here) classify the topic, then tell it the right tools to use and how to use them, based on that classification.
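
To make that decision table concrete, here’s roughly what the classification amounts to if you wrote it by hand. In the actual workflow Sonnet does this itself from the prompt below; the keyword lists are an illustrative approximation, not something the model uses:

```python
# Rough, keyword-based approximation of the topic classification. In the real
# workflow the LLM does this from the prompt below; these keyword lists are
# illustrative only.

BIOMED_TERMS = {"medicine", "clinical", "gene", "drug", "pharma", "disease", "trial"}
STEM_TERMS = {"math", "physics", "astronomy", "computer science",
              "statistics", "machine learning", "economics"}

def pick_track(topic: str) -> str:
    t = topic.lower()
    if any(term in t for term in BIOMED_TERMS):
        return "Track A: biomcp (PubMed, clinical trials, FDA)"
    if any(term in t for term in STEM_TERMS):
        return "Track B: simple-arxiv (preprints)"
    return "Track C: Bright Data (general web research)"

print(pick_track("off-target effects of CRISPR gene editing"))  # Track A
print(pick_track("statistics of diffusion model training"))     # Track B
print(pick_track("EU AI Act compliance timeline"))              # Track C
```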

### Step 2 : Classify the topic
Before anything, check what kind of topic this is.

### STEP 3 : Find Latest Information
- Does the topic relate to medicine, genetics, pharmaceuticals, or clinical research? Then:
- Use biomcp MCP server to find recent peer-reviewed articles, clinical trials, and genetic variants
- Get the 5 most recent papers
- If any of the retrieved papers or journals does not have a brief summary or full content:
- Use Bright Data's scrape_as_markdown to extract full content from its PMID link
- If you didn’t find a PMID link, use Bright Data’s search_engine to query for the DOI. Then, use scrape_as_markdown to extract that page’s full content.

- Does the topic relate to mathematics, physics, astronomy, electrical engineering, computer science, quantitative biology, statistics, mathematical finance, or economics? Then:
- Use simple-arxiv MCP server to search for preprints
- Get the 5 most recent papers
- For each arXiv paper found:
- Use Bright Data's scrape_as_markdown to extract the content of its arXiv full-text HTML link

- Else (or if the previous two tracks didn't yield anything relevant):
- Use Bright Data's search_engine to find recent and authoritative sources on the topic from the last 6-12 months:
- Search academic databases, news sources, and authoritative websites
- Use Bright Data's scrape_as_markdown to extract content of the top 5 most relevant results
- Prioritize: peer-reviewed papers, reputable publications, expert analyses, official reports
- Collect these metadata: title, authors, publication date, URL, source type, publication venue

// more to come later

TL;DR: Explicitly state the “strength” and “risk” of each track — this primes your model to understand the credibility hierarchy. It’s a good-enough default to follow.

I’m explicitly trading off recency vs. credibility based on the research domain. Change this based on your specific needs, or if you want to fetch more than just the most recent paper (I get 5 at a time in the production version of this prompt).

I did it this way because the most relevant information varies by field. In biomedical research, there’s a clear hierarchy: systematic reviews beat RCTs beat cohort studies beat case reports. But for cutting-edge AI research, for example, the most important work often appears on arXiv months before journal publication.

The rest of the flow is pretty self-explanatory — I’m exploiting the domain-aware MCPs to get the full-text links to papers/journals, and then using Bright Data to fetch their full contents as markdown. Geoblocks, dynamic content, and CAPTCHAs are handled automatically.
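
If it helps to picture that fallback chain as code, here’s a minimal sketch. The tool names (`scrape_as_markdown`, `search_engine`) come straight from the prompt; `call_tool`, the argument names, and the result shapes are placeholders for whatever MCP client you’re using:

```python
# Sketch of the Step 3 fallback chain for papers that arrive without full content.
# `scrape_as_markdown` and `search_engine` are the Bright Data MCP tools named in
# the prompt; `call_tool`, the argument names, and the result shape are placeholders.

def fetch_full_text(paper: dict, call_tool) -> str | None:
    if paper.get("pmid_link"):
        # Best case: scrape the PubMed page directly
        return call_tool("scrape_as_markdown", {"url": paper["pmid_link"]})
    if paper.get("doi"):
        # No PMID link: search for the DOI, then scrape the top result
        hits = call_tool("search_engine", {"query": paper["doi"]})
        if hits:
            return call_tool("scrape_as_markdown", {"url": hits[0]["url"]})
    return None  # give up; keep whatever abstract/metadata we already have
```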

Step 4 — Deduplication: Keeping the Graph Clean

💡 If all you need is one-off searches with no persistent store or follow-up questions, this step is optional. You’re already done. Skip to Step 6, run the prompt, enjoy!

So we know we’re going to need a knowledge graph to model and store this raw data.

But if you’ve ever tried building a research assistant, you know duplicates are the silent killer.

  • The same paper can show up in arXiv and PubMed.
  • Authors’ names can be spelled differently across datasets.
  • A single work can generate multiple nodes if you don’t normalize.

Every duplicate fragments the graph and breaks the intelligence layer — you end up with “half-truths” scattered across nodes instead of a single connected entity. So before doing anything with the knowledge graph, we have to fix this.

Why Deduplication Matters

Imagine storing research for the Mona Lisa under three separate nodes:

  • source__mona_lisa
  • source__la_gioconda
  • source__paintings_by_da_vinci

Your assistant wouldn’t know they’re the same painting. Citations, relationships, and insights would all be split apart. That’s what happens if we don’t normalize.

The Canonical ID Strategy

To avoid this, I enforce deterministic, human-readable IDs. A few examples:

| Case | Canonical ID Example | Notes |
| :--- | :--- | :--- |
| PubMed journal article | `source__nejm_2023_gene_editing` | Venue + year + title words |
| arXiv preprint | `source__arxiv_2021_2101.12345` | Use official arXiv ID |
| arXiv → journal | Link with `precedes_publication` | Don’t collapse into one node |
| News article | `source__nyt_2024_crispr_breakthrough` | Outlet + year + headline words |
| Author | `expert__jane_doe_oncology` | Name + domain disambiguation |

These IDs are human-readable, so when I’m debugging why two papers didn’t link properly, I can actually read them and figure out what went wrong.

TL;DR: Every claim must resolve to one canonical source node, not spawn a new duplicate. These rules aren’t perfect, but they prevent most of the fragmentation.
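
For illustration, here’s roughly what the ID scheme from the table looks like as a helper function. The slug rules (lowercasing, stripping punctuation, underscore-joining) are my own assumptions layered on top of what the table specifies:

```python
import re

# Helpers implementing the ID scheme from the table above. The slug rules are
# assumptions; the venue + year + title-words structure is from the table.

def slug(text: str, max_words: int | None = None) -> str:
    words = re.sub(r"[^a-z0-9\s]", "", text.lower()).split()
    return "_".join(words[:max_words] if max_words else words)

def source_id(venue: str, year: int, title: str) -> str:
    # e.g. source__nejm_2023_gene_editing_advances  (venue + year + first title words)
    return f"source__{slug(venue)}_{year}_{slug(title, max_words=3)}"

def arxiv_source_id(year: int, arxiv_id: str) -> str:
    # e.g. source__arxiv_2021_2101.12345  (keep the official arXiv ID as-is)
    return f"source__arxiv_{year}_{arxiv_id}"

def expert_id(name: str, domain: str) -> str:
    # e.g. expert__jane_doe_oncology  (name + domain disambiguation)
    return f"expert__{slug(name)}_{slug(domain)}"

print(source_id("NEJM", 2023, "Gene Editing: Advances and Safety"))
# -> source__nejm_2023_gene_editing_advances
```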

Here’s the full prompt I use. This should give you enough to be able to design one based on your own needs.

### Step 4 - Deduplication & Canonical Matching

**1. Generate Deterministic IDs:**
For each entity type, create stable IDs using this format:
- **Sources**: `source__[venue_slug]_[year]_[first_3_words_of_title]`
- Academic: "source__nature_2024_attention_mechanisms_survey"
- arXiv: "source__arxiv_2024_neural_memory_architectures"
- News: "source__nyt_2024_ai_breakthrough_reported"
- **Concepts**: `concept__[concept_name_lowercase_underscored]`
- Example: "concept__transformer_attention_mechanisms"
- **Domains**: `domain__[field_name_lowercase_underscored]`
- Example: "domain__machine_learning" or "domain__theoretical_physics"
- **Experts**: `expert__[last_name]_[first_name]_[specialization]`
- Example: "expert__lecun_yann_deep_learning"

**2. Check for Existing Duplicates:**
Before creating any entity, use `search_nodes` to check if it already exists:
- Search by the generated canonical ID
- Search by exact title match
- Search by URL, DOI, or arXiv ID (for academic sources)
- Search by author name and year (for expert entities)
- **Special for arXiv**: Check if the same paper later appeared in a journal (cross-reference by authors + similar title)

**3. Handle Duplicates:**
- **If exact ID found**: Skip creation, instead add new observations to existing entity using `update_entity`
- **If similar title found** (manually assess if >80% similar): Update existing entity rather than create new one
- **For arXiv→Journal transitions**: Link as related papers rather than duplicating
- **If no match found**: Create new entity with the canonical ID
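
If you’d rather make the “similar title” check deterministic instead of leaving it to the model, here’s one way to sketch it. The `difflib` ratio and the 0.8 threshold are stand-ins for the prompt’s “>80% similar” judgment call; heavily reworded titles (like the Nature vs. arXiv example in Step 5B) would still need the fuzzier, LLM-side assessment:

```python
from difflib import SequenceMatcher

# Deterministic version of the duplicate check. The SequenceMatcher ratio and
# the 0.8 threshold approximate the prompt's ">80% similar" rule.

def title_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def resolve(new_paper: dict, existing_nodes: list[dict]) -> tuple[str, dict | None]:
    """Return ("update", node) for a match, or ("create", None) for a new entity."""
    for node in existing_nodes:
        if node["id"] == new_paper["id"]:
            return "update", node                    # exact canonical-ID match
        if title_similarity(node["title"], new_paper["title"]) > 0.8:
            return "update", node                    # near-duplicate title (e.g. arXiv v1 vs v2)
    return "create", None

existing = [{"id": "source__arxiv_2024_attention_mechanisms_survey",
             "title": "Attention Mechanisms in Transformers: A Survey"}]
new = {"id": "source__arxiv_2024_attention_mechanisms_survey_v2",
       "title": "Attention Mechanisms in Transformers: A Survey (v2)"}
print(resolve(new, existing))  # -> ('update', <the existing node>)
```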

Which brings us to the knowledge graph — let’s quickly give you a primer on that next.

TL;DR: Knowledge Graph

What it is: A local graph database provided by our MCP server that gives Claude persistent memory for research data.

Three core components:

  1. Entities are your core nodes — anything worth remembering gets an entity.
{
  "name": "source__nature_2024_crispr_safety",
  "entityType": "research_source",
  "observations": [
    "peer_review_status: peer_reviewed",
    "evidence_level: systematic_review",
    "credibility_level: high"
  ]
}

2. Observations are atomic facts attached to entities (as strings). Instead of dumping everything into a description field, each discrete piece of information gets its own observation. This makes the knowledge queryable — I can ask “show me all high-credibility sources on CRISPR safety” and get precise results.

3. Relations connect entities in meaningful ways:

{
  "from": "expert__doudna_jennifer_crispr",
  "to": "source__nature_2024_crispr_safety",
  "relationType": "authored_paper"
}

Why it matters:

Instead of treating every research query like a blank slate, Claude builds up searchable knowledge over time. You can ask “show me all high-credibility CRISPR papers” and get precise results based on stored metadata, not hallucinations.

The payoff:

Runs locally, gets smarter with use, enables complex follow-up questions like “which papers cite both study A and study B?”
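
To make that last payoff concrete, here’s what a “cites both A and B” follow-up amounts to once citations are stored as triples. The relation shape mirrors the example above; the specific IDs are made up for illustration:

```python
# "Which papers cite both study A and study B?" once citations are stored as triples.
# The relation shape matches the example above; the IDs are placeholders.

relations = [
    {"from": "source__arxiv_2024_paper_c", "to": "source__nature_2023_study_a", "relationType": "cites"},
    {"from": "source__arxiv_2024_paper_c", "to": "source__science_2022_study_b", "relationType": "cites"},
    {"from": "source__arxiv_2024_paper_d", "to": "source__nature_2023_study_a", "relationType": "cites"},
]

def papers_citing_both(relations: list[dict], a: str, b: str) -> set[str]:
    cites_a = {r["from"] for r in relations if r["relationType"] == "cites" and r["to"] == a}
    cites_b = {r["from"] for r in relations if r["relationType"] == "cites" and r["to"] == b}
    return cites_a & cites_b

print(papers_citing_both(relations, "source__nature_2023_study_a", "source__science_2022_study_b"))
# -> {'source__arxiv_2024_paper_c'}
```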

Got all that? Great. Let’s move on with the tutorial.

Step 5 — Knowledge Graph Schema

TL;DR: Once duplicates are under control, the next step is deciding what kinds of entities and relationships we actually store. This schema is what makes the assistant more than just a search tool.

The Core Graph (Minimal Viable Setup)

This can be as simple or complex as you want. I figure you only need three entity types to start:

  • Sources → papers, preprints, news articles
  • Concepts → ideas, techniques, diseases, compounds
  • Experts → authors, labs, organizations

And three relationships:

  • cites (source → source)
  • mentions (source → concept)
  • authored_by (source → expert)

That’s enough to ask useful follow-up questions like:

  • “Which papers cite both study A and study B?”
  • “Who are the top authors on compound Z?”
  • “Which concepts link protein X to disease Y?”

So your minimal viable prompt for this would look like:

### Step 5 - Create/Update Core Graph Entities

**A. Source Entity**
- `id`: "source__[venue_or_platform]_[year]_[short_title]"
- `name`: Human-readable title
- `entityType`: "research_source"
- `observations`:
- "source_type: {paper|preprint|news}"
- "title: ..."
- "authors: ..."
- "url: ..."

**B. Concept Entity**
- `id`: "concept__[normalized_concept_name]"
- `name`: "{Concept_Name}"
- `entityType`: "concept"
- `observations`:
- "basic_description: short explanation of this concept"

**C. Expert Entity**
- `id`: "expert__[lastname_firstname]"
- `name`: "{Expert_Name}"
- `entityType`: "expert"
- `observations`:
- "affiliation: {institution if known}"

### Step 5B - Relationships
Always use canonical IDs:
- `source__arxiv_2024_attention → cites → source__nature_2023_transformers`
- `source__arxiv_2024_attention → mentions → concept__attention_mechanisms`
- `source__arxiv_2024_attention → authored_by → expert__vaswani_ashish`

And you’re done. You could use this and be fine for most use cases (and you can move on to Step 6).
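
If you’re curious what this minimal schema turns into at the tool-call level, here’s a sketch of the payloads. I’m assuming a knowledge graph server that exposes batch `create_entities` and `create_relations` tools (the reference MCP memory server does); the paper metadata values are placeholders:

```python
# What the minimal Step 5 schema looks like as tool-call payloads. Assumes a
# knowledge graph server with batch `create_entities` / `create_relations` tools
# (the reference MCP memory server exposes both); metadata values are placeholders.

create_entities_payload = {
    "entities": [
        {"name": "source__arxiv_2024_attention",
         "entityType": "research_source",
         "observations": ["source_type: preprint",
                          "title: (paper title here)",
                          "authors: (author list here)",
                          "url: https://arxiv.org/abs/XXXX.XXXXX"]},
        {"name": "concept__attention_mechanisms",
         "entityType": "concept",
         "observations": ["basic_description: weighting scheme over input tokens"]},
        {"name": "expert__vaswani_ashish",
         "entityType": "expert",
         "observations": ["affiliation: (institution if known)"]},
    ]
}

create_relations_payload = {
    "relations": [
        {"from": "source__arxiv_2024_attention", "to": "concept__attention_mechanisms",
         "relationType": "mentions"},
        {"from": "source__arxiv_2024_attention", "to": "expert__vaswani_ashish",
         "relationType": "authored_by"},
    ]
}
```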

For a much more useful research assistant, however, you’d need something more. Here’s what I do:

[OPTIONAL] The Extended Graph For Better Analysis

For deeper research intelligence, I add many other optional fields and relationships:

  • Domains (e.g. medicine, physics) → for query routing
  • Observations on sources:
      • credibility_level (peer-reviewed, preprint, news)
      • research_phase (preclinical, clinical trial, published application)
      • controversy_level (low, medium, high)
  • Temporal links: precedes_publication (connect preprint → final journal version)
  • Expert influence: collaborated_with, affiliated_with
  • And more.

Here’s how I implement mine. Use this as a template to design your own.

### Step 5 - Create/Update Entities with Enhanced Structure

**A. Create/Update Source Entity:**
- `id`: Use canonical format: `source__[venue]_[year]_[title_keywords]`
- `name`: Human-readable: "{Topic}_{SubArea}_{ContentType}_{Year}"
- `entityType`: "research_source"
- `observations`:
- `"source_type: {pubmed_paper|preprint|clinical_trial|news_article|book|report|interview|patent}"`
- `"peer_review_status: {peer_reviewed|preprint_not_reviewed|under_review|clinical_protocol|unknown}"`
- `"methodology: {experimental_study|clinical_trial|theoretical_analysis|literature_review|case_study|survey|meta_analysis}"`
- `"research_phase: {basic_research|preclinical|phase1|phase2|phase3|approved|post_market}"`
- `"evidence_level: {systematic_review|rct|cohort_study|case_series|expert_opinion|preliminary}"`
- `"topic_focus: {specific_subtopic_or_theme}"`
- `"credibility_level: {high|medium|low}"`
- `"recency_relevance: {cutting_edge|current|historical|timeless}"`
- `"content_depth: {comprehensive|overview|specialized|introductory}"`
- `"research_maturity: {preliminary_results|established_findings|mature_field|speculative}"`
- `"citation_risk: {high_novelty_unvalidated|medium_risk|well_established|controversial}"`
- **For PubMed**: `"pmid: [PMID]"`, `"doi: [DOI]"`, `"journal_impact: [high|medium|low]"`
- **For arXiv**: `"arxiv_id: [arXiv:XXXX.XXXXX]"`, `"submission_date: [date]"`
- **For Clinical Trials**: `"nct_number: [NCT########]"`, `"trial_phase: [I|II|III|IV]"`, `"enrollment_size: [number]"`
- any other relevant observation about this source extracted from its content
- Standard metadata: `"title: ..."`, `"authors: ..."`, `"url: ..."`, etc.

**B. Create/Update Concept Entity:**
- `id`: Use canonical format: `concept__[normalized_concept_name]`
- `name`: "{Concept_Name}"
- `entityType`: "concept"
- `observations`:
- `"concept_maturity: {emerging|developing|established|mature|declining}"`
- `"theoretical_vs_practical: {purely_theoretical|mixed|highly_practical}"`
- `"complexity_level: {basic|intermediate|advanced|expert}"`
- `"research_velocity: {rapidly_evolving|steady_progress|slow_development|stagnant}"`
- `"controversy_level: {consensus|minor_debates|major_debates|highly_controversial}"`
- `"interdisciplinary_scope: {single_field|cross_disciplinary|broadly_applicable}"`
- any other relevant observation about this concept extracted from this source

**C. Create/Update Domain Entity:**
- `id`: Use canonical format: `domain__[normalized_field_name]`
- `name`: "{Field_Name}"
- `entityType`: "knowledge_domain"
- `observations`:
- `"field_maturity: {emerging|established|mature|transforming}"`
- `"arxiv_category: {cs.AI|math.NA|physics.ML|econ.TH|etc}"`
- `"research_activity_level: {very_high|high|medium|low}"`
- `"commercial_relevance: {high|medium|low|academic_only}"`
- `"breakthrough_potential: {revolutionary|significant|incremental}"`
- `"methodology_preferences: {theoretical|experimental|computational|empirical}"`
- any other relevant observation about this domain extracted from this source

**D. Create/Update Expert Entity:**
- `id`: Use canonical format: `expert__[lastname_firstname_specialization]`
- `name`: "{Expert_Name}_{Specialization}"
- `entityType`: "expert_authority"
- `observations`:
- `"expertise_level: {world_leading|highly_recognized|established|emerging}"`
- `"institutional_affiliation: {university|tech_company|research_lab|independent}"`
- `"publication_pattern: {frequent_arxiv|traditional_journals|mixed|books_primarily}"`
- `"influence_type: {theoretical_breakthroughs|practical_applications|both}"`
- `"collaboration_network: {highly_collaborative|selective|independent}"`
- any other relevant observation about this expert extracted from this source

**E. Create Enhanced Relationships:**
Always reference entities by their canonical IDs:
- `source__arxiv_2024_attention → explores_concept → concept__attention_mechanisms`
- `source__arxiv_2024_attention → precedes_publication → source__nature_2025_attention` (if same work published later)
- `expert__lecun_yann_deep_learning → frequently_publishes_on_arxiv → domain__machine_learning`
- `concept__transformer_attention → rapidly_evolving_in → domain__natural_language_processing`
- `source__arxiv_2024_paper_a → builds_upon → source__arxiv_2024_paper_b` (citation relationships)
- any other relevant relationship extracted from this source

### Step 5B - Enhanced Deduplication Guidelines

**arXiv-Specific Deduplication:**
- Same paper on arXiv and later in journal: Create relationship `precedes_publication` rather than duplicate
- Multiple arXiv versions (v1, v2, v3): Update existing entity with version notes
- Conference papers that become arXiv preprints: Link as `related_work`

**Cross-Platform Matching:**
Consider equivalent:
- "Transformer Attention Mechanisms" (Nature) vs "Attention in Transformers" (arXiv)
- Same author set + similar core concepts + similar timeframe

**Author Disambiguation:**
- "Y. LeCun" vs "Yann LeCun" vs "Yann A. LeCun" → same expert entity
- Use institutional affiliation + research area to resolve ambiguity

Why This Matters

My expanded design makes Claude store a Source for a research topic (say, “CRISPR”) like so:

{
  "type": "entity",
  "name": "CRISPR_Clinical_Trials_2025",
  "entityType": "source",
  "observations": [
    "source_type: pubmed_paper",
    "peer_review_status: peer_reviewed",
    "evidence_level: systematic_review",
    "research_phase: approved",
    "credibility_level: high",
    "citation_risk: well_established",
    "Research type: comprehensive clinical translation review",
    "Methodology: systematic analysis of CRISPR clinical trials and regulatory approvals",
    "Target application: various genetic diseases and cancer therapies",
    "Innovation level: established technology with expanding clinical applications",
    "Clinical relevance: extremely high - covers approved therapies and ongoing trials",
    "Research stage: clinical implementation and regulatory approval",
    "Technical focus: base editing, prime editing, delivery systems, safety profiles",
    "Model system: clinical trials across multiple disease areas",
    "PMID: 40160040",
    "DOI: 10.1017/erm.2024.32",
    "Journal: Expert Reviews in Molecular Medicine",
    "Publication date: 2025-03-31",
    "Authors: Cetin B, Erendor F, Eksi YE, Sanlioglu AD, Sanlioglu S",
    "Key findings: First FDA-approved CRISPR therapy (Casgevy); expanding clinical applications; ongoing challenges with delivery and off-target effects"
  ]
}

Thanks to our knowledge graph MCP server’s tools, I can now ask Claude “Show me all the high-risk citations in oncology research” and instantly retrieve sources explicitly tagged with citation_risk: high_novelty_unvalidated, rather than letting the LLM decide what counts as risky.
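
Under the hood, that kind of query is just a filter over the stored observation strings. Here’s a sketch, assuming entities shaped like the JSON example above; the little parsing helper is mine, not something the MCP server provides:

```python
# How a query like "high-credibility, peer-reviewed CRISPR papers" resolves
# against stored observations. Entities are shaped like the JSON example above;
# the "key: value" observation strings are parsed rather than interpreted by the LLM.

def observations_as_dict(entity: dict) -> dict:
    pairs = (obs.split(":", 1) for obs in entity["observations"] if ":" in obs)
    return {k.strip().lower(): v.strip().lower() for k, v in pairs}

def filter_sources(entities: list[dict], **required) -> list[str]:
    hits = []
    for e in entities:
        obs = observations_as_dict(e)
        if all(obs.get(k) == v for k, v in required.items()):
            hits.append(e["name"])
    return hits

entities = [{
    "name": "CRISPR_Clinical_Trials_2025",
    "entityType": "source",
    "observations": ["peer_review_status: peer_reviewed", "credibility_level: high",
                     "citation_risk: well_established"],
}]
print(filter_sources(entities, credibility_level="high", peer_review_status="peer_reviewed"))
# -> ['CRISPR_Clinical_Trials_2025']
```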

This lets me query intelligently, with no hallucinations:

  • “What are the most credible, peer-reviewed papers on CRISPR safety?” → pulls things explicitly marked as credibility_level: high and peer_review_status: peer_reviewed.
  • “Show me the earliest, still-preliminary results on CRISPR applications.” → surfaces evidence_level: preliminary.
  • “Which Phase 2 clinical trials are testing new gene therapies?” → resolves exactly to research_phase: phase2.
  • “Find me preprints with medium credibility on microbiome–immune interactions.” → maps to source_type: preprint and credibility_level: medium.

Similarly, for Concepts, interactions between gut bacteria and their human host would be depicted in my knowledge graph as:

{
  "type": "entity",
  "name": "concept__gut_microbiome_host_interactions",
  "entityType": "concept",
  "observations": [
    "concept_maturity: developing",
    "research_velocity: rapidly_evolving",
    "controversy_level: minor_debates",
    "theoretical_vs_practical: mixed",
    "definition: The dynamic relationships between gut microbial communities and the human host's physiology",
    "key_components: Bacteria, archaea, viruses, fungi, and their metabolic byproducts",
    "research_significance: Central to understanding digestion, immune modulation, and metabolic health",
    "scholarly_approach: Systems biology integrating microbiology, immunology, and metabolomics",
    "experimental_evidence: Metagenomic sequencing, germ-free mouse models, fecal microbiota transplantation",
    "historical_context: Recognition since early 20th century of gut bacteria's role in nutrient absorption",
    "biomedical_implications: Links to obesity, diabetes, autoimmune disorders, and neurological diseases",
    "contemporary_relevance: Microbiome-targeted interventions such as probiotics, prebiotics, and dietary therapies",
    "methodological_innovation: Shotgun metagenomics and machine learning for microbial community analysis",
    "clinical_impact: Emerging microbiome-based diagnostics and therapeutics influencing precision medicine"
  ]
}

And for Experts: noted Renaissance expert Ann Pizzorusso would be depicted in my knowledge graph like this:

{
  "type": "entity",
  "name": "Ann Pizzorusso",
  "entityType": "expert_authority",
  "observations": [
    "expertise_level: highly_recognized",
    "institutional_affiliation: independent",
    "publication_pattern: mixed",
    "influence_type: theoretical_breakthroughs",
    "collaboration_network: selective",
    "specialization: Geological analysis of Renaissance art",
    "notable_work: Tweeting Da Vinci (2014 book)",
    "recent_discovery: Identification of Lecco as Mona Lisa background location",
    "methodology: Combined geological and art historical analysis",
    "previous_research: Analysis of Virgin of the Rocks paintings at Louvre and National Gallery"
  ]
}

By using this schema for modeling relationships in my knowledge graph, I’m no longer staring at a jumble of unstructured notes. I can immediately see which relationships were present in my raw research data.

For example, after I researched “Mona Lisa”, here’s what relationship entities looked like in my graph:

{"type":"relation","from":"source__smithsonian_2023_mona_lisa_chemical_analysis","to":"concept__mona_lisa_scientific_analysis","relationType":"contributes_to"}
{"type":"relation","from":"source__smithsonian_2023_mona_lisa_chemical_analysis","to":"expert__victor_gonzalez_chemist","relationType":"authored_by"}
{"type":"relation","from":"source__guardian_2024_mona_lisa_location_discovery","to":"expert__ann_pizzorusso_geologist_art_historian","relationType":"features_research_by"}
{"type":"relation","from":"source__guardian_2024_mona_lisa_location_discovery","to":"concept__mona_lisa_scientific_analysis","relationType":"contributes_to"}
{"type":"relation","from":"expert__victor_gonzalez_chemist","to":"domain__art_conservation_science","relationType":"works_in"}
{"type":"relation","from":"expert__ann_pizzorusso_geologist_art_historian","to":"domain__art_conservation_science","relationType":"works_in"}
{"type":"relation","from":"concept__mona_lisa_scientific_analysis","to":"concept__renaissance_painting_techniques","relationType":"reveals_insights_about"}
{"type":"relation","from":"concept__renaissance_painting_techniques","to":"domain__art_conservation_science","relationType":"studied_within"}
{"type":"relation","from":"source__npr_2025_mona_lisa_louvre_move","to":"concept__mona_lisa_scientific_analysis","relationType":"contextualizes"}

What This Unlocks:

Now, the LLM has concrete data to tell me that:

  • The Smithsonian 2023 chemical analysis paper contributes to the broader concept of Mona Lisa scientific analysis, and it is authored by Victor Gonzalez (chemist).
  • The Guardian 2024 discovery piece features research by Ann Pizzorusso (geologist + art historian) and also contributes to the same Mona Lisa scientific analysis concept.
  • Both Gonzalez and Pizzorusso are explicitly tied to the art conservation science domain, showing their shared disciplinary context.
  • The Mona Lisa scientific analysis concept itself is shown to reveal insights about Renaissance painting techniques, which in turn are studied within the art conservation science domain.
  • Even coverage like the NPR 2025 story about moving the painting is captured — it contextualizes the ongoing scientific analysis, giving me cultural and historical framing.

None of these are hallucinations. All answers are grounded in citable fact.
