Sitemap
How to Complete the Sitemap Scraping Form
Here's what you need to do to complete the Sitemap loader form:

Step-by-Step Instructions:
1. Name (Required)
Enter a name for your data source in the text box
This helps you identify your sitemap content later
Example: "Company Website Content," "Blog Articles," or "Product Pages"
The placeholder shows "My website sitemap" as an example
⚠️ This field is required (red warning shown)
2. Site map URL (Required)
Enter the URL of the website's sitemap
Common sitemap URLs:
https://example.com/sitemap.xml
https://example.com/sitemap_index.xml
https://example.com/robots.txt (to find the sitemap location)
The placeholder shows https://example.com/sitemap.xml as an example
⚠️ This field is required (red warning shown)
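Behind the scenes, the loader fetches this URL and reads the list of page addresses from the sitemap XML. If you want to preview what it will find, here is a minimal sketch that does the same thing, assuming the Python requests library and the standard sitemap namespace; it is an illustration, not the platform's actual implementation:

```python
import requests
import xml.etree.ElementTree as ET

# Standard namespace used by sitemap.xml files
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def list_sitemap_urls(sitemap_url: str) -> list[str]:
    """Fetch a sitemap.xml and return the page URLs it lists."""
    response = requests.get(sitemap_url, timeout=30)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    # <loc> elements hold the page (or nested sitemap) addresses
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS)]

if __name__ == "__main__":
    for url in list_sitemap_urls("https://example.com/sitemap.xml"):
        print(url)
```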
3. Chunk Size
Set how many tokens or characters should be in each chunk
Default value is 1024 (recommended for most cases)
You can change this number if needed
This controls how the website content is divided for processing
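For a concrete sense of what the chunk size controls, the sketch below splits text into pieces of at most 1024 characters. The real loader may count tokens instead of characters and may overlap chunks, so treat this only as an illustration:

```python
def chunk_text(text: str, chunk_size: int = 1024) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Example: a 3,000-character page yields three chunks of 1024, 1024, and 952 characters.
```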
4. Metadata (Optional)
Add extra information about the website content
This helps the AI better understand the scraped content
Example: "E-commerce product pages from 2024" or "Technical blog posts"
You can leave this empty if you don't have specific metadata
5. Text Splitter (Optional)
Default is set to "markdown"
This controls how the website text will be divided for processing
Usually, you don't need to change this setting
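In rough terms, a markdown splitter breaks the scraped text at headings before the chunk size is applied, so related content stays together. The snippet below is a simplified sketch of that idea, not the exact splitter the platform uses:

```python
import re

def split_markdown(text: str) -> list[str]:
    """Split markdown text into sections, starting a new section at each heading."""
    sections = re.split(r"\n(?=#{1,6} )", text)  # split before lines beginning with '#'
    return [s.strip() for s in sections if s.strip()]

doc = "# Intro\nWelcome.\n## Details\nMore text here."
print(split_markdown(doc))  # ['# Intro\nWelcome.', '## Details\nMore text here.']
```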
6. Select Embedding Model
Default "OpenAI - Text Embedding 3 Small" is already selected
You can change this by clicking the dropdown if needed
This model processes and understands your website content
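The embedding step runs automatically after you save, so there is nothing extra to do here. For reference, calling the same model yourself with the OpenAI Python SDK would look roughly like this (the input text is a placeholder and the API key is read from your environment):

```python
from openai import OpenAI

client = OpenAI()  # uses the OPENAI_API_KEY environment variable

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Example chunk of scraped website text.",
)
vector = response.data[0].embedding  # list of floats representing the chunk
print(len(vector))  # 1536 dimensions for text-embedding-3-small
```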
7. Review Cost Warning
⚠️ Important: Check the yellow warning box
Data import costs 0.035 tokens per word
This is very cost-effective compared to other scraping methods
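You can estimate the cost before importing by multiplying the expected word count by the per-word rate from the warning box. For example (the page and word counts below are made-up numbers):

```python
pages = 50                 # hypothetical number of pages in the sitemap
words_per_page = 200       # hypothetical average word count per page
rate = 0.035               # tokens charged per word (from the warning box)

print(pages * words_per_page * rate)  # 350.0 tokens for the whole import
```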
8. Final Step
Click the "Save" button at the bottom
The system will read the sitemap and scrape all listed pages
The content will be ready for AI to work with
Key Features:
Automated Discovery: Uses the sitemap to find all pages automatically
Comprehensive Coverage: Scrapes all pages listed in the sitemap
Cost Effective: Very affordable at 0.035 tokens per word
Efficient Processing: Processes website content based on sitemap structure
How to Find a Website's Sitemap:
Try common URLs:
https://website.com/sitemap.xml
https://website.com/sitemap_index.xml
https://website.com/sitemap/
Check robots.txt:
Go to https://website.com/robots.txt
Look for a line like:
Sitemap: https://website.com/sitemap.xml
Search engines:
Google: search site:website.com filetype:xml
This can help find XML sitemaps
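These checks can also be scripted. The sketch below, which assumes the Python requests library, fetches a site's robots.txt and returns any Sitemap: lines it declares:

```python
import requests

def find_sitemaps(site: str) -> list[str]:
    """Return sitemap URLs declared in a site's robots.txt, if any."""
    response = requests.get(f"{site.rstrip('/')}/robots.txt", timeout=30)
    response.raise_for_status()
    return [
        line.split(":", 1)[1].strip()
        for line in response.text.splitlines()
        if line.lower().startswith("sitemap:")
    ]

print(find_sitemaps("https://website.com"))  # e.g. ['https://website.com/sitemap.xml']
```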
What Gets Scraped:
All pages listed in the sitemap
Page titles and headings
Main text content
Article content
Product descriptions
Any readable text on the pages
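To see roughly what "readable text" means for a single page, the sketch below pulls the title and visible text using requests and BeautifulSoup. The actual scraper may extract and clean content differently; this is only an approximation:

```python
import requests
from bs4 import BeautifulSoup

def extract_page_text(url: str) -> tuple[str, str]:
    """Return a page's title and its visible text, with scripts and styles removed."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop elements that are not readable text
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    text = " ".join(soup.get_text(separator=" ").split())
    return title, text

title, text = extract_page_text("https://example.com")
print(title)
print(text[:200])
```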
Perfect For:
Complete website content extraction
Blog post collections
E-commerce product catalogs
Documentation sites
News websites
Any website with a properly structured sitemap
Advantages of Using Sitemaps:
Efficient: Automatically finds all important pages
Complete: Doesn't miss pages that might not be linked
Organized: Follows the website's own structure
Fast: More efficient than crawling page by page
Simple Summary:
Give your sitemap scraping project a name
Enter the sitemap URL (usually ends with .xml)
Adjust settings if needed (or keep defaults)
Add metadata about the content (optional)
Click Save
Done!
The system will read the sitemap file, discover all the pages listed in it, scrape the content from each page, and convert it into searchable text that AI can understand and work with for analysis, questions, and insights!