Sitemap

How to Complete the Sitemap Scraping Form

The Sitemap loader form imports a website's content by reading its sitemap. Here's what you need to do to complete it:

Step-by-Step Instructions:

1. Name (Required)

  • Enter a name for your data source in the text box

  • This helps you identify your sitemap content later

  • Example: "Company Website Content," "Blog Articles," or "Product Pages"

  • The placeholder shows "My website sitemap" as an example

  • ⚠️ This field is required (red warning shown)

2. Site map URL (Required)

  • Enter the URL of the website's sitemap

  • Common sitemap URLs:

    • https://example.com/sitemap.xml

    • https://example.com/sitemap_index.xml

    • https://example.com/robots.txt (not a sitemap itself; open it to find the sitemap location)

  • The placeholder shows https://example.com/sitemap.xml as an example

  • ⚠️ This field is required (red warning shown)

3. Chunk Size

  • Set how many tokens or characters should be in each chunk

  • Default value is 1024 (recommended for most cases)

  • You can change this number if needed

  • This controls how large each piece is when the website content is divided for processing
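
To make the chunk size setting concrete, here is a minimal sketch of fixed-size chunking. It assumes the size is counted in characters and uses a hypothetical helper, split_into_chunks; the actual splitter in the product may count tokens and respect markdown structure instead.

```python
def split_into_chunks(text: str, chunk_size: int = 1024) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters.

    Illustrative only: a hypothetical helper, not the product's splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# 2,500 characters of placeholder page content
page_text = "word " * 500
chunks = split_into_chunks(page_text, chunk_size=1024)

# Two full chunks of 1024 characters, then the 452-character remainder
print(len(chunks), [len(c) for c in chunks])
```

A smaller chunk size produces more, shorter pieces; a larger one produces fewer, longer pieces.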

4. Metadata (Optional)

  • Add extra information about the website content

  • This helps the AI better understand the scraped content

  • Example: "E-commerce product pages from 2024" or "Technical blog posts"

  • You can leave this empty if you don't have specific metadata

5. Text Splitter (Optional)

  • Default is set to "markdown"

  • This controls the method used to split the website text into chunks

  • Usually, you don't need to change this setting

6. Select Embedding Model

  • Default "OpenAI - Text Embedding 3 Small" is already selected

  • You can change this by clicking the dropdown if needed

  • The embedding model converts your website content into numeric vectors so it can be searched by meaning

7. Review Cost Warning

  • ⚠️ Important: Check the yellow warning box

  • Data import costs 0.035 tokens per word

  • This is very cost-effective compared to other scraping methods
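
As a rough way to estimate the cost before you save, the sketch below multiplies a word count by the quoted rate of 0.035 tokens per word. The page count and words-per-page figures are made-up example inputs, and estimate_import_cost is a hypothetical helper, not part of the product.

```python
TOKENS_PER_WORD = 0.035  # rate quoted in the form's warning box


def estimate_import_cost(word_count: int) -> float:
    """Return the estimated import cost in tokens for a given word count."""
    if word_count < 0:
        raise ValueError("word_count must not be negative")
    return word_count * TOKENS_PER_WORD


# Example assumption: a site with 200 pages of about 800 words each
pages = 200
words_per_page = 800
cost = estimate_import_cost(pages * words_per_page)
print(f"Estimated cost: {cost:.0f} tokens")
```

With these example numbers, 160,000 words at 0.035 tokens per word comes to about 5,600 tokens.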

8. Final Step

  • Click the "Save" button at the bottom

  • The system will read the sitemap and scrape all listed pages

  • The content will be ready for AI to work with
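
To illustrate what "reading the sitemap" involves, here is a minimal sketch that extracts the page URLs from a sitemap's XML using Python's standard library. It is not the system's actual implementation; urls_from_sitemap and the sample XML are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> value found in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    # <loc> appears under <url> in a sitemap and under <sitemap> in an index
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample sitemap with two pages
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/first-post</loc></url>
</urlset>"""

page_urls = urls_from_sitemap(sample_sitemap)
print(page_urls)
```

Each URL found this way is a page that would then be fetched and scraped.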

Key Features:

  • Automated Discovery: Uses the sitemap to find all pages automatically

  • Comprehensive Coverage: Scrapes all pages listed in the sitemap

  • Cost Effective: Very affordable at 0.035 tokens per word

  • Efficient Processing: Processes website content based on sitemap structure

How to Find a Website's Sitemap:

  1. Try common URLs:

    • https://website.com/sitemap.xml

    • https://website.com/sitemap_index.xml

    • https://website.com/sitemap/

  2. Check robots.txt:

    • Go to https://website.com/robots.txt

    • Look for a line like: Sitemap: https://website.com/sitemap.xml

  3. Search engines:

    • Google: Search site:website.com filetype:xml

    • This can help find XML sitemaps
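
The robots.txt check above can also be done with a short script. This is a minimal sketch that scans robots.txt text for Sitemap lines; sitemaps_from_robots and the sample file contents are hypothetical, and the text is supplied inline rather than downloaded.

```python
def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs declared on 'Sitemap:' lines of a robots.txt file."""
    urls = []
    for line in robots_txt.splitlines():
        # Split at the first colon only, so the URL's own colons stay intact
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt contents
sample_robots = """User-agent: *
Disallow: /admin/
Sitemap: https://website.com/sitemap.xml
"""

found = sitemaps_from_robots(sample_robots)
print(found)
```

Python 3.8 and later also offer urllib.robotparser.RobotFileParser.site_maps(), which returns the same information after the file has been read.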

What Gets Scraped:

  • All pages listed in the sitemap

  • Page titles and headings

  • Main text content

  • Article content

  • Product descriptions

  • Any readable text on the pages

Perfect For:

  • Complete website content extraction

  • Blog post collections

  • E-commerce product catalogs

  • Documentation sites

  • News websites

  • Any website with a properly structured sitemap

Advantages of Using Sitemaps:

  • Efficient: Automatically finds all important pages

  • Complete: Includes pages listed in the sitemap even if no other page links to them

  • Organized: Follows the website's own structure

  • Fast: More efficient than crawling page by page

Simple Summary:

  1. Give your sitemap scraping project a name

  2. Enter the sitemap URL (usually ends with .xml)

  3. Adjust settings if needed (or keep defaults)

  4. Add metadata about the content (optional)

  5. Click Save

  6. Done!

The system will read the sitemap file, discover all the pages listed in it, and scrape the content from each page. It then converts that content into searchable text that AI can use for analysis, questions, and insights.
