Webpage

How to Complete the Webpage Scraping Form?

Looking at this Webpage loader form, here's what you need to do to complete it:

Step-by-Step Instructions:

1. Name (Required)

  • Enter a name for your data source in the text box

  • This helps you identify your webpage content later

  • Example: "Product Pages," "Article Collection," or "Documentation Pages"

  • The placeholder shows "Give a name" as an example

2. URLs (Required)

  • Enter one or more URLs separated by commas

  • Example: https://example.com, https://another-example.com

  • You can add multiple specific webpage URLs

  • ⚠️ This field is required (red warning shown)
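Since the field takes a comma-separated list, it helps to see how such input is typically split and validated. The sketch below is illustrative only (the loader's actual parsing is not documented here); `parse_url_field` is a hypothetical helper name.

```python
from urllib.parse import urlparse

def parse_url_field(raw: str) -> list[str]:
    """Split a comma-separated URL field and keep only well-formed http(s) URLs."""
    urls = []
    for part in raw.split(","):
        candidate = part.strip()
        parsed = urlparse(candidate)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls

print(parse_url_field("https://example.com, https://another-example.com"))
# ['https://example.com', 'https://another-example.com']
```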

3. Fetch Content from Page Sub-URLs?

  • Toggle switch: Turn ON or OFF

  • When ON: Also scrapes linked pages found on the main pages

  • When OFF: Only scrapes the exact URLs you provided

  • Useful for getting related content automatically
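To make the sub-URL behavior concrete: fetching sub-URLs means collecting the links (`<a href="...">`) found on each main page and scraping those too. A minimal sketch of link collection using only the Python standard library, assuming nothing about how the loader itself does it:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect absolute sub-URLs from the anchor tags on a page."""
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

collector = LinkCollector("https://example.com")
collector.feed('<a href="/docs">Docs</a> <a href="https://other.com/page">Other</a>')
print(collector.links)
# ['https://example.com/docs', 'https://other.com/page']
```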

4. Enable Pagination

  • Toggle switch: Turn ON or OFF

  • When ON: Follows pagination links (page 1, 2, 3, etc.)

  • When OFF: Only scrapes the specific pages you listed

  • Note: Only works with pagination systems where page numbers are in the URL as query parameters

  • Example: https://example.com/articles?page=1

5. Page Key (For Pagination)

  • The parameter name used for pagination

  • Common examples: page, p, pagenum

  • Example: If the URL is https://site.com/articles?page=2, enter page

  • Only needed if pagination is enabled

6. Start Pagination Page

  • The page number at which to start fetching

  • Usually starts at 1 or 0

  • Enter the first page number you want to scrape

7. End Pagination Page

  • The last page number to fetch or process

  • Enter the final page number you want to scrape

  • Avoid setting this too high, since every extra page adds to the import cost
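Steps 4-7 combine into a simple URL pattern: one URL per page, with the page key set as a query parameter. A sketch of the resulting URL list (the loader's internals are assumed, not documented; `paginated_urls` is a hypothetical helper):

```python
from urllib.parse import urlparse, urlunparse, urlencode, parse_qsl

def paginated_urls(base_url: str, page_key: str, start: int, end: int) -> list[str]:
    """Build one URL per page by setting the pagination query parameter."""
    parts = urlparse(base_url)
    urls = []
    for page in range(start, end + 1):
        query = dict(parse_qsl(parts.query))  # keep any existing parameters
        query[page_key] = str(page)
        urls.append(urlunparse(parts._replace(query=urlencode(query))))
    return urls

print(paginated_urls("https://blog.example.com/articles", "page", 1, 3))
# ['https://blog.example.com/articles?page=1',
#  'https://blog.example.com/articles?page=2',
#  'https://blog.example.com/articles?page=3']
```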

8. Chunk Size

  • Set how many tokens or characters should be in each chunk

  • Default value is 1024 (recommended for most cases)

  • This controls how the webpage content is divided for processing
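To illustrate what chunk size controls, here is a deliberately simplified character-based chunker. Real loaders usually count tokens rather than characters and respect word or sentence boundaries, so treat this only as a sketch of the idea:

```python
def chunk_text(text: str, chunk_size: int = 1024) -> list[str]:
    """Split text into fixed-size character chunks (simplified:
    no token counting, no word-boundary handling)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = chunk_text("x" * 2500, chunk_size=1024)
print([len(c) for c in chunks])  # [1024, 1024, 452]
```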

9. Metadata (Optional)

  • Add extra information about the webpages

  • This helps the AI better understand the scraped content

  • Example: "Product documentation from Q4 2024"

  • You can leave this empty if you don't have specific metadata

10. Text Splitter (Optional)

  • Default is set to "markdown"

  • This controls how the webpage text will be divided for processing

  • Usually, you don't need to change this setting
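A "markdown" splitter typically breaks text at headings rather than at arbitrary character positions, keeping each section intact. The form doesn't document the exact behavior, so the following is just one plausible interpretation:

```python
import re

def split_markdown(text: str) -> list[str]:
    """Split markdown text at top-level '#' headings -- one simple
    interpretation of a heading-aware 'markdown' splitter."""
    sections = re.split(r"(?m)^(?=# )", text)
    return [s for s in sections if s.strip()]

doc = "# Intro\nSome text.\n# Usage\nMore text."
print(len(split_markdown(doc)))  # 2
```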

11. Select Embedding Model

  • Default "OpenAI - Text Embedding 3 Small" is already selected

  • You can change this by clicking the dropdown if needed

  • This model processes and understands your webpage content

12. Review Cost Warning

  • ⚠️ Important: Check the yellow warning box

  • Data import costs 0.540 tokens per page

  • Consider the number of pages you're planning to scrape
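Since cost scales with page count, it is worth doing the multiplication before clicking Save. A quick estimate using the rate from the warning box (the helper name and shape are ours, not the platform's):

```python
COST_PER_PAGE = 0.540  # token cost per page, from the form's warning box

def estimate_cost(num_pages: int) -> float:
    """Rough import cost: pages scraped times the per-page token rate."""
    return num_pages * COST_PER_PAGE

# e.g. pagination from page 1 to page 5 of a single URL = 5 pages
print(estimate_cost(5))  # about 2.7 tokens
```

Remember that sub-URL fetching can multiply the page count well beyond the URLs you typed in.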

13. Final Step

  • Click the "Save" button at the bottom

  • The system will scrape the specified webpages

  • The content will be ready for AI to work with

Key Features:

  • Multiple URLs: Can scrape several specific webpages at once

  • Sub-URL Fetching: Option to follow links found on the main pages

  • Pagination Support: Can automatically follow paginated content

  • Flexible Configuration: Control exactly what gets scraped

Pagination Example:

If you want to scrape a blog with pagination:

  • URLs: https://blog.example.com/articles

  • Enable pagination: ON

  • Page key: page

  • Start page: 1

  • End page: 5

This will scrape: https://blog.example.com/articles?page=1, ?page=2, etc.

Simple Summary:

  1. Give your webpage collection a name

  2. Enter the URLs you want to scrape (separated by commas)

  3. Choose if you want to fetch sub-URLs (optional)

  4. Configure pagination if needed (optional)

  5. Adjust other settings if needed (or keep defaults)

  6. Add metadata (optional)

  7. Click Save

  8. Done!

The system will scrape the specified webpages and convert the content into searchable text that AI can understand and work with!
