1. Project Overview and Preparation
Modern web applications often need to fetch and parse content from other web pages. This article walks through how to implement this in a Vue project, covering the complete process from requesting an external link to parsing the page and displaying the results.
1.1 Functional Requirements Analysis
We need to implement the following core functions:
- Enter the target URL in the Vue app
- Fetch the target page content safely
- Parse the HTML content and extract the required information
- Display the parsing results in the interface
- Handle possible errors and exceptions
1.2 Technology stack selection
This project uses the following technologies:
- Vue 3 (Composition API)
- Axios (HTTP requests)
- DOMParser (HTML parsing)
- Element Plus (UI components)
- Optional: Puppeteer (for dynamically rendered pages)
1.3 Create a Vue project
npm init vue@latest page-parser
cd page-parser
npm install
npm install axios element-plus
2. Infrastructure construction
2.1 Project Structure Design
src/
├── components/
│ ├── # Control Panel
│ ├── # Content display
│ └── # Result Viewer
├── composables/
│ └── # Analytical logic
├── utils/
│ ├── # DOM operation tool
│ └── # Content sanitization
├──
└──
2.2 Configuring Element Plus
In the application entry file:
import { createApp } from 'vue'
import ElementPlus from 'element-plus'
import 'element-plus/dist/index.css'
import App from './App.vue'

const app = createApp(App)
app.use(ElementPlus)
app.mount('#app')
3. Implement page content acquisition
3.1 Limitations of direct front-end acquisition
Because of the browser's same-origin policy, fetching another site's content directly from the front end usually fails with a CORS error. The possible solutions are:
- Use a proxy server
- Fetch the content through a backend service
- Use browser extension permissions
- Enable CORS on the target website (only possible if you control it)
3.2 Implementing a proxy solution
3.2.1 Front-end request code
Create the usePageParser composable under composables/:
import { ref } from 'vue'
import axios from 'axios'

export default function usePageParser() {
  const htmlContent = ref('')
  const isLoading = ref(false)
  const error = ref(null)

  const fetchPage = async (url) => {
    isLoading.value = true
    error.value = null
    try {
      // Replace with your proxy endpoint in the actual project
      const proxyUrl = `/api/proxy?url=${encodeURIComponent(url)}`
      const response = await axios.get(proxyUrl)
      htmlContent.value = response.data
    } catch (err) {
      error.value = `Failed to get the page: ${err.message}`
      console.error('Error fetching page:', err)
    } finally {
      isLoading.value = false
    }
  }

  return { htmlContent, isLoading, error, fetchPage }
}
3.2.2 Backend Proxy Implementation (Example)
// Proxy server (Express)
const express = require('express')
const axios = require('axios')

const app = express()
const PORT = 3000

app.use(express.json())

app.get('/api/proxy', async (req, res) => {
  try {
    const { url } = req.query
    if (!url) {
      return res.status(400).json({ error: 'URL parameter is missing' })
    }
    const response = await axios.get(url, {
      headers: { 'User-Agent': 'Mozilla/5.0' }
    })
    res.send(response.data)
  } catch (error) {
    console.error('Proxy Error:', error)
    res.status(500).json({ error: 'Failed to get the target page' })
  }
})

app.listen(PORT, () => {
  console.log(`The proxy server runs on http://localhost:${PORT}`)
})
3.3 Handling dynamically rendered pages
SPAs and pages that load their content with JavaScript return little useful HTML to a plain HTTP request, so they need a solution that actually renders the page:
3.3.1 Using a Puppeteer Service
// Add a new endpoint to the proxy server
const puppeteer = require('puppeteer')

app.get('/api/proxy-render', async (req, res) => {
  const { url } = req.query
  if (!url) return res.status(400).json({ error: 'URL parameter is missing' })

  let browser
  try {
    browser = await puppeteer.launch()
    const page = await browser.newPage()
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 })
    // Wait for possible content to load
    await page.waitForSelector('body', { timeout: 5000 })
    const content = await page.content()
    res.send(content)
  } catch (error) {
    console.error('Puppeteer error:', error)
    res.status(500).json({ error: 'Failed to render the page' })
  } finally {
    // Always close the browser, even when rendering fails
    if (browser) await browser.close()
  }
})
3.3.2 Corresponding front-end changes
const fetchRenderedPage = async (url) => {
  isLoading.value = true
  try {
    const proxyUrl = `/api/proxy-render?url=${encodeURIComponent(url)}`
    const response = await axios.get(proxyUrl)
    htmlContent.value = response.data
  } catch (err) {
    error.value = `Failed to get rendered page: ${err.message}`
  } finally {
    isLoading.value = false
  }
}
4. Implementing Page Content Parsing
4.1 Use DOMParser to parse HTML
Add the parsing logic to the usePageParser composable:
const parseContent = () => {
  if (!htmlContent.value) return null
  const parser = new DOMParser()
  const doc = parser.parseFromString(htmlContent.value, 'text/html')
  return {
    title: doc.title,
    meta: extractMeta(doc),
    headings: extractHeadings(doc),
    paragraphs: extractParagraphs(doc),
    links: extractLinks(doc),
    images: extractImages(doc)
  }
}

// Extract meta tags, keyed by name, property, or itemprop
const extractMeta = (doc) => {
  const metas = {}
  doc.querySelectorAll('meta').forEach(meta => {
    const name = meta.getAttribute('name') || meta.getAttribute('property') || meta.getAttribute('itemprop')
    if (name) {
      metas[name] = meta.getAttribute('content')
    }
  })
  return metas
}

// Extract headings, grouped by level (h1 to h6)
const extractHeadings = (doc) => {
  const headings = {}
  for (let i = 1; i <= 6; i++) {
    headings[`h${i}`] = Array.from(doc.querySelectorAll(`h${i}`))
      .map(h => h.textContent.trim())
  }
  return headings
}

// Extract non-empty paragraphs
const extractParagraphs = (doc) => {
  return Array.from(doc.querySelectorAll('p'))
    .map(p => p.textContent.trim())
    .filter(text => text.length > 0)
}

// Extract links
const extractLinks = (doc) => {
  return Array.from(doc.querySelectorAll('a[href]'))
    .map(a => ({
      text: a.textContent.trim(),
      href: a.getAttribute('href'),
      title: a.getAttribute('title') || ''
    }))
}

// Extract images
const extractImages = (doc) => {
  return Array.from(doc.querySelectorAll('img'))
    .map(img => ({
      src: img.getAttribute('src'),
      alt: img.getAttribute('alt') || '',
      width: img.width,
      height: img.height
    }))
}
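One detail the extractors above leave open: getAttribute('href') and getAttribute('src') return the raw attribute values, which are often relative to the page that was fetched, not to your app. A minimal sketch of resolving them against the original page URL follows. The helper names resolveUrl and absolutizeLinks are hypothetical (not part of the article's code), and the sketch assumes the caller still has the URL that was passed to fetchPage.

```javascript
// Resolve a possibly relative URL against the page it was extracted from.
// Returns null for empty values and for values that cannot be parsed as a URL.
function resolveUrl(rawValue, pageUrl) {
  if (!rawValue) return null
  try {
    return new URL(rawValue, pageUrl).href
  } catch {
    return null
  }
}

// Hypothetical post-processing step for the array returned by extractLinks().
// Links whose href cannot be resolved are dropped.
function absolutizeLinks(links, pageUrl) {
  return links
    .map(link => ({ ...link, href: resolveUrl(link.href, pageUrl) }))
    .filter(link => link.href !== null)
}
```

The same resolveUrl call could be applied to the src values from extractImages, so that previews load from the target site instead of from the app's own origin.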
4.2 Advanced content extraction techniques
4.2.1 Extract main content areas
const extractMainContent = (doc) => {
  // Try common content selectors
  const selectors = [
    'article', '.article', '.content', '.main-content',
    '.post-content', 'main', '#main'
  ]
  for (const selector of selectors) {
    const element = doc.querySelector(selector)
    if (element) {
      return {
        html: element.innerHTML,
        text: element.textContent.trim(),
        wordCount: element.textContent.trim().split(/\s+/).length
      }
    }
  }

  // Heuristic: find the top-level element with the most text
  const allElements = Array.from(doc.querySelectorAll('body > *'))
  let maxTextLength = 0
  let mainElement = null
  allElements.forEach(el => {
    const textLength = el.textContent.trim().length
    if (textLength > maxTextLength) {
      maxTextLength = textLength
      mainElement = el
    }
  })

  return mainElement ? {
    html: mainElement.innerHTML,
    text: mainElement.textContent.trim(),
    wordCount: mainElement.textContent.trim().split(/\s+/).length
  } : null
}
4.2.2 Extract structured data (microdata, JSON-LD)
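const extractStructuredData = (doc) => {
  // Extract JSON-LD data; blocks that fail to parse are skipped
  const jsonLdScripts = Array.from(doc.querySelectorAll('script[type="application/ld+json"]'))
  const jsonLdData = jsonLdScripts.map(script => {
    try {
      return JSON.parse(script.textContent)
    } catch (e) {
      console.error('Parse JSON-LD failed:', e)
      return null
    }
  }).filter(Boolean)

  // Extract microdata
  const microdata = {}
  doc.querySelectorAll('[itemscope]').forEach((scope, index) => {
    // Fall back to a positional key when the item has no itemid
    microdata[scope.getAttribute('itemid') || `item-${index}`] = extractStructuredDataFromElement(scope)
  })

  return { jsonLd: jsonLdData, microdata }
}

// Read one itemscope element into { type, properties }; nested items recurse
const extractStructuredDataFromElement = (scope) => {
  const item = { type: scope.getAttribute('itemtype'), properties: {} }
  scope.querySelectorAll('[itemprop]').forEach(prop => {
    const propName = prop.getAttribute('itemprop')
    let value = prop.getAttribute('content') || prop.getAttribute('src') ||
      prop.getAttribute('href') || prop.textContent.trim()
    if (prop.hasAttribute('itemscope')) {
      // Nested items
      value = extractStructuredDataFromElement(prop)
    }
    item.properties[propName] = value
  })
  return item
}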
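Parsed JSON-LD blocks come in several shapes: a single object, an array of objects, or an object whose entities sit under an @graph key. The following is a rough sketch of normalizing them into one flat list before display. The function names flattenJsonLd and findByType are hypothetical, and the sketch assumes its input is the already parsed jsonLd array returned by extractStructuredData.

```javascript
// Normalize parsed JSON-LD blocks into a flat list of entity objects.
// Handles plain objects, arrays, and objects that wrap entities in "@graph".
function flattenJsonLd(blocks) {
  const entities = []
  const visit = (node) => {
    if (Array.isArray(node)) {
      node.forEach(visit)
    } else if (node && typeof node === 'object') {
      if (Array.isArray(node['@graph'])) {
        node['@graph'].forEach(visit)
      } else {
        entities.push(node)
      }
    }
  }
  visit(blocks)
  return entities
}

// Select entities by "@type", which may be a string or an array of strings.
function findByType(entities, type) {
  return entities.filter(entity => {
    const declared = entity['@type']
    return Array.isArray(declared) ? declared.includes(type) : declared === type
  })
}
```

For example, findByType(flattenJsonLd(jsonLd), 'Article') would collect article entities regardless of how the page grouped its script blocks.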
5. Build a user interface
5.1 Create the Control Panel Component
The ParserControls component, under components/:
<template>
  <div class="parser-controls">
    <el-form @submit.prevent="handleSubmit">
      <el-form-item label="Target URL">
        <el-input v-model="url" placeholder="Enter the web address to parse" :disabled="isLoading">
          <template #append>
            <el-button type="primary" native-type="submit" :loading="isLoading">Analysis</el-button>
          </template>
        </el-input>
      </el-form-item>
      <el-form-item label="Parsing options">
        <el-checkbox-group v-model="options">
          <el-checkbox label="Extract title">Title</el-checkbox>
          <el-checkbox label="Extract metadata">Metadata</el-checkbox>
          <el-checkbox label="Extract the text">Text</el-checkbox>
          <el-checkbox label="Extract link">Links</el-checkbox>
          <el-checkbox label="Extract pictures">Images</el-checkbox>
          <el-checkbox label="Extract structured data">Structured data</el-checkbox>
        </el-checkbox-group>
      </el-form-item>
      <el-form-item label="Advanced Options">
        <el-checkbox v-model="useRendering">Use dynamic rendering</el-checkbox>
        <el-tooltip content="Enable for JavaScript rendered pages">
          <el-icon><question-filled /></el-icon>
        </el-tooltip>
      </el-form-item>
    </el-form>
    <el-alert v-if="error" :title="error" type="error" show-icon class="error-alert" />
  </div>
</template>

<script setup>
import { ref } from 'vue'
import { QuestionFilled } from '@element-plus/icons-vue'

const emit = defineEmits(['parse'])

const url = ref('')
const options = ref(['Extract title', 'Extract metadata', 'Extract the text'])
const useRendering = ref(false)
const isLoading = ref(false)
const error = ref(null)

const handleSubmit = async () => {
  if (!url.value) {
    error.value = 'Please enter a valid URL'
    return
  }
  try {
    isLoading.value = true
    error.value = null
    // Verify URL format
    if (!isValidUrl(url.value)) {
      throw new Error('The URL format is invalid, please include http:// or https://')
    }
    emit('parse', {
      url: url.value,
      options: options.value,
      useRendering: useRendering.value
    })
  } catch (err) {
    error.value = err.message
  } finally {
    isLoading.value = false
  }
}

const isValidUrl = (string) => {
  try {
    new URL(string)
    return true
  } catch (_) {
    return false
  }
}
</script>

<style scoped>
.parser-controls {
  margin-bottom: 20px;
  padding: 20px;
  background: #fff;
  border-radius: 4px;
  box-shadow: 0 2px 12px 0 rgba(0, 0, 0, 0.1);
}
.error-alert { margin-top: 15px; }
</style>
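Note that new URL() accepts any scheme, so a check like isValidUrl lets values such as ftp://host or javascript:alert(1) through even though the error message asks for http:// or https://. A stricter variant might look like the sketch below. parseHttpUrl is a hypothetical helper name, and trimming surrounding whitespace is an added assumption about what users paste.

```javascript
// Parse user input and accept it only if it is an absolute http(s) URL.
// Returns the parsed URL object, or null when the input is not acceptable.
function parseHttpUrl(input) {
  let parsed
  try {
    parsed = new URL(String(input).trim())
  } catch {
    return null // not an absolute URL at all
  }
  const isHttp = parsed.protocol === 'http:' || parsed.protocol === 'https:'
  return isHttp ? parsed : null
}
```

Returning the parsed URL object instead of a boolean lets the caller reuse parsed.href or parsed.hostname without parsing the string a second time.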
5.2 Create the Result Display Component
The ResultViewer component, under components/:
<template>
  <div class="result-viewer">
    <el-tabs v-model="activeTab" type="card">
      <el-tab-pane label="Structured Data" name="structured">
        <el-collapse v-model="activeCollapse">
          <el-collapse-item v-if="result.title" title="Title" name="title">
            <div class="content-box">{{ result.title }}</div>
          </el-collapse-item>
          <el-collapse-item v-if="result.meta && Object.keys(result.meta).length" title="Metadata" name="meta">
            <el-table :data="metaTableData" border>
              <el-table-column prop="name" label="Name" width="180" />
              <el-table-column prop="value" label="Value" />
            </el-table>
          </el-collapse-item>
          <el-collapse-item v-if="result.headings && Object.keys(result.headings).length" title="Headings" name="headings">
            <div v-for="(headings, level) in result.headings" :key="level">
              <h3>{{ level.toUpperCase() }}</h3>
              <ul>
                <li v-for="(heading, index) in headings" :key="index">{{ heading }}</li>
              </ul>
            </div>
          </el-collapse-item>
          <el-collapse-item v-if="result.mainContent" title="Main Content" name="content">
            <div class="content-box">
              <p v-for="(para, index) in result.mainContent.text.split('\n\n')" :key="index">{{ para }}</p>
            </div>
          </el-collapse-item>
          <el-collapse-item v-if="result.links && result.links.length" title="Links" name="links">
            <el-table :data="result.links" border>
              <el-table-column prop="text" label="Text" width="180" />
              <el-table-column prop="href" label="URL">
                <template #default="{ row }">
                  <el-link :href="row.href" rel="external nofollow" target="_blank">{{ row.href }}</el-link>
                </template>
              </el-table-column>
              <el-table-column prop="title" label="Title" />
            </el-table>
          </el-collapse-item>
          <el-collapse-item v-if="result.images && result.images.length" title="Images" name="images">
            <div class="image-grid">
              <div v-for="(img, index) in result.images" :key="index" class="image-item">
                <el-image :src="img.src" :alt="img.alt" lazy :preview-src-list="previewImages" />
                <div class="image-meta">
                  <p><strong>Alt:</strong> {{ img.alt || 'none' }}</p>
                  <p><strong>Size:</strong> {{ img.width }}×{{ img.height }}</p>
                </div>
              </div>
            </div>
          </el-collapse-item>
          <el-collapse-item v-if="result.structuredData && result.structuredData.jsonLd.length" title="JSON-LD" name="jsonLd">
            <pre>{{ JSON.stringify(result.structuredData.jsonLd, null, 2) }}</pre>
          </el-collapse-item>
        </el-collapse>
      </el-tab-pane>
      <el-tab-pane label="Raw HTML" name="html">
        <div class="html-viewer">
          <el-button type="primary" size="small" @click="copyHtml" class="copy-btn">Copy HTML</el-button>
          <pre>{{ htmlContent }}</pre>
        </div>
      </el-tab-pane>
    </el-tabs>
  </div>
</template>

<script setup>
import { computed, ref } from 'vue'
import { ElMessage } from 'element-plus'

const props = defineProps({
  result: { type: Object, required: true },
  htmlContent: { type: String, default: '' }
})

const activeTab = ref('structured')
const activeCollapse = ref(['title', 'meta', 'content'])

// Turn the meta object into rows for el-table
const metaTableData = computed(() => {
  return Object.entries(props.result.meta || {}).map(([name, value]) => ({ name, value }))
})

const previewImages = computed(() => {
  return (props.result.images || []).map(img => img.src)
})

const copyHtml = () => {
  navigator.clipboard.writeText(props.htmlContent)
    .then(() => ElMessage.success('HTML has been copied'))
    .catch(() => ElMessage.error('Copy failed'))
}
</script>

<style scoped>
.result-viewer { background: #fff; padding: 20px; border-radius: 4px; box-shadow: 0 2px 12px 0 rgba(0, 0, 0, 0.1); }
.content-box { padding: 10px; background: #f5f7fa; border-radius: 4px; white-space: pre-wrap; }
.image-grid { display: grid; grid-template-columns: repeat(auto-fill, minmax(200px, 1fr)); gap: 15px; }
.image-item { border: 1px solid #ebeef5; border-radius: 4px; padding: 10px; }
.image-meta { padding-top: 8px; font-size: 12px; }
.html-viewer { position: relative; }
.copy-btn { position: absolute; top: 10px; right: 10px; z-index: 1; }
pre { background: #f5f7fa; padding: 15px; border-radius: 4px; max-height: 500px; overflow: auto; margin-top: 10px; }
</style>
5.3 Main page integration
The root component:
<template>
  <div class="page-parser-app">
    <el-container>
      <el-header>
        <h1>Web content analysis tool</h1>
      </el-header>
      <el-main>
        <parser-controls @parse="handleParse" />
        <el-skeleton v-if="isLoading" :rows="10" animated />
        <template v-else>
          <result-viewer v-if="result" :result="result" :html-content="htmlContent" />
          <el-empty v-else description="Enter the URL and click the parse button to start" />
        </template>
      </el-main>
      <el-footer>
        <p>© 2023 Web page parsing tool - For learning only</p>
      </el-footer>
    </el-container>
  </div>
</template>

<script setup>
import { ref } from 'vue'
import ParserControls from './components/ParserControls.vue'
import ResultViewer from './components/ResultViewer.vue'
import usePageParser from './composables/usePageParser'

// fetchRenderedPage and parseContent must also be returned by usePageParser
const { htmlContent, isLoading, error, fetchPage, fetchRenderedPage, parseContent } = usePageParser()
const result = ref(null)

const handleParse = async ({ url, useRendering }) => {
  try {
    if (useRendering) {
      await fetchRenderedPage(url)
    } else {
      await fetchPage(url)
    }
    result.value = parseContent()
  } catch (err) {
    console.error('Parsing failed:', err)
  }
}
</script>

<style>
.page-parser-app { min-height: 100vh; }
.el-header { background-color: #409EFF; color: white; display: flex; align-items: center; justify-content: center; }
.el-footer { text-align: center; padding: 20px; color: #666; font-size: 14px; }
.el-main { max-width: 1200px; margin: 0 auto; padding: 20px; }
</style>
6. Safety and Optimization
6.1 Content Sanitization
Create utils/:
// Simple HTML sanitization: escape the input so it renders as plain text
export function sanitizeHtml(html) {
  const div = document.createElement('div')
  div.textContent = html
  return div.innerHTML
}

// More thorough sanitization (consider using the DOMPurify library in real projects)
export function sanitizeHtmlAdvanced(html) {
  const allowedTags = ['p', 'br', 'b', 'i', 'strong', 'em', 'ul', 'ol', 'li', 'h1', 'h2', 'h3', 'h4']
  const doc = new DOMParser().parseFromString(html, 'text/html')

  const removeDisallowed = (node) => {
    Array.from(node.children).forEach(child => {
      if (!allowedTags.includes(child.tagName.toLowerCase())) {
        // Drop the disallowed element together with its subtree
        child.remove()
      } else {
        // Remove all attributes
        while (child.attributes.length > 0) {
          child.removeAttribute(child.attributes[0].name)
        }
        removeDisallowed(child)
      }
    })
  }

  removeDisallowed(doc.body)
  return doc.body.innerHTML
}
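The simple sanitizer relies on a live DOM to escape its input, which is not available in Node.js or inside a worker. As a rough DOM-free equivalent, the sketch below escapes the characters that are significant in HTML. escapeHtml is a hypothetical helper, and it also escapes quotes so that the result is safe inside attribute values, which goes slightly beyond what the DOM-based version does.

```javascript
// Characters with special meaning in HTML and their entity replacements.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;'
}

// Escape text so it can be inserted into HTML without being parsed as markup.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, ch => HTML_ESCAPES[ch])
}
```

Escaping is only appropriate when the content should be shown as text. When some markup must survive, an allow-list approach such as sanitizeHtmlAdvanced or DOMPurify is the better fit.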
6.2 Performance optimization
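6.2.1 Virtual scrolling for large data sets
<template>
  <!-- el-table has no virtual-scroll option; Element Plus provides a separate virtualized table, el-table-v2. -->
  <!-- Columns are passed as an array of definitions; width and height are required. -->
  <el-table-v2 :columns="columns" :data="tableData" :width="800" :height="500" :row-height="50" />
</template>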
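6.2.2 Using a Web Worker to handle large documents
Create workers/:
// DOMParser is not available inside Web Workers, so the worker must use a
// DOM-free HTML parser or work on data already extracted on the main thread.
self.onmessage = function (e) {
  const { html } = e.data
  // Execute parsing logic with a worker-compatible parser...
  const parsedData = parseHtmlWithoutDom(html) // placeholder for your own implementation
  self.postMessage({ result: parsedData })
}
Use in components:
const parseWithWorker = (html) => {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./workers/', { type: 'module' })
    worker.onmessage = (e) => {
      resolve(e.data.result)
      worker.terminate()
    }
    worker.onerror = (err) => {
      reject(err)
      worker.terminate()
    }
    worker.postMessage({ html })
  })
}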
6.3 Error handling and user feedback
A more robust error handling flow:
const handleParse = async ({ url, useRendering }) => {
  try {
    isLoading.value = true
    error.value = null
    result.value = null

    // Verify URL
    if (!isValidUrl(url)) {
      throw new Error('Invalid URL format, please include http:// or https://')
    }

    // Check whether the URL is reachable
    const isReachable = await checkUrlReachability(url)
    if (!isReachable) {
      throw new Error('The target URL is not accessible, please check whether the network or URL is correct')
    }

    // Get content
    const html = useRendering ? await fetchRenderedPage(url) : await fetchPage(url)

    // Parse content
    result.value = await parseContent(html)

    ElNotification({
      title: 'Parsing succeeded',
      message: `Successfully parsed ${url}`,
      type: 'success'
    })
  } catch (err) {
    console.error('Parsing failed:', err)
    error.value = err.message
    ElNotification({
      title: 'Parsing failed',
      message: err.message,
      type: 'error',
      duration: 0,
      showClose: true
    })
  } finally {
    isLoading.value = false
  }
}

// In the browser this request is itself subject to CORS,
// so in practice it should also go through the proxy.
const checkUrlReachability = async (url) => {
  try {
    const response = await axios.head(url, { timeout: 5000 })
    return response.status < 400
  } catch {
    return false
  }
}
7. Advanced Feature Extensions
7.1 Custom parsing rules
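// Add to the usePageParser composable
// Assumed rule shape: { name, selector, fields: [{ name, extract(el) }] }
const customRules = ref([])

const addCustomRule = (rule) => {
  customRules.value.push(rule)
}

const applyCustomRules = (doc) => {
  return customRules.value.map(rule => {
    try {
      const elements = doc.querySelectorAll(rule.selector)
      return {
        name: rule.name,
        result: Array.from(elements).map(el => {
          const data = {}
          rule.fields.forEach(field => {
            data[field.name] = field.extract(el)
          })
          return data
        })
      }
    } catch (err) {
      // A failing rule reports its error without breaking the other rules
      return { name: rule.name, error: err.message }
    }
  })
}

// Use in parseContent
const parseContent = () => {
  // ...Other parsing logic
  return {
    // ...Other results
    customData: applyCustomRules(doc)
  }
}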
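To make the idea of a custom rule concrete, here is one possible shape for a rule object together with a small validator that could run before addCustomRule accepts it. The rule layout (name, selector, and a fields array of name/extract pairs), the productRule example, and the validateRule helper are all assumptions made for illustration; the article does not spell out the exact format.

```javascript
// Hypothetical rule: collect product cards from a listing page.
// The selector and field names are purely illustrative.
const productRule = {
  name: 'products',
  selector: '.product-card',
  fields: [
    { name: 'title', extract: el => el.querySelector('h2')?.textContent.trim() ?? '' },
    { name: 'price', extract: el => el.getAttribute('data-price') }
  ]
}

// Return a list of problems with a rule; an empty list means the rule is usable.
function validateRule(rule) {
  if (!rule || typeof rule !== 'object') return ['rule must be an object']
  const problems = []
  if (typeof rule.name !== 'string' || rule.name === '') {
    problems.push('name must be a non-empty string')
  }
  if (typeof rule.selector !== 'string' || rule.selector === '') {
    problems.push('selector must be a non-empty string')
  }
  if (!Array.isArray(rule.fields) || rule.fields.length === 0) {
    problems.push('fields must be a non-empty array')
  } else {
    rule.fields.forEach((field, i) => {
      if (!field || typeof field.name !== 'string' || field.name === '') {
        problems.push(`fields[${i}].name must be a non-empty string`)
      }
      if (!field || typeof field.extract !== 'function') {
        problems.push(`fields[${i}].extract must be a function`)
      }
    })
  }
  return problems
}
```

Validating up front keeps malformed rules out of the rule list, so the try/catch in the rule runner only has to deal with failures that happen while extracting.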
7.2 Save and load parsing configuration
// Save configuration
const saveConfig = (config) => {
  localStorage.setItem('parserConfig', JSON.stringify(config))
}

// Load configuration
const loadConfig = () => {
  const config = localStorage.getItem('parserConfig')
  return config ? JSON.parse(config) : null
}

// Use in components
onMounted(() => {
  const savedConfig = loadConfig()
  if (savedConfig) {
    url.value = savedConfig.url
    useRendering.value = savedConfig.useRendering
  }
})

const handleParse = async (params) => {
  saveConfig(params)
  // ...Parse logic
}
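Reading from localStorage can fail in ways the snippet above does not cover: the stored JSON may be corrupted, or storage may be unavailable or full. One way to guard against this is sketched below. createConfigStore is a hypothetical helper that takes the storage object as a parameter, which is an added design assumption that also makes the logic testable outside the browser.

```javascript
// Wrap a Storage-like object (getItem/setItem) with error-tolerant helpers.
function createConfigStore(storage, key = 'parserConfig') {
  return {
    // Returns true when the config was written, false when storage rejected it.
    save(config) {
      try {
        storage.setItem(key, JSON.stringify(config))
        return true
      } catch {
        return false // e.g. quota exceeded or storage disabled
      }
    },
    // Returns the stored config, or the fallback when nothing usable is stored.
    load(fallback = null) {
      try {
        const raw = storage.getItem(key)
        return raw == null ? fallback : JSON.parse(raw)
      } catch {
        return fallback // corrupted JSON or storage unavailable
      }
    }
  }
}
```

In the browser this would be created once with createConfigStore(localStorage), after which save and load can stand in for saveConfig and loadConfig.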
7.3 Export the parsing results
// convertToCsv and generateHtmlReport are helpers you need to implement yourself
const exportResults = (format = 'json') => {
  if (!result.value) return

  let content, mimeType, extension
  switch (format) {
    case 'json':
      content = JSON.stringify(result.value, null, 2)
      mimeType = 'application/json'
      extension = 'json'
      break
    case 'csv':
      content = convertToCsv(result.value)
      mimeType = 'text/csv'
      extension = 'csv'
      break
    case 'html':
      content = generateHtmlReport(result.value)
      mimeType = 'text/html'
      extension = 'html'
      break
    default:
      throw new Error('Unsupported export format')
  }

  // Trigger a download through a temporary object URL
  const blob = new Blob([content], { type: mimeType })
  const url = URL.createObjectURL(blob)
  const a = document.createElement('a')
  a.href = url
  a.download = `page-analysis-${new Date().toISOString()}.${extension}`
  a.click()
  URL.revokeObjectURL(url)
}

<!-- Add export buttons in the ResultViewer component -->
<el-button-group class="export-buttons">
  <el-button @click="exportResults('json')">Export JSON</el-button>
  <el-button @click="exportResults('csv')">Export CSV</el-button>
  <el-button @click="exportResults('html')">Export HTML</el-button>
</el-button-group>
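The export code calls convertToCsv without showing it. Since the parsing result is nested, a CSV export most naturally covers one tabular part of it, such as the extracted links. The sketch below takes that approach; its signature (an array of row objects plus a list of column names) is an assumption and differs from the single-argument call in the export function, so a caller would pass something like the links array and ['text', 'href', 'title'].

```javascript
// Quote a value for CSV when it contains a comma, quote, or line break.
// Embedded quotes are doubled, following the usual CSV convention (RFC 4180).
function toCsvCell(value) {
  const text = value === null || value === undefined ? '' : String(value)
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text
}

// Build a CSV document with a header line from an array of row objects.
function convertToCsv(rows, columns) {
  const header = columns.map(toCsvCell).join(',')
  const lines = rows.map(row => columns.map(col => toCsvCell(row[col])).join(','))
  return [header, ...lines].join('\r\n')
}
```

Missing properties become empty cells, so rows with different shapes still produce a rectangular file.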
8. Testing and debugging
8.1 Unit Test Example
// Assumes extractMeta and extractHeadings are exported as named exports
// and that the test runner provides a DOM environment (for example jsdom)
import { extractMeta, extractHeadings } from '../composables/usePageParser'

describe('HTML parsing function', () => {
  test('Extract meta tags', () => {
    const doc = new DOMParser().parseFromString(`
      <html>
        <head>
          <meta name="description" content="Test Page">
          <meta property="og:title" content="OG Title">
        </head>
      </html>
    `, 'text/html')
    const meta = extractMeta(doc)
    expect(meta.description).toBe('Test Page')
    expect(meta['og:title']).toBe('OG Title')
  })

  test('Extract headings', () => {
    const doc = new DOMParser().parseFromString(`
      <html>
        <body>
          <h1>Main Title</h1>
          <h2>Subtitle 1</h2>
          <h2>Subtitle 2</h2>
        </body>
      </html>
    `, 'text/html')
    const headings = extractHeadings(doc)
    expect(headings.h1).toEqual(['Main Title'])
    expect(headings.h2).toEqual(['Subtitle 1', 'Subtitle 2'])
  })
})
8.2 E2E Test
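// Cypress end-to-end test
describe('Page parsing tool', () => {
  it('Successfully parsed page', () => {
    cy.visit('/')
    // https://example.com is a placeholder; use a page your proxy can reach
    cy.get('input[placeholder="Enter the web address to parse"]').type('https://example.com')
    cy.contains('Analysis').click()
    cy.get('.el-skeleton').should('exist')
    cy.get('.el-skeleton', { timeout: 10000 }).should('not.exist')
    cy.contains('Structured Data').should('exist')
  })

  it('Show error message', () => {
    cy.visit('/')
    // Submitting an empty form triggers the client-side validation message
    cy.contains('Analysis').click()
    cy.contains('Please enter a valid URL').should('exist')
  })
})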
8.3 Debugging skills
1. Use Chrome Developer Tools:
- Check network requests
- Debug proxy server response
- View the parsed DOM structure
2. Logging:
const debug = ref(false)

const log = (...args) => {
  if (debug.value) {
    console.log('[Parser]', ...args)
  }
}

// Use in parsing functions
const parseContent = () => {
  log('Start parsing HTML content')
  // ...Parse logic
}
3. Performance analysis:
const measureTime = async (name, fn) => {
  const start = performance.now()
  const result = await fn()
  const duration = performance.now() - start
  console.log(`${name} took ${duration.toFixed(2)}ms`)
  return result
}

// Usage example
const html = await measureTime('Get Page', () => fetchPage(url))
9. Deployment and Production Environment Considerations
9.1 Build a production version
npm run build
9.2 Proxy server deployment
Node.js server:
- Manage processes using PM2
- Configure an Nginx reverse proxy
- Set environment variables
Docker deployment:
# Dockerfile
FROM node:16
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build
EXPOSE 3000
CMD ["node", ""]
9.3 Security Configuration
Restrict proxy access:
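// Only proxy requests to explicitly allowed domains
const allowedDomains = ['', '']

app.get('/api/proxy', async (req, res) => {
  const { url } = req.query
  let domain
  try {
    domain = new URL(url).hostname
  } catch {
    // new URL throws on a missing or malformed URL
    return res.status(400).json({ error: 'Invalid URL' })
  }
  if (!allowedDomains.includes(domain)) {
    return res.status(403).json({ error: 'No access to this domain name' })
  }
  // ...Continue proxy logic
})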
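An exact hostname comparison rejects subdomains of allowed sites, and a careless suffix comparison would accept look-alike hosts. The following sketch shows one way to check a target against an allow-list that also restricts the scheme. isAllowedTarget is a hypothetical helper, and treating subdomains of an allowed domain as allowed is an assumption you may not want in every deployment.

```javascript
// Decide whether the proxy may fetch rawUrl, given a list of allowed domains.
// Accepts only http(s) URLs whose host is an allowed domain or one of its subdomains.
function isAllowedTarget(rawUrl, allowedDomains) {
  let target
  try {
    target = new URL(rawUrl)
  } catch {
    return false // missing or malformed URL
  }
  if (target.protocol !== 'http:' && target.protocol !== 'https:') return false

  const host = target.hostname.toLowerCase()
  return allowedDomains.some(domain => {
    const allowed = domain.toLowerCase()
    // The leading dot prevents "evil-example.com" from matching "example.com"
    return host === allowed || host.endsWith(`.${allowed}`)
  })
}
```

Inside the route handler, a false result would translate into the 403 response shown above.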
Rate limit:
const rateLimit = require('express-rate-limit')

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100 // Limit each IP to 100 requests per window
})

app.use('/api/proxy', limiter)
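To illustrate what windowMs and max mean, here is a simplified fixed-window counter. It is only a sketch of the general idea and not how express-rate-limit is implemented internally; createRateLimiter and its injectable now clock are hypothetical names chosen for the example.

```javascript
// Fixed-window rate limiter: each key (for example a client IP) may make
// at most `max` requests per window of `windowMs` milliseconds.
function createRateLimiter({ windowMs, max, now = Date.now }) {
  const hits = new Map() // key -> { count, windowStart }

  return function isAllowed(key) {
    const current = now()
    const entry = hits.get(key)
    // Start a fresh window for new keys and for keys whose window has elapsed
    if (!entry || current - entry.windowStart >= windowMs) {
      hits.set(key, { count: 1, windowStart: current })
      return true
    }
    entry.count += 1
    return entry.count <= max
  }
}
```

A production limiter additionally needs to evict stale keys and, when several server instances run behind a load balancer, a shared store; both are left out here.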
HTTPS configuration:
const https = require('https')
const fs = require('fs')

const options = {
  key: fs.readFileSync(''),
  cert: fs.readFileSync('')
}

https.createServer(options, app).listen(443)
10. Summary and Best Practices
10.1 Summary of key points
Architecture design:
- Separate the front end from the back end and use a proxy to work around CORS
- Component-based design improves maintainability
- Composables encapsulate the core logic
Functional implementation:
- Supports static and dynamic page acquisition
- Comprehensive content analysis capabilities
- Flexible results display and export
Performance optimization:
- Virtual scrolling for large data sets
- Web Workers for expensive processing
- A reasonable caching strategy
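As one illustration of a caching strategy, the proxy could keep recently fetched pages in memory for a short time, keyed by URL. The sketch below is a small TTL cache with a size cap. createTtlCache, its parameters, and the choice to evict the oldest inserted entry are all assumptions made for the example.

```javascript
// In-memory cache whose entries expire after ttlMs milliseconds.
// When the cache is full, the entry that was inserted first is evicted.
function createTtlCache({ ttlMs, maxEntries = 100, now = Date.now }) {
  const entries = new Map() // key -> { value, expiresAt }; Map keeps insertion order

  return {
    get(key) {
      const entry = entries.get(key)
      if (!entry) return undefined
      if (entry.expiresAt <= now()) {
        entries.delete(key) // expired: drop it and report a miss
        return undefined
      }
      return entry.value
    },
    set(key, value) {
      entries.delete(key) // re-insert so an updated key counts as the newest
      if (entries.size >= maxEntries) {
        entries.delete(entries.keys().next().value) // evict the oldest insertion
      }
      entries.set(key, { value, expiresAt: now() + ttlMs })
    },
    get size() {
      return entries.size
    }
  }
}
```

A proxy handler would check cache.get(url) before fetching and call cache.set(url, html) after a successful fetch. A short TTL keeps the risk of serving stale pages low.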
10.2 Best Practices
Security:
- Always sanitize user input and parsing results
- Restrict the domains the proxy is allowed to access
- Implement rate limits to prevent abuse
User experience:
- Provide a clear loading state
- Detailed error feedback
- Customizable parsing options
Maintainability:
- Modular code structure
- Comprehensive test coverage
- Detailed documentation comments
10.3 Expanding ideas
Enhanced parsing capabilities:
- Add PDF/Word document parsing
- Support RSS/Atom feeds
- Image OCR recognition
Integrate other services:
- Save to database
- Send to the data analysis platform
- Integrate into the workflow system
AI enhancement:
- Automatically classify content
- Summary generation
- Sentiment Analysis
This guide has covered the complete process of fetching and parsing web content in a Vue project. From the basic implementation to advanced features, and from security to performance optimization, this solution covers most web content parsing needs and provides a solid foundation for further extension.