PHPRO.ORG

Get Links With DOM

Do not use REGEX to parse HTML

Perhaps the biggest mistake people make when trying to get URLs or link text from a web page is trying to do it with regular expressions. The job can be done with regular expressions, but there is a high overhead in having preg loop over the entire document many times. The correct, faster, and infinitely cooler way is to use DOM.

By using DOM in the getLinks function, it is simple to create an array containing all the links on a web page as keys and the link names as values. Because the URLs are used as array keys, a URL that appears more than once on the page is stored only once. This array can then be looped over like any other array to build a list, or manipulated in any way desired.

Note that error suppression is used when loading the HTML. This suppresses warnings about malformed markup, such as HTML entities that are not defined in the DOCTYPE. Of course, in a production environment, the display of errors would be disabled and error reporting set to none.
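As an alternative to the @ operator, libxml can be told to collect parser errors internally so they can be inspected or discarded later. The following is only a sketch: the helper name loadHtmlQuietly is hypothetical and not part of the original function, and it takes an HTML string rather than a URL so that it runs without network access.

```php
<?php
/*** hypothetical helper, not part of the original article ***/
function loadHtmlQuietly($html)
{
    /*** collect libxml errors internally instead of emitting warnings ***/
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    /*** keep whatever the parser complained about, then restore the old setting ***/
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return array($dom, $errors);
}

/*** deliberately sloppy markup: the <p> tag is never closed ***/
list($dom, $errors) = loadHtmlQuietly('<p>Read the <a href="/docs.php">Documentation</a>');

echo count($errors) . " parser message(s) collected\n";
echo $dom->getElementsByTagName('a')->length . " link(s) found\n";
```

The number of messages collected depends on the libxml version in use, so it should not be relied upon. The getLinks function below keeps the simpler @ form.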


<?php
function getLinks($link)
{
    /*** return array ***/
    $ret = array();

    /*** a new dom object ***/
    $dom = new DOMDocument;

    /*** get the HTML (suppress errors) ***/
    @$dom->loadHTML(file_get_contents($link));

    /*** remove silly white space ***/
    $dom->preserveWhiteSpace = false;

    /*** get the links from the HTML ***/
    $links = $dom->getElementsByTagName('a');

    /*** loop over the links ***/
    foreach ($links as $tag)
    {
        /*** use the first child node as the link name, or an empty string if the anchor is empty ***/
        $first = $tag->childNodes->item(0);
        $ret[$tag->getAttribute('href')] = $first ? $first->nodeValue : '';
    }

    return $ret;
}
?>

A similar approach would be to use XPath, which achieves the same results. Either way, using the DOM is going to prove far more efficient than REGEX.
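The XPath approach mentioned above might look something like the following sketch. The function name getLinksXPath is hypothetical, and it accepts an HTML string instead of a URL so the example is self-contained. It also differs slightly from getLinks: the query //a[@href] skips anchors that have no href attribute, and textContent is read from the whole anchor rather than from its first child only.

```php
<?php
/*** hypothetical XPath variant of getLinks(); takes an HTML string ***/
function getLinksXPath($html)
{
    /*** return array ***/
    $ret = array();

    /*** load the HTML (suppress errors) ***/
    $dom = new DOMDocument;
    @$dom->loadHTML($html);

    /*** select only the anchors that carry an href attribute ***/
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//a[@href]') as $tag)
    {
        /*** textContent gathers the text of all descendant nodes ***/
        $ret[$tag->getAttribute('href')] = trim($tag->textContent);
    }

    return $ret;
}

/*** sample markup: one plain link, one named anchor, one link with nested markup ***/
$html = '<p><a href="/docs.php">Documentation</a> '
      . '<a name="top">no href</a> '
      . '<a href="/support.php"><b>Help</b></a></p>';

print_r(getLinksXPath($html));
```

For a real page, the string could come from file_get_contents($link), exactly as in getLinks.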

Example Usage


<?php
    /*** a link to search ***/
    $link = "http://php.net";

    /*** get the links ***/
    $urls = getLinks($link);

    /*** check for results ***/
    if(sizeof($urls) > 0)
    {
        foreach($urls as $key=>$value)
        {
            echo $key . ' - ' . $value . '<br />';
        }
    }
    else
    {
        echo "No links found at $link";
    }
?>

Demonstration

/ -
/downloads.php - Download
/docs.php - Documentation
/get-involved.php - Get Involved
/support.php - Help
/releases/8.4/index.php - What's new in 8.4
/lookup-form.php -
/menu.php -
/manual/en/getting-started.php - Getting Started
/manual/en/introduction.php - Introduction
/manual/en/tutorial.php - A simple tutorial
/manual/en/langref.php - Language Reference
/manual/en/language.basic-syntax.php - Basic syntax
/manual/en/language.types.php - Types
/manual/en/language.variables.php - Variables
/manual/en/language.constants.php - Constants
/manual/en/language.expressions.php - Expressions
/manual/en/language.operators.php - Operators
/manual/en/language.control-structures.php - Control Structures
/manual/en/language.functions.php - Functions
/manual/en/language.oop5.php - Classes and Objects
/manual/en/language.namespaces.php - Namespaces
/manual/en/language.enumerations.php - Enumerations
/manual/en/language.errors.php - Errors
/manual/en/language.exceptions.php - Exceptions
/manual/en/language.fibers.php - Fibers
/manual/en/language.generators.php - Generators
/manual/en/language.attributes.php - Attributes
/manual/en/language.references.php - References Explained
/manual/en/reserved.variables.php - Predefined Variables
/manual/en/reserved.exceptions.php - Predefined Exceptions
/manual/en/reserved.interfaces.php - Predefined Interfaces and Classes
/manual/en/reserved.attributes.php - Predefined Attributes
/manual/en/context.php - Context options and parameters
/manual/en/wrappers.php - Supported Protocols and Wrappers
/manual/en/security.php - Security
/manual/en/security.intro.php - Introduction
/manual/en/security.general.php - General considerations
/manual/en/security.cgi-bin.php - Installed as CGI binary
/manual/en/security.apache.php - Installed as an Apache module
/manual/en/security.sessions.php - Session Security
/manual/en/security.filesystem.php - Filesystem Security
/manual/en/security.database.php - Database Security
/manual/en/security.errors.php - Error Reporting
/manual/en/security.variables.php - User Submitted Data
/manual/en/security.hiding.php - Hiding PHP
/manual/en/security.current.php - Keeping Current
/manual/en/features.php - Features
/manual/en/features.http-auth.php - HTTP authentication with PHP
/manual/en/features.cookies.php - Cookies
/manual/en/features.sessions.php - Sessions
/manual/en/features.file-upload.php - Handling file uploads
/manual/en/features.remote-files.php - Using remote files
/manual/en/features.connection-handling.php - Connection handling
/manual/en/features.persistent-connections.php - Persistent Database Connections
/manual/en/features.commandline.php - Command line usage
/manual/en/features.gc.php - Garbage Collection
/manual/en/features.dtrace.php - DTrace Dynamic Tracing
/manual/en/funcref.php - Function Reference
/manual/en/refs.basic.php.php - Affecting PHP's Behaviour
/manual/en/refs.utilspec.audio.php - Audio Formats Manipulation
/manual/en/refs.remote.auth.php - Authentication Services
/manual/en/refs.utilspec.cmdline.php - Command Line Specific Extensions
/manual/en/refs.compression.php - Compression and Archive Extensions
/manual/en/refs.crypto.php - Cryptography Extensions
/manual/en/refs.database.php - Database Extensions
/manual/en/refs.calendar.php - Date and Time Related Extensions
/manual/en/refs.fileprocess.file.php - File System Related Extensions
/manual/en/refs.international.php - Human Language and Character Encoding Support
/manual/en/refs.utilspec.image.php - Image Processing and Generation
/manual/en/refs.remote.mail.php - Mail Related Extensions
/manual/en/refs.math.php - Mathematical Extensions
/manual/en/refs.utilspec.nontext.php - Non-Text MIME Output
/manual/en/refs.fileprocess.process.php - Process Control Extensions
/manual/en/refs.basic.other.php - Other Basic Extensions
/manual/en/refs.remote.other.php - Other Services
/manual/en/refs.search.php - Search Engine Extensions
/manual/en/refs.utilspec.server.php - Server Specific Extensions
/manual/en/refs.basic.session.php - Session Extensions
/manual/en/refs.basic.text.php - Text Processing
/manual/en/refs.basic.vartype.php - Variable and Type Related Extensions
/manual/en/refs.webservice.php - Web Services
/manual/en/refs.utilspec.windows.php - Windows Only Extensions
/manual/en/refs.xml.php - XML Manipulation
/manual/en/refs.ui.php - GUI Extensions
/downloads.php#v8.4.5 - 8.4.5
/ChangeLog-8.php#8.4.5 - Changelog
/migration84 - Upgrading
/downloads.php#v8.3.19 - 8.3.19
/ChangeLog-8.php#8.3.19 - Changelog
/migration83 - Upgrading
/downloads.php#v8.2.28 - 8.2.28
/ChangeLog-8.php#8.2.28 - Changelog
/migration82 - Upgrading
/downloads.php#v8.1.32 - 8.1.32
/ChangeLog-8.php#8.1.32 - Changelog
/migration81 - Upgrading
https://www.php.net/archive/2025.php#2025-03-13-5 - PHP 8.2.28 Released!
https://www.php.net/downloads.php - downloads page
https://windows.php.net/download/ - windows.php.net/download/
https://www.php.net/ChangeLog-8.php#8.2.28 - ChangeLog
https://www.php.net/archive/2025.php#2025-03-13-4 - PHP 8.1.32 Released!
https://www.php.net/ChangeLog-8.php#8.1.32 - ChangeLog
https://www.php.net/archive/2025.php#2025-03-13-3 - PHP 8.4.5 Released!
https://www.php.net/ChangeLog-8.php#8.4.5 - ChangeLog
https://www.php.net/archive/2025.php#2025-03-13-1 - PHP 8.3.19 Released!
https://www.php.net/ChangeLog-8.php#8.3.19 - ChangeLog
https://www.php.net/archive/2025.php#2025-02-13-2 - PHP 8.3.17 Released!
https://www.php.net/ChangeLog-8.php#8.3.17 - ChangeLog
https://www.php.net/archive/2025.php#2025-02-13-1 - PHP 8.4.4 Released!
https://www.php.net/ChangeLog-8.php#8.4.4 - ChangeLog
https://www.php.net/archive/2025.php#2025-01-17-1 - PHP 8.4.3 Released!
https://www.php.net/ChangeLog-8.php#8.4.3 - ChangeLog
https://www.php.net/archive/2025.php#2025-01-16-1 - PHP 8.3.16 Released!
https://www.php.net/ChangeLog-8.php#8.3.16 - ChangeLog
https://www.php.net/archive/2024.php#2024-12-19-3 - PHP 8.4.2 Released!
https://www.php.net/ChangeLog-8.php#8.4.2 - ChangeLog
https://www.php.net/archive/2024.php#2024-12-19-2 - PHP 8.3.15 Released!
https://www.php.net/ChangeLog-8.php#8.3.15 - ChangeLog
https://www.php.net/archive/2024.php#2024-12-19-1 - PHP 8.2.27 Released!
https://www.php.net/ChangeLog-8.php#8.2.27 - ChangeLog
https://www.php.net/supported-versions - Supported Versions
https://www.php.net/archive/2024.php#2024-11-21-4 - PHP 8.4.1 Released!
https://www.php.net/manual/en/migration84.new-features.php#migration84.new-features.core.property-hooks - Property Hooks
https://www.php.net/manual/en/migration84.new-features.php#migration84.new-features.core.asymmetric-property-visibility - Asymmetric Property Visibility
https://www.php.net/manual/en/migration84.new-features.php#migration84.new-features.core.lazy-objects - Lazy Objects
https://www.php.net/manual/en/migration84.new-features.php#migration84.new-features.pdo - PDO driver-specific subclasses
https://www.php.net/manual/en/migration84.new-classes.php#migration84.new-classes.bcmath - BCMath object type
https://www.php.net/ChangeLog-8.php#8.4.1 - ChangeLog
https://php.net/manual/en/migration84.php - migration guide
https://www.php.net/archive/2024.php#2024-11-21-3 - PHP 8.1.31 Released!
https://www.php.net/ChangeLog-8.php#8.1.31 - ChangeLog
https://www.php.net/archive/2024.php#2024-11-21-2 - PHP 8.3.14 Released!
https://www.php.net/ChangeLog-8.php#8.3.14 - ChangeLog
https://www.php.net/archive/2024.php#2024-11-21-1 - PHP 8.2.26 Released!
https://www.php.net/ChangeLog-8.php#8.2.26 - ChangeLog
https://www.php.net/archive/2024.php#2024-11-07-1 - PHP 8.4.0 RC4 available for testing
https://wiki.php.net/todo/php84 - PHP Wiki
https://downloads.php.net/~calvinb - download page
https://github.com/php/php-src/issues - bug reporting system
https://github.com/php/php-src/blob/php-8.4.0RC4/NEWS - NEWS
https://github.com/php/php-src/blob/php-8.4.0RC4/UPGRADING - UPGRADING
https://gist.github.com/NattyNarwhal/6a107fe3b862ac3e2cf03b013e151eba - the manifest
https://qa.php.net/ - the QA site
https://www.php.net/archive/2024.php#2024-10-24-3 - PHP 8.4.0 RC3 available for testing
https://downloads.php.net/~saki - download page
https://github.com/php/php-src/blob/php-8.4.0RC3/NEWS - NEWS
https://github.com/php/php-src/blob/php-8.4.0RC3/UPGRADING - UPGRADING
https://gist.github.com/SakiTakamachi/3648bbdbfeadefb8e40b4ad9592d3b6a - the manifest
https://www.php.net/archive/2024.php#2024-10-24-2 - PHP 8.2.25 Released!
https://www.php.net/ChangeLog-8.php#8.2.25 - ChangeLog
https://www.php.net/archive/2024.php#2024-10-24-1 - PHP 8.3.13 Released!
https://www.php.net/ChangeLog-8.php#8.3.13 - ChangeLog
https://www.php.net/archive/2024.php#2024-10-10-1 - PHP 8.4.0 RC2 available for testing
https://github.com/php/php-src/blob/php-8.4.0RC2/NEWS - NEWS
https://github.com/php/php-src/blob/php-8.4.0RC2/UPGRADING - UPGRADING
https://gist.github.com/NattyNarwhal/ea2bb82ce1e3fb67f385b2d6e2e085dc - the manifest
https://www.php.net/archive/2024.php#2024-09-26-4 - PHP 8.1.30 Released!
https://www.php.net/ChangeLog-8.php#8.1.30 - ChangeLog
https://www.php.net/archive/2024.php#2024-09-26-3 - PHP 8.4.0 RC 1 now available for testing
https://github.com/php/php-src/blob/php-8.4.0RC1/NEWS - NEWS
https://github.com/php/php-src/blob/php-8.4.0RC1/UPGRADING - UPGRADING
https://gist.github.com/SakiTakamachi/198dac0514f8cba37ed3506526cd1fb2 - the manifest
https://www.php.net/archive/2024.php#2024-09-26-2 - PHP 8.2.24 Released!
https://www.php.net/ChangeLog-8.php#8.2.24 - ChangeLog
https://www.php.net/archive/2024.php#2024-09-26-1 - PHP 8.3.12 Released!
https://www.php.net/ChangeLog-8.php#8.3.12 - ChangeLog
https://www.php.net/archive/2024.php#2024-09-12-1 - PHP 8.4.0 Beta 5 available for testing
https://github.com/php/php-src/blob/php-8.4.0beta5/NEWS - NEWS
https://github.com/php/php-src/blob/php-8.4.0beta5/UPGRADING - UPGRADING
https://gist.github.com/NattyNarwhal/2dc0c8a3f7bf63ec5143b4bf703ee626 - the manifest
/archive/ - Older News Entries
https://thephp.foundation/ - The PHP Foundation
https://thephp.foundation/donate/ - Donate
/conferences - Upcoming conferences
https://www.php.net/conferences/index.php#2025-03-14-1 - PHP Conference Odawara 2025
https://www.php.net/conferences/index.php#2025-03-13-1 - PHPKonf 2025
https://www.php.net/conferences/index.php#2025-02-09-1 - Laravel Live Denmark 2025
https://www.php.net/conferences/index.php#2025-01-31-1 - PHP Velho Oeste 2025
/cal.php - User Group Events
/thanks.php - Special Thanks
https://twitter.com/official_php -
https://fosstodon.org/@php -
/copyright.php - Copyright © 2001-2025 The PHP Group
/my.php - My PHP.net
/contact.php - Contact
/sites.php - Other PHP.net sites
/privacy.php - Privacy policy
https://github.com/php/web-php/blob/master/index.php - View Source
javascript:; -