Wget is a popular command-line utility that can download files from the web. It’s part of the GNU Project and, as a result, commonly bundled with numerous Linux distributions.
This article will walk you through the step-by-step process of installing and downloading files using Wget with or without proxies, covering multiple scenarios and showcasing practical examples.
Wget is a free software package that retrieves files over the HTTP(S) and FTP(S) protocols. As part of the GNU Project, its full name is GNU Wget, and it can be written as either Wget or wget.
Wget can be downloaded from the official GNU channel and installed manually. However, we recommend using package managers. Package managers facilitate the installation and make future upgrades more convenient. Also, most Linux distributions are bundled with Wget.
To install Wget on Ubuntu/Debian, open the terminal and run the following command:
sudo apt-get install wget
To install Wget on CentOS/RHEL, open the terminal and run the following command:
sudo yum install wget
If you’re using macOS, we highly recommend using the Homebrew package manager. Open the terminal and run the following command:
brew install wget
If you’re using Windows, Chocolatey package manager is a good choice. When using Chocolatey, run the following command from the command line or PowerShell:
choco install wget
Lastly, to verify the installation of Wget, run the following command:
wget --version
This will print the installed version of Wget along with other related information.
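In a script, it can be useful to confirm that Wget is available before calling it. The following is a minimal sketch of such a check; the helper name have_cmd is a hypothetical choice and not part of Wget.

```shell
#!/bin/sh
# have_cmd is a hypothetical helper: it succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd wget; then
    # Print only the first line of the version banner.
    wget --version | head -n 1
else
    echo "wget is not installed; use one of the package managers above" >&2
fi
```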
The Wget command can be run from any command-line interface. In this tutorial, we’ll be using the terminal. To see the available options, open the terminal and enter the following:
wget -h
This will list all the options that can be used with the Wget command grouped in categories, such as Startup, Logging, Download, etc.
To download a single file, run Wget followed by the complete URL of the file. For example, the Wget2 source archive is located at https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz. To download this file, enter the following in the terminal:
wget https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz
Wget shows detailed information about the file being downloaded: a progress bar, the status of each step, the total file size, its MIME type, and so on.
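By default, Wget saves the file in the current directory under the name taken from the URL. The sketch below shows two common variations, assuming the --directory-prefix and --output-document options; the function names download_into and download_as are hypothetical.

```shell
# download_into is a hypothetical helper: save a file into a chosen
# directory, keeping the file name from the URL.
download_into() {
    dir=$1
    url=$2
    wget --directory-prefix="$dir" "$url"
}

# download_as is a hypothetical helper: save a file under a chosen local name.
download_as() {
    name=$1
    url=$2
    wget --output-document="$name" "$url"
}

# Example:
#   download_into downloads https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz
#   download_as wget2.tar.lz https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz
```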
Every program, including web browsers, sends certain headers when connecting to a web service. In this case, the User-Agent header is the most important as it contains a string that identifies the program.
To see how the User-Agent varies across applications, open https://httpbin.org/user-agent in the different browsers that you have installed.
To identify the User-Agent used by Wget, request this URL:
wget https://httpbin.org/user-agent
This command will download a file named user-agent without any extension. To view the contents of this file, use the cat command on macOS and Linux. On Windows, you can use the type command.
~$ cat user-agent
{
"user-agent": "Wget/1.21.2"
}
The default User-Agent can be modified using the --header option. The syntax is as follows:
wget --header "user-agent: DESIRED USER AGENT" URL-OF-FILE
The following example should clarify it further:
~$ wget --header "user-agent: Mozilla/5.0 (Macintosh)" https://httpbin.org/user-agent
~$ cat user-agent
{
"user-agent": "Mozilla/5.0 (Macintosh)"
}
As is evident here, the User-Agent has changed. If you wish to send any other header, you can add more --header options, each followed by a header in "HeaderName: HeaderValue" format.
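As a minimal sketch, the function below combines a custom User-Agent with one extra header and prints the response to the terminal. It uses the dedicated --user-agent option instead of --header for the User-Agent; the function name fetch_with_headers and the header values are placeholders chosen for illustration.

```shell
# fetch_with_headers is a hypothetical helper. It sends a custom User-Agent
# and one additional header, then prints the response body to the terminal.
fetch_with_headers() {
    url=$1
    wget --quiet --output-document=- \
        --user-agent="Mozilla/5.0 (Macintosh)" \
        --header="Accept-Language: en-US" \
        "$url"
}

# Example: fetch_with_headers https://httpbin.org/headers
```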
There are two methods for downloading multiple files using Wget. The first method is to send all the URLs to Wget separated with a space. For example, the following command will download files from all three URLs:
~$ wget http://example.com/file1.zip http://example.com/file2.zip http://example.com/file3.zip
If you wish to try a real example, use the following command:
~$ wget https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz https://ftp.gnu.org/gnu/wget/wget2-1.99.2.tar.lz
The command will download both files one at a time.
This method works well when the number of files is limited. It can become difficult to manage as the number of files grows, making the second method more useful.
The second method is to write all the URLs in a file and use the -i or --input-file option. For example, to read the URLs from the urls.txt file, run any of the following commands:
~$ wget --input-file=urls.txt
~$ wget -i urls.txt
The best part of this option is that if any of the URLs don’t work, Wget will continue and download the rest of the functional URLs.
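The input file is plain text with one URL per line. A minimal sketch of creating it from the shell, reusing the two URLs from the earlier example, could look like this; the wrapper name download_list is hypothetical.

```shell
# Write the two URLs used earlier into urls.txt, one per line.
cat > urls.txt <<'EOF'
https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz
https://ftp.gnu.org/gnu/wget/wget2-1.99.2.tar.lz
EOF

# Count the non-empty lines to confirm how many URLs will be requested.
url_count=$(grep -c . urls.txt)
echo "urls.txt lists $url_count URLs"

# download_list is a hypothetical wrapper around the --input-file option.
download_list() {
    wget --input-file="$1"
}

# Example: download_list urls.txt
```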
The --input-file option of the Wget command can be expanded to extract links from a webpage.
In its simplest form, you can supply a URL that contains the links to the files. For example, this page contains links to downloadable content of Wget. To download all files from this URL, run the following:
~$ wget --input-file=https://ftp.gnu.org/gnu/wget
However, this command won’t be particularly useful without any further customization. There are multiple reasons for that.
By default, Wget does not overwrite existing files. If a download results in overwriting a file, it’ll create a new file by appending a numerical suffix. It means that for every instance of a compressed.gif file, it’ll create new files with names such as compressed.gif, compressed.gif.1, compressed.gif.2, and so on.
This behavior can be modified by specifying the --no-clobber switch to skip duplicate files.
Next, you may want to download the files recursively by specifying the --recursive switch.
You may also want to skip certain files by passing their extensions as a comma-separated list to the --reject switch.
Conversely, you may want to download only certain files and ignore everything else by using the --accept switch, which also takes a comma-separated list of extensions.
Some other useful switches are --no-directories and --no-parent. These two ensure that no directories are created, and the Wget command doesn’t traverse to a parent directory.
For example, to download all files with the .sig extension, use the following command:
~$ wget --recursive --no-parent --no-directories --no-clobber --accept=sig --input-file=https://ftp.gnu.org/gnu/wget
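The same switches can be wrapped in small functions so that the extension list and the page become parameters. This is only a sketch built from the switches described above; the function names fetch_by_extension and fetch_except are hypothetical.

```shell
# fetch_by_extension is a hypothetical helper: download only the files with
# the given extension(s) that are linked from the given page.
fetch_by_extension() {
    exts=$1
    page=$2
    wget --recursive --no-parent --no-directories --no-clobber \
        --accept="$exts" --input-file="$page"
}

# fetch_except is a hypothetical helper: download everything linked from the
# page except files with the given extension(s).
fetch_except() {
    exts=$1
    page=$2
    wget --recursive --no-parent --no-directories --no-clobber \
        --reject="$exts" --input-file="$page"
}

# Example: fetch_by_extension sig https://ftp.gnu.org/gnu/wget
```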
There are two methods for Wget proxy server integration. The first method uses command line switches to specify the proxy server and authentication details.
The easiest way to verify that a proxy is working is to check your IP address before and after specifying the proxy server. To check your current IP address, run the following commands:
~$ wget https://ip.oxylabs.io/location
#output of wget here
~$ cat location
11.22.33.44 #prints actual IP address
The first command saves the response, which contains the IP address, to a file named location (Wget names the file after the last part of the URL). The cat command (or the type command on Windows) prints the file contents.
The same result can be achieved by running Wget in quiet mode and redirecting the output to the terminal instead of downloading the file:
~$ wget --quiet --output-document=- https://ip.oxylabs.io/location
The shorter version of the same command is as follows:
~$ wget -q -O - https://ip.oxylabs.io/location
To use a proxy that doesn’t require authentication, pass two -e (or --execute) switches. The first enables the proxy, and the second specifies the proxy server’s address.
The following command enables the proxy and specifies the proxy server’s IP 12.13.14.15 and port 1234. Because the target URL uses HTTPS, the proxy is set with https_proxy; for plain HTTP URLs, use http_proxy instead:
~$ wget -q -O- -e use_proxy=yes -e https_proxy=12.13.14.15:1234 https://ip.oxylabs.io/location
12.13.14.15
In the example above, the proxy doesn’t require authentication. If the proxy server requires user authentication, set the proxy username with the --proxy-user switch and the proxy password with the --proxy-password switch:
~$ wget -q -O- -e use_proxy=yes -e https_proxy=12.13.14.15:1234 --proxy-user=your_username --proxy-password=your_password https://ip.oxylabs.io/location
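Since the command is long, it can be wrapped in a function for occasional use. Wget reads http_proxy for HTTP URLs and https_proxy for HTTPS URLs, so this sketch sets both to the same server. The function name fetch_via_proxy is hypothetical, and the proxy address is the placeholder used throughout this article.

```shell
# fetch_via_proxy is a hypothetical helper. The first argument is the proxy
# in host:port form, for example 12.13.14.15:1234.
fetch_via_proxy() {
    proxy=$1
    url=$2
    wget --quiet --output-document=- \
        -e use_proxy=yes \
        -e http_proxy="$proxy" \
        -e https_proxy="$proxy" \
        "$url"
}

# Example: fetch_via_proxy 12.13.14.15:1234 https://ip.oxylabs.io/location
```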
As evident here, the command is quite long. However, it’s useful when you don’t want to use a proxy all the time.
The second method is to use the .wgetrc configuration file. This file can store proxy configuration, which Wget then reads.
The configuration file is located in the user’s home directory and is named .wgetrc. Alternatively, you can use any file as the configuration file by using the --config switch.
In the ~/.wgetrc file, enter the following lines:
use_proxy = on
http_proxy = http://12.13.14.15:1234
If you also need to set user authentication for the proxy, modify the file as follows:
use_proxy = on
http_proxy = http://your_username:your_password@12.13.14.15:1234
From now on, every time Wget runs, it’ll use the specified proxy.
~$ wget -q -O- http://httpbin.org/ip
# Prints IP of the proxy server
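To keep proxy settings out of ~/.wgetrc, the same lines can go into a separate file that is passed with the --config switch. The sketch below assumes the placeholder proxy address used above and adds an https_proxy line so that HTTPS URLs are proxied too. The file name wget-proxy.conf and the function name fetch_with_config are arbitrary, hypothetical choices.

```shell
# Write the proxy settings to a separate configuration file.
cat > wget-proxy.conf <<'EOF'
use_proxy = on
http_proxy = http://12.13.14.15:1234
https_proxy = http://12.13.14.15:1234
EOF

# fetch_with_config is a hypothetical wrapper: it makes Wget read the
# settings from wget-proxy.conf and prints the response to the terminal.
fetch_with_config() {
    wget --config=wget-proxy.conf --quiet --output-document=- "$1"
}

# Example: fetch_with_config http://httpbin.org/ip
```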
Proxies can also be set with environment variables such as http_proxy and https_proxy. However, these variables aren’t specific to Wget: every program that honors them will route its traffic through the proxy as well, which makes this approach less suitable for the task at hand.
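If environment variables are still preferred, their effect can be limited to a single command rather than the whole session. The following sketch uses the standard env utility for that; the function name with_proxy is hypothetical.

```shell
# with_proxy is a hypothetical helper: it runs one command with the proxy
# variables set for that command only, leaving the current shell untouched.
with_proxy() {
    proxy_url=$1
    shift
    env http_proxy="$proxy_url" https_proxy="$proxy_url" "$@"
}

# Example:
#   with_proxy http://12.13.14.15:1234 wget -q -O- https://ip.oxylabs.io/location
```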
cURL (also written curl) is another free, open-source command-line tool for downloading files.
cURL and Wget share many similarities, but important differences make each tool better suited to particular tasks.
First, let’s take a quick look at the similarities. Both options:
Are open-source, command-line tools for downloading content from HTTP(S) and FTP(S)
Can send HTTP GET and POST requests
Support cookies
Are designed to run without user interaction, for example in scripts or in the background
The following features are only available in cURL:
Available as a library
Support for more protocols beyond HTTP and FTP
Better SSL support
More HTTP authentication methods
Support for SOCKS proxies
Better support for HTTP POST
Nonetheless, Wget has its advantages as well:
Supports recursive downloads. This is the most prominent advantage, allowing you to download files recursively using the --mirror switch and create a local copy of a website.
Can resume interrupted downloads
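As a brief sketch of these two Wget features, the functions below use the --mirror switch mentioned above and the --continue switch, which asks Wget to pick up a partially downloaded file. The function names mirror_site and resume_download are hypothetical.

```shell
# mirror_site is a hypothetical helper: create a local copy of a site
# section without climbing to the parent directory.
mirror_site() {
    wget --mirror --no-parent "$1"
}

# resume_download is a hypothetical helper: continue a partially
# downloaded file instead of starting over.
resume_download() {
    wget --continue "$1"
}

# Example: resume_download https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz
```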
For more information about what cURL is, see our dedicated article. If you want to read about the differences in detail, see the cURL comparison table.
The differences listed above should help you figure out the more suitable tool for a particular scenario. For example, if you want recursive downloads, choose Wget. If you require SOCKS proxy support, pick cURL.
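For the SOCKS scenario, a cURL call could look roughly like the sketch below. It assumes cURL’s --socks5 option with a proxy given as host:port; the function name fetch_via_socks and the proxy address are placeholders.

```shell
# fetch_via_socks is a hypothetical helper: fetch a URL with cURL through a
# SOCKS5 proxy given as host:port and print the response to the terminal.
fetch_via_socks() {
    socks_proxy=$1
    url=$2
    curl --silent --socks5 "$socks_proxy" "$url"
}

# Example: fetch_via_socks 12.13.14.15:1080 https://ip.oxylabs.io/location
```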
Neither tool is decisively better than the other. Select the one that is suitable for your specific scenario at a given moment.
This article detailed how to use Wget, from installation and downloading single or multiple files to the methods of using proxy servers. Lastly, the comparison between cURL and Wget outlined how the tools differ in functionality and use cases.
If you want to find out more about how proxies and advanced public data acquisition tools work or about specific web scraping use cases, such as web scraping job postings or building a Python web scraper, check out our blog. And if, after reading this article, you're thinking of using proxies with Wget, check out our Residential Proxies and Datacenter Proxies.
About the author
Augustas Pelakauskas
Senior Copywriter
Augustas Pelakauskas is a Senior Copywriter at Oxylabs. Coming from an artistic background, he is deeply invested in various creative ventures - the most recent one being writing. After testing his abilities in the field of freelance journalism, he transitioned to tech content creation. When at ease, he enjoys sunny outdoors and active recreation. As it turns out, his bicycle is his fourth best friend.