How can i use threads and queue for concurrent download of URL links


Hello, I'm trying to create a simple program in Python 3 that uses threads and a queue to download images concurrently from URL links, using 4 or more threads to download 4 images at the same time into the PC's downloads folder, while avoiding duplicates by sharing information between the threads.
I suppose I could use something like URL1 = “Link1”?
Here are some examples of links.

https://unab-dw2018.s3.amazonaws.com/ldp2019/1.jpeg

https://unab-dw2018.s3.amazonaws.com/ldp2019/2.jpeg

The code should be simple: as I understand it, it only needs to use threads with a queue to download 4 images simultaneously and avoid duplicates by sharing information between the threads.

What code have you written so far to attempt this?

Some others have helped me with this code, but I don't get why it's not working…
import threading
import queue
import urllib.request


class picture_getter(threading.Thread):
    def __init__(self, url, picture_queue):
        self.url = url
        self.picture_queue = picture_queue
        super(picture_getter, self).__init__()

    def run(self):
        print("Starting download on " + str(self.url))
        self._get_picture()

    def _get_picture(self):
        # --- get your picture --- #
        picture = urllib.request.urlopen(self.url).read()
        self.picture_queue.put(picture)


if __name__ == "__main__":
    picture_queue = queue.Queue(maxsize=0)
    picture_threads = []
    picture_urls = ["https://unab-dw2018.s3.amazonaws.com/ldp2019/1.jpeg",
                    "https://unab-dw2018.s3.amazonaws.com/ldp2019/2.jpeg"]

    # create and start the threads
    for url in picture_urls:
        thread = picture_getter(url, picture_queue)
        picture_threads.append(thread)
        thread.start()

    # wait for threads to finish
    for picture_thread in picture_threads:
        picture_thread.join()

    # get the results
    picture_list = []
    while not picture_queue.empty():
        picture_list.append(picture_queue.get())

I posted the code in the previous comment.

Here is some code that works partially.
What I need is for the program to ask how many threads you want and then download images until it reaches image 20. But with this code, if I put 5, it will only download 5 images, and so on. The idea is that if I put 5, it should download 5 images first, then the next 5, and so on until 20. If it's 4 images, then 4, 4, 4, 4, 4; if it's 6, it should go 6, 6, 6 and then download the remaining 2.
Somehow I must implement a queue in the code, but I only learned threads a few days ago and I'm lost on how to mix threads and queues together.

import threading
import urllib.request

def worker(cont):
    print("The worker is ON", cont)
    image_download = "https://unab-dw2018.s3.amazonaws.com/ldp2019/" + str(cont) + ".jpeg"
    download = urllib.request.urlopen(image_download)
    # write the image bytes to a file; the with-block closes it automatically
    with open("Image " + str(cont) + ".jpeg", "wb") as file_save:
        file_save.write(download.read())
    return cont + 1

threads = []
q_threads = int(input("Choose input amount of threads between 4 and 20"))
for i in range(0, q_threads):
    h = threading.Thread(target=worker, args=(i + 1,))
    threads.append(h)
for i in range(0, q_threads):
    threads[i].start()

That's what the queue is for. Your workers shouldn't exit after a single download; each one should continuously pull urls from the queue in a loop, and exit that loop only once the queue is empty.

Mind you this approach won’t work as well if you’re adding urls to the queue while the workers are downloading, but it works fine if you have all the urls up front.

Sorry, but I feel that I both understand and don't; I'm not sure what you mean…
So, I set up the queue and put those URLs from 1 to 20 into it somehow? Then have the workers pick items off the queue, according to the number of threads I have?… And by loop, do you mean the for i in range? I'm new to Python, with only a few months since I started.
I wonder if you could help me solve this with an example?

Here’s some pseudocode, since I don’t feel like typing in all the working code:

from queue import Queue, Empty
from threading import Thread

URLS = Queue()

def worker():
    while True:
        try:
            url = URLS.get(False)  # non-blocking; raises Empty when the queue is drained
            download(url)
        except Empty:
            return

def main():
    for url in some_list_of_urls:
        URLS.put(url)
    threads = int(input("how many threads?"))
    for i in range(threads):
        Thread(target=worker).start()

Again, this is unsophisticated and requires you to add all the urls up front, but it should give you the idea. If you're new to Python, it's not really a good idea to mess with threads directly; rather, use asyncio or some abstraction over threads such as ThreadPoolExecutor. As you get more experienced, you'll find you definitely don't want to mess with threads directly.
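For comparison, here's a hedged sketch of the same job using ThreadPoolExecutor from the standard library. The `fetch` function is a hypothetical stand-in for the real download logic; the pool replaces the hand-rolled queue-and-workers machinery entirely:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the real download logic.
def fetch(url):
    return "downloaded " + url

urls = ["https://unab-dw2018.s3.amazonaws.com/ldp2019/%d.jpeg" % n
        for n in range(1, 21)]

# The pool keeps up to 5 downloads in flight and feeds each idle
# thread the next URL automatically -- no manual queue management.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fetch, urls))

print(len(results))  # 20
```

`pool.map` returns results in the same order as the input list, even though the downloads run concurrently.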

For this problem, assuming it’s a real-world one and not an exercise, I’d encourage you to look into scrapy instead of doing it by hand like this.


Thanks, it's very late here, so I will try this tomorrow. And why don't I want to mess with threads? Can I destroy my PC by mistake or something?
Also, while I was looking at the queue part, it seemed that I need some synchronization between the threads to avoid duplicates; that's what I got from it.
Also, it's not a problem to add all the URLs up front (I guess you mean put all the URLs in the code), since I can do that.
Lastly, this pseudocode is only using queue; will it work? Or do I need to add it to my code or create a new one?
The thing I want to do is use both threading and queue.
Cheers!

You don’t want to mess with threads via the Threading library because threads are extremely tricky to manage correctly. Because you have code running at the exact same time as other code, you run into problems called “race conditions” that are extremely difficult to debug. Modules like queue are there to help tame threads, but they’re still low-level primitives that aren’t all that fun to work with.
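To make "race condition" concrete, here's a small sketch (names and counts are illustrative): two threads increment two counters, one bare and one protected by a lock. The bare one can lose updates, because `+= 1` is really a read, an add, and a write, and the threads can interleave between those steps:

```python
import threading

N = 100000
unsafe = 0   # updated without any synchronization
safe = 0     # updated while holding a lock
lock = threading.Lock()

def worker():
    global unsafe, safe
    for _ in range(N):
        unsafe += 1      # read-modify-write: the other thread can sneak in between
        with lock:
            safe += 1    # the lock makes the whole update atomic

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(safe)    # always 200000
print(unsafe)  # may be less than 200000: updates lost to the race
```

Whether `unsafe` actually comes up short depends on how the interpreter happens to schedule the threads, which is exactly what makes these bugs so hard to debug.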

The code I gave you is just the structure of a full working app. I’ve edited it to show the Queue.put method. Just fill in the parts for download and you’re all set.

As for avoiding duplicates, it looks like all your urls are unique already. Due to the use of the queue, the threads won’t attempt to download the same URL.

Ohhh… I thought that to work with threads I needed to import threading like I did in my code.
With just your pseudocode, can I concurrently download, let's say, 5 images at the same time, followed by another 5 until I get to 20?

Sorry, you do have to import the threading library too; that's where Thread comes from. What this code does is always be downloading 5 at a time (assuming you input 5), rather than doing them in bursts of 5 like your old code did.
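Tying the whole thread together, here's a hedged sketch of the combined threading-plus-queue pattern: all 20 URLs go into the queue up front, and N worker threads loop, each pulling the next URL until the queue runs dry. So that the sketch runs without network access, `download` here only records the URL; in a real program you'd replace its body with the `urllib.request` code from the earlier post, and `num_threads` would come from `input`:

```python
import threading
from queue import Queue, Empty

url_queue = Queue()
for n in range(1, 21):
    url_queue.put("https://unab-dw2018.s3.amazonaws.com/ldp2019/%d.jpeg" % n)

downloaded = []
downloaded_lock = threading.Lock()

def download(url):
    # Placeholder: really you'd urlopen(url) and write the bytes to a file.
    with downloaded_lock:
        downloaded.append(url)

def worker():
    while True:
        try:
            url = url_queue.get(False)  # next URL, or Empty when drained
        except Empty:
            return                      # nothing left: this worker is done
        download(url)

num_threads = 5  # stands in for int(input("how many threads?"))
threads = [threading.Thread(target=worker) for _ in range(num_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(downloaded), "handled, duplicates:",
      len(downloaded) - len(set(downloaded)))
```

Because each `get` removes its URL from the queue for good, every image is handled exactly once no matter how many workers are running, which is the duplicate-avoidance the queue buys you.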

BTW, you don’t have to delete your replies if you want to make an edited version. Just click the three dots below your post then the pencil icon and it will let you edit it.