How can i use threads and queue for concurrent download of URL links


Hello, I'm trying to create a simple program in Python 3 that uses threads and a queue to download images concurrently from URL links, using 4 or more threads to download 4 images at the same time into the PC's downloads folder, while avoiding duplicates by sharing information between the threads.
I suppose I could use something like URL1 = “Link1”?
Here are some examples of links.

https://unab-dw2018.s3.amazonaws.com/ldp2019/1.jpeg

https://unab-dw2018.s3.amazonaws.com/ldp2019/2.jpeg

The code should be simple: as I understand it, it only needs to use threads with a queue to download 4 images simultaneously and avoid duplicates by sharing information between the threads.

What code have you written so far to attempt this?

Some others have helped me with this code, but I don't get why it's not working…
import threading
import queue
import urllib.request


class picture_getter(threading.Thread):
    def __init__(self, url, picture_queue):
        self.url = url
        self.picture_queue = picture_queue
        super(picture_getter, self).__init__()

    def run(self):
        print("Starting download on " + str(self.url))
        self._get_picture()

    def _get_picture(self):
        # --- get your picture --- #
        picture = urllib.request.urlopen(self.url).read()
        self.picture_queue.put(picture)


if __name__ == "__main__":
    picture_queue = queue.Queue(maxsize=0)
    picture_threads = []
    picture_urls = ["https://unab-dw2018.s3.amazonaws.com/ldp2019/1.jpeg",
                    "https://unab-dw2018.s3.amazonaws.com/ldp2019/2.jpeg"]

    # create and start the threads
    for url in picture_urls:
        thread = picture_getter(url, picture_queue)
        picture_threads.append(thread)
        thread.start()

    # wait for threads to finish
    for picture_thread in picture_threads:
        picture_thread.join()

    # get the results
    picture_list = []
    while not picture_queue.empty():
        picture_list.append(picture_queue.get())

I posted the code in the previous comment.

Here is some code that works partially.
What I need is for the program to ask how many threads you want and then download images until it reaches image 20. But with this code, if I put 5, it will only download 5 images, and so on. The idea is that if I put 5, it should download 5 images first, then the next 5, and so on until 20. If it's 4 images, then 4, 4, 4, 4, 4; if it's 6, it should go 6, 6, 6 and then download the remaining 2.
Somehow I must implement a queue in the code, but I only learned threads a few days ago and I'm lost on how to mix threads and queues together.

import threading
import urllib.request

def worker(cont):
    print("The worker is ON", cont)
    image_download = "https://unab-dw2018.s3.amazonaws.com/ldp2019/" + str(cont) + ".jpeg"
    download = urllib.request.urlopen(image_download)
    # write the image bytes to a file; the with-block closes it automatically
    with open("Image " + str(cont) + ".jpeg", "wb") as file_save:
        file_save.write(download.read())
    return cont + 1

threads = []
q_threads = int(input("Choose input amount of threads between 4 and 20"))
for i in range(0, q_threads):
    h = threading.Thread(target=worker, args=(i + 1,))
    threads.append(h)
for i in range(0, q_threads):
    threads[i].start()

That's what the queue is for. Your workers shouldn't exit after a single download; each one should continuously pull urls from the queue in a loop, and exit that loop only once the queue is empty.

Mind you this approach won’t work as well if you’re adding urls to the queue while the workers are downloading, but it works fine if you have all the urls up front.

Sorry, but I feel that I both understand and don't; I'm not sure what you mean…
So, I set up the queue and put those URLs from 1 to 20 into it somehow? Then have the workers pick items off the queue, according to the number of threads I have?… And by loop, do you mean the for i in range? I'm new to Python, with only a few months since I started.
I wonder if you could help me solve this with an example?

Here’s some pseudocode, since I don’t feel like typing in all the working code:

from queue import Queue, Empty
from threading import Thread

URLS = Queue()

def worker():
    while True:
        try:
            url = URLS.get(False)  # non-blocking; raises Empty when the queue is drained
            download(url)
        except Empty:
            return

def main():
    for url in some_list_of_urls:
        URLS.put(url)
    threads = int(input("how many threads?"))
    for i in range(threads):
        Thread(target=worker).start()

Again, this is unsophisticated and requires you to add all the urls up front, but it should give you the idea. If you're new to Python, it's not really a good idea to mess with threads directly; rather, use asyncio or some abstraction over threads such as ThreadPoolExecutor. As you get more experienced, you'll find you definitely don't want to mess with threads directly.
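For comparison, here's a hedged sketch of the same job using ThreadPoolExecutor from the standard library. The `fetch` function is a hypothetical stand-in for the real download logic; the pool replaces the hand-rolled queue-and-workers machinery entirely:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the real download logic.
def fetch(url):
    return "downloaded " + url

urls = ["https://unab-dw2018.s3.amazonaws.com/ldp2019/%d.jpeg" % n
        for n in range(1, 21)]

# The pool keeps up to 5 downloads in flight and feeds each idle
# thread the next URL automatically -- no manual queue management.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fetch, urls))

print(len(results))  # 20
```

`pool.map` returns results in the same order as the input list, even though the downloads run concurrently.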

For this problem, assuming it’s a real-world one and not an exercise, I’d encourage you to look into scrapy instead of doing it by hand like this.


Thanks, it's very late here, so I will try this tomorrow. And why don't I want to mess with threads? Can I destroy my PC by mistake or something?
Also, while I was looking at the queue part, it seemed that I need some synchronization between the threads to avoid duplicates; that's what I got from it.
Also, it's not a problem to add all the URLs up front (I guess you mean put all the URLs in the code), since I can do that.
Lastly, this pseudocode is only using queue; will it work? Or do I need to add it to my code or create a new one?
The thing I want to do is use both threading and queue.
Cheers!

You don’t want to mess with threads via the Threading library because threads are extremely tricky to manage correctly. Because you have code running at the exact same time as other code, you run into problems called “race conditions” that are extremely difficult to debug. Modules like queue are there to help tame threads, but they’re still low-level primitives that aren’t all that fun to work with.
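To make "race condition" concrete, here's a small sketch (names and counts are illustrative): two threads increment two counters, one bare and one protected by a lock. The bare one can lose updates, because `+= 1` is really a read, an add, and a write, and the threads can interleave between those steps:

```python
import threading

N = 100000
unsafe = 0   # updated without any synchronization
safe = 0     # updated while holding a lock
lock = threading.Lock()

def worker():
    global unsafe, safe
    for _ in range(N):
        unsafe += 1      # read-modify-write: the other thread can sneak in between
        with lock:
            safe += 1    # the lock makes the whole update atomic

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(safe)    # always 200000
print(unsafe)  # may be less than 200000: updates lost to the race
```

Whether `unsafe` actually comes up short depends on how the interpreter happens to schedule the threads, which is exactly what makes these bugs so hard to debug.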

The code I gave you is just the structure of a full working app. I’ve edited it to show the Queue.put method. Just fill in the parts for download and you’re all set.

As for avoiding duplicates, it looks like all your urls are unique already. Due to the use of the queue, the threads won’t attempt to download the same URL.

Ohhh… I thought that to work with threads I needed to import threading like I did in my code.
With just your pseudocode, can I concurrently download, let's say, 5 images at the same time, followed by another 5 until I get to 20?

Sorry, you do have to import the threading library too; that's where Thread comes from. What this code does is always be downloading 5 at a time (assuming you input 5), rather than doing them in bursts of 5 like your old code did.
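Tying the whole thread together, here's a hedged sketch of the combined threading-plus-queue pattern: all 20 URLs go into the queue up front, and N worker threads loop, each pulling the next URL until the queue runs dry. So that the sketch runs without network access, `download` here only records the URL; in a real program you'd replace its body with the `urllib.request` code from the earlier post, and `num_threads` would come from `input`:

```python
import threading
from queue import Queue, Empty

url_queue = Queue()
for n in range(1, 21):
    url_queue.put("https://unab-dw2018.s3.amazonaws.com/ldp2019/%d.jpeg" % n)

downloaded = []
downloaded_lock = threading.Lock()

def download(url):
    # Placeholder: really you'd urlopen(url) and write the bytes to a file.
    with downloaded_lock:
        downloaded.append(url)

def worker():
    while True:
        try:
            url = url_queue.get(False)  # next URL, or Empty when drained
        except Empty:
            return                      # nothing left: this worker is done
        download(url)

num_threads = 5  # stands in for int(input("how many threads?"))
threads = [threading.Thread(target=worker) for _ in range(num_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(downloaded), "handled, duplicates:",
      len(downloaded) - len(set(downloaded)))
```

Because each `get` removes its URL from the queue for good, every image is handled exactly once no matter how many workers are running, which is the duplicate-avoidance the queue buys you.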

BTW, you don’t have to delete your replies if you want to make an edited version. Just click the three dots below your post then the pencil icon and it will let you edit it.