Scraping websites

#1

I have a project in mind and zero programming experience. I’d like to scrape websites for real estate listings, collecting data points from each listing and producing a daily output of what is new. Where would I want to start in developing a project to accomplish this? Just looking for a very high-level outline so I can find a starting point.

Thx in advance, John.


#2

If you have zero programming experience, I’d consider learning how to program as a good first step.

I personally would recommend learning Python. Why Python? It is known for being very approachable and a great first language thanks to its clear syntax and flexibility.

The other key reason I would recommend Python is that it has a huge number of libraries that would help you on this sort of project.

  • Libraries for web scraping? Yup: Beautiful Soup.
  • Daily outputs (I assume to a text file)? Yup: a cron job, a website, or something similar.
  • Data analysis? Yup: Python is known for being great for data science. If you want Python to do more than hand you the data, it can easily be used to get information out of it.
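To give a rough idea of what the first two bullets could look like in practice, here is a minimal sketch using Beautiful Soup. Everything specific in it is an assumption for illustration: the sample HTML, the `listing`, `address`, and `price` class names, the `data-id` attribute, and the helper function names are all made up, and a real site will be structured differently. It parses listings out of a page and keeps only the ones not seen on an earlier run.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical sample markup; a real site will use different tags and class names.
SAMPLE_HTML = """
<ul>
  <li class="listing" data-id="101">
    <span class="address">12 Oak St</span><span class="price">$250,000</span>
  </li>
  <li class="listing" data-id="102">
    <span class="address">98 Elm Ave</span><span class="price">$310,000</span>
  </li>
</ul>
"""


def parse_listings(html):
    """Return one dict per listing found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    listings = []
    for item in soup.select("li.listing"):
        listings.append({
            "id": item["data-id"],
            "address": item.select_one(".address").get_text(strip=True),
            "price": item.select_one(".price").get_text(strip=True),
        })
    return listings


def new_listings(listings, seen_ids):
    """Keep only listings whose id was not seen on an earlier run."""
    return [entry for entry in listings if entry["id"] not in seen_ids]


def format_report(listings):
    """Build the plain-text daily output, one listing per line."""
    return "\n".join(
        f'{entry["id"]}: {entry["address"]} ({entry["price"]})' for entry in listings
    )


if __name__ == "__main__":
    seen = {"101"}  # ids remembered from a previous day (assumed)
    print(format_report(new_listings(parse_listings(SAMPLE_HTML), seen)))
```

In a real project the HTML would come from a site you have permission to fetch, the set of seen ids would be saved between runs, and a scheduler such as cron would run the script once a day.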

You can obviously do this in other languages (like Node.js or Java), but in terms of getting started, getting going fast, and staying focused on your current requirements (zero programming experience, web scraping, collecting data, and using said data), Python is one of the best choices, and a very popular one too.

Good luck :smiley:


#3

I have a better idea. Instead of scraping copyrighted material from someone else’s page to use for your own purposes, get in touch with a local Multiple Listing Service (MLS) which owns the listings and find out about IDX and RETS. This would allow you direct access to the data when developing a site for a real estate agency or agent in a specific area.

Serving real estate listings on websites is the core part of my business, and my sites are designed to make it difficult for scrapers to steal the data like you are hoping to do. There are perfectly legitimate ways of doing what you want, but you are going to need to learn about backend technologies such as databases (SQL or NoSQL based) and one of many backend languages (JavaScript using Node, Python, Ruby, PHP, or many others).
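As a small illustration of the database side, here is a minimal sketch using SQLite through Python's built-in `sqlite3` module. The table layout, column names, and function names are assumptions made up for this example, not a real MLS or RETS schema. It stores each listing with the date it was first seen, so that pulling out a day's new listings is a single query.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) listings table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings ("
        " id TEXT PRIMARY KEY, address TEXT, price TEXT, first_seen TEXT)"
    )
    return conn


def record_listings(conn, listings, day):
    """Insert listings; rows whose id already exists are left untouched."""
    conn.executemany(
        "INSERT OR IGNORE INTO listings (id, address, price, first_seen)"
        " VALUES (?, ?, ?, ?)",
        [(entry["id"], entry["address"], entry["price"], day) for entry in listings],
    )
    conn.commit()


def first_seen_on(conn, day):
    """Return (id, address, price) for listings first recorded on the given day."""
    rows = conn.execute(
        "SELECT id, address, price FROM listings WHERE first_seen = ? ORDER BY id",
        (day,),
    )
    return rows.fetchall()
```

Passing a file path instead of `":memory:"` keeps the data between runs, which is what makes a day-over-day comparison possible.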

It is easy to detect a scraper and easy to find out who is doing the scraping with a little investigative work.


#4

Brad, thanks. This is what I was hoping to receive as feedback. It’s a helpful outline and nudge in a particular direction. I am interested to see if others concur.:fu:


#5

Randel, I appreciate the feedback. It sounds like you’re an expert on MLS listings and developing listing websites for brokers. However, before you imply that my intended gathering of information is an illegitimate effort, perhaps you could ask me to clarify my intended use? I have no interest in MLS properties or the brokering of said properties or facilitating any broker or agent in their sales activities. There are many types of real estate and parties with an interest at any given time. Don’t be so presumptuous and cast my interest and intentions as anything other than legitimate in a public forum without ascertaining contextual facts.


#6

You may not be aware, but in some states scraping can still be considered illegal activity regardless of your intent. If a website has terms of service that specifically prohibit scraping, then you are in violation.

After reading my post again, I can see it could be read as making certain assumptions about your intent. I apologize for that, but you still need to be aware that your actions could have dire legal consequences if you are not careful about which sites you scrape. Again, my point is that it can be considered illegal if you do not have permission, regardless of what you plan to do with the data.

Here is a good article on this subject for you to review.

I have some documentation on accessing real estate data via PHP and RETS that I can send you via private message if you are interested.