I was recently told about the job ad below, and I’m a bit confused about exactly what the advertiser is hoping to achieve or, more to the point, why.
What we need:
A data extraction guru. Someone who has expertise in screen scraping and parsing, from static pages and dynamic forms. We need someone with the ability to (a) dynamically navigate to the “account” pages of an airline website, and (b) extract the data we need and save the text into a file on our local directory…
What we will give you:
Login and password to a travel website. Details of website to be scraped will be given to winning project bidder.
What we need:
We need a script that can login at an airline website and pull user data. The script will then need to navigate to the “account” pages on the site – the pages that include key data like name, # of recent trips, etc. Then the script should scrape the page and output the data into our local directory, as simple text dump of those HTML pages.
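For what it’s worth, the last step the ad describes (a “simple text dump of those HTML pages”) is the easy part. Here is a minimal sketch in Python using only the standard library, assuming the page HTML has already been retrieved by some other means. The sample markup and the names `TextDumper`, `html_to_text` and `save_text_dump` are my own inventions for illustration, not anything taken from the ad.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextDumper(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []       # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML document, one fragment per line."""
    parser = TextDumper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def save_text_dump(html, path):
    """Write the text dump of `html` to a local file at `path`."""
    Path(path).write_text(html_to_text(html), encoding="utf-8")


# Hypothetical account-page markup, invented for this example.
SAMPLE_PAGE = """
<html><head><style>p { color: red; }</style></head>
<body><h1>My Account</h1>
<p>Name: J. Smith</p><p>Recent trips: 3</p>
<script>var tracking = 1;</script></body></html>
"""

print(html_to_text(SAMPLE_PAGE))
```

The hard part of the job, and the part that raises the questions below, is everything that comes before this step.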
When I first started working with airline websites five years ago, a big concern was metasearch companies screen scraping airline sites and running up large bills for the airlines through excessive page views and excessive faring and availability transactions. IT vendors have since become much better at blocking unwanted traffic, and metasearch engines are now far more likely to get their data by subscribing to a product like Meta Pricer than by scraping the sites of the very airlines they hope to earn referral fees from.
But the job ad above says they are looking for someone to “login at an airline website and pull user data.” I wonder whether it could be a service where people forward their itineraries and the company then uses the record locator to access the booking and extract other data from the airline’s “manage my booking” page.
I recall a panic from maybe a year ago, when there was evidence that a search engine spider was indexing sensitive customer data. In the end it turned out that there had been no security breach on the part of the vendor or the airline: the passenger had put a link to his flight itinerary on his personal webpage. Because this link contained both his surname and his PNR record locator, the search engine could follow it and index everything on the airline’s booking servicing page for that passenger. You can be as security-conscious as you want, but nothing will stop a passenger from engaging in this type of behaviour.
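To make the mechanics concrete: a link of that kind works like a password written on a postcard, because everything needed to open the booking sits in the URL itself. Below is a rough Python sketch. The URL and the parameter names (`surname`, `pnr`) are invented for illustration, and real booking sites will use their own.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical parameter names; real booking sites will differ.
SENSITIVE_PARAMS = {"surname", "pnr"}


def leaked_credentials(url):
    """Return the sensitive query parameters a URL exposes to anyone who sees it."""
    query = parse_qs(urlsplit(url).query)
    return {
        name: values[0]
        for name, values in query.items()
        if name.lower() in SENSITIVE_PARAMS
    }


# Invented example of a link a passenger might paste onto a personal page.
link = "https://airline.example/manage-booking?surname=SMITH&pnr=ABC123&lang=en"
print(leaked_credentials(link))  # {'surname': 'SMITH', 'pnr': 'ABC123'}
```

A crawler does not need to do any of this parsing, of course. It simply follows the link like any other visitor and indexes whatever page comes back.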
Back to the ad mentioned above: I’m guessing it may have been placed by the site findnetdeals.com, but that is only a guess and I may well be wrong. If anyone has a theory about what exactly the person who placed the ad is hoping to achieve, or what their strategy for this type of screen scraping might be, I’d love to hear your thoughts.