Internship: Data Analyst at Igola
Overview:
My internship experience as a data analyst at Igola:
Routine: Visualization
My routine work was to clean customer data, prepare it for analysis, and finally visualize and report the results to support operational or marketing decisions.
Sample report: Search Analysis
Project 1: Geographic Data Testing
Background
To map hotels in the app, the company needed accurate geographic data (longitude and latitude) from its data supplier. Getting the data was simple: submit a hotel name or hotel address to a supplier’s API, and a vector of geographic coordinates would be returned.
- Former supplier: Expensive, no matching support for hotels without an English name/address
- Ideal supplier: Lower price, supports both Chinese and English matching, high data quality
The company already had several providers on its list and wanted to choose the best one, meaning a high matching rate at a low price. My task was to test the data quality of the suppliers.
Procedure
I set a 95% matching rate as the threshold: a supplier that located more than 5% of our hotels incorrectly would not qualify. Then I ranked the qualified suppliers by their sample matching rate.
Sampling Design
There were thousands of hotels in the database, so it was infeasible to manually check the location of each hotel one by one. Sampling from the database and computing statistics on the sample was more efficient and feasible.
- Method: Stratified sampling.
- Overall sample size:
Based on the overall sample size, the sample size of each stratum could be calculated: 68 and 14, respectively.
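The allocation step can be sketched as follows. This is a minimal illustration of proportional allocation, one common way to split a stratified sample; the post does not state which allocation rule or which strata were actually used. The stratum names and population counts (3,400 and 700) are hypothetical, chosen only so that the split reproduces the 68 and 14 reported above.

```python
import random


def proportional_allocation(stratum_sizes, total_sample):
    """Split total_sample across strata in proportion to each stratum's size.

    Uses largest-remainder rounding so the parts always sum to total_sample.
    """
    population = sum(stratum_sizes.values())
    exact = {k: total_sample * n / population for k, n in stratum_sizes.items()}
    alloc = {k: int(v) for k, v in exact.items()}
    leftover = total_sample - sum(alloc.values())
    # Give the remaining units to the strata with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_fraction[:leftover]:
        alloc[k] += 1
    return alloc


def stratified_sample(hotels_by_stratum, allocation, seed=0):
    """Draw a simple random sample (without replacement) inside each stratum."""
    rng = random.Random(seed)
    return {k: rng.sample(hotels_by_stratum[k], allocation[k]) for k in allocation}


# Hypothetical stratum sizes; the real counts are not given in the post.
sizes = {"stratum_a": 3400, "stratum_b": 700}
allocation = proportional_allocation(sizes, total_sample=82)
print(allocation)  # {'stratum_a': 68, 'stratum_b': 14}
```

Fixing the random seed keeps the drawn sample reproducible, which helps when the same hotels have to be checked against several suppliers.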
Validation of location
Afterwards, I sampled hotels from the database, retrieved the corresponding geographic information from a supplier’s API, and checked whether each sampled hotel was correctly located. The workflow is shown below:
I used the built-in map service of my company’s app to validate the locations. It works like Google Maps: when you input a longitude and latitude, it pins the location on the map. If the pin fell in the same building as the hotel’s actual location, I classified the hotel as correctly located.
I validated the location of every hotel in the sample and obtained the supplier’s sample matching rate.
Report
I ran the validation procedure for each supplier and calculated its sample matching rate. For each supplier, a t-test for the sample proportion was applied to check whether it qualified ($\ge 95\%$ matching rate). Finally, I ranked the qualified suppliers by their sample matching rate.
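As a rough sketch of the qualification check, the code below tests a supplier's sample matching rate against the 95% threshold. It is not the test used in the original analysis: it swaps in an exact one-sided binomial test, because with a sample of about 82 hotels (assuming the two strata sum to the overall sample) and a 95% threshold, fewer than five mismatches are expected and a normal approximation is rough. The significance level `alpha=0.05`, the direction of the hypothesis (a supplier is rejected only when the data give evidence that its true rate is below 95%), and the function names are all assumptions.

```python
from math import comb


def binomial_lower_tail(k, n, p0):
    """Return P(X <= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k + 1))


def is_qualified(correct, n, p0=0.95, alpha=0.05):
    """One-sided exact test of H0: p >= p0 against H1: p < p0.

    Returns (qualified, p_value). The supplier stays qualified unless the
    p-value falls below alpha. The default alpha is an assumed value.
    """
    p_value = binomial_lower_tail(correct, n, p0)
    return p_value >= alpha, p_value


# Hypothetical counts for two suppliers, each checked on 82 sampled hotels.
for name, correct in [("supplier_1", 80), ("supplier_2", 70)]:
    qualified, p_value = is_qualified(correct, 82)
    print(name, round(correct / 82, 3), qualified, round(p_value, 4))
```

Suppliers that pass can then be sorted by `correct / n` to produce the ranking.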
The final report to my manager, written in Markdown, consisted of two parts:
- Detailed testing procedure
- Quality ranking list
Project 2: Web-scraping
Background
The second project was about web-scraping. Most Chinese internet companies treat customers who purchase on their phones as more important, because a very large number of people in China know how to use a phone but not a computer. Many internet companies therefore make optimizing their mobile apps the top priority, and so did my company. At that stage, neither the ranking algorithm nor the labeling system for our app was mature. To get some inspiration, my manager wanted to look at other popular travel apps and learn how successful apps rank and label hotels.
Content
So I needed to scrape information from several popular travel apps: both the ranking of hotels in some popular tourist cities and the tags for these hotels.
For instance, one city my company was interested in was Hong Kong. If you enter Hong Kong in the search bar of a travel app, you get a list like this:
More generally, like this:
Different apps return different hotel lists by default (without any filter): different hotels appear in first place, second place, and so on. So one of my tasks was to get the default hotel lists of the different apps.
Each hotel in a list carries its own tags, such as its name, star level, rating, address, and price:
And different apps tag their hotels in different ways:
So the other part I needed to scrape was the tags.
Output
The scraped information was transformed into tidy dataframes as shown below:
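As a rough illustration of what a tidy layout could look like for this kind of data, the sketch below flattens nested scrape results into two tables: one row per app, city, and rank position, and one row per hotel tag. The input structure, the field names, and the placeholder values ("App A", "Hotel X", "tag 1") are hypothetical and are not taken from the actual output.

```python
def to_tidy_rows(scraped):
    """Flatten {app: {city: [hotels in display order]}} into two row lists.

    rank_rows: one row per (app, city, rank) with the hotel shown there.
    tag_rows:  one row per (app, city, hotel, tag).
    """
    rank_rows, tag_rows = [], []
    for app, cities in scraped.items():
        for city, hotels in cities.items():
            # Rank is the hotel's position in the app's default list, from 1.
            for rank, hotel in enumerate(hotels, start=1):
                rank_rows.append(
                    {"app": app, "city": city, "rank": rank, "hotel": hotel["name"]}
                )
                for tag in hotel.get("tags", []):
                    tag_rows.append(
                        {"app": app, "city": city, "hotel": hotel["name"], "tag": tag}
                    )
    return rank_rows, tag_rows


# Placeholder input, not real scraped data.
scraped = {
    "App A": {
        "Hong Kong": [
            {"name": "Hotel X", "tags": ["tag 1", "tag 2"]},
            {"name": "Hotel Y", "tags": ["tag 3"]},
        ]
    }
}
rank_rows, tag_rows = to_tidy_rows(scraped)
for row in rank_rows:
    print(row)
```

Keeping ranks and tags in separate long tables makes it straightforward to compare the same hotel's position or labels across apps with a simple join or group-by.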