Internship: Data Analyst at Igola

Overview:

My internship experience as a data analyst at Igola:


Routine: Visualization

My routine work was to clean customer data, transform it for analysis, and then visualize and report the results to support operational and marketing decisions.

Sample report: Search Analysis (搜索分析)


Project 1: Geographic Data Testing

Background

To map hotels in the app, the company needed accurate geographic data (longitude and latitude) from a data supplier. Getting the data was simple: submit a hotel name or address to a supplier’s API, and it returns a pair of geographic coordinates.

The company already had several suppliers on its list and wanted to choose the best one, meaning a high matching rate at a low price. My task was to test the data quality of each supplier.

Procedure

I set a 95% matching rate as the threshold: any supplier that located more than 5% of our hotels incorrectly would not qualify. I then ranked the qualified suppliers by their sample matching rate.
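The filter-then-rank rule can be sketched in a few lines of Python. The supplier names and rates here are invented placeholders, not the real test results, and `rank_qualified` is a hypothetical helper name.

```python
THRESHOLD = 0.95  # minimum acceptable sample matching rate


def rank_qualified(rates, threshold=THRESHOLD):
    """Drop suppliers below the threshold, then sort the rest best-first."""
    qualified = {name: rate for name, rate in rates.items() if rate >= threshold}
    return sorted(qualified.items(), key=lambda item: item[1], reverse=True)


# Placeholder rates for illustration only.
rates = {"supplier_a": 0.97, "supplier_b": 0.93, "supplier_c": 0.985}
ranking = rank_qualified(rates)
print(ranking)
```

Here `supplier_b` falls below the threshold and is dropped before the remaining suppliers are ranked.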

Sampling Design

There were thousands of hotels in the database, so manually checking the location of every hotel was infeasible. Drawing a sample from the database and estimating the matching rate from it was far more practical.
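A simple random sample without replacement is one straightforward way to do this. The sketch below assumes the hotels are identified by integer IDs; the population size, the sample size of 200, and the function name `draw_sample` are all illustrative assumptions rather than the figures actually used.

```python
import random


def draw_sample(hotel_ids, sample_size, seed=0):
    """Draw a simple random sample without replacement.

    A fixed seed makes the sample reproducible, so the same hotels
    can be re-checked later or tested against another supplier.
    """
    rng = random.Random(seed)
    return rng.sample(list(hotel_ids), sample_size)


hotel_ids = range(1, 5001)  # hypothetical database of 5000 hotels
sample = draw_sample(hotel_ids, sample_size=200)
print(len(sample), sample[:5])
```

Reusing the same seed for every supplier means each one is tested on the same hotels, which keeps the comparison fair.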

Validation of location

Next, I sampled hotels from the database, retrieved the corresponding coordinates from a supplier’s API, and checked whether each sampled hotel was correctly located. The workflow is shown below:

I used the built-in map service of my company’s app to validate each location. It works much like Google Maps: when you enter a longitude and latitude, it drops a pin at that point on the map. If the pin fell in the same building as the hotel’s actual location, I classified the hotel as correctly located.
I validated every hotel in the sample and so obtained the supplier’s sample matching rate.
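Once each manual verdict is recorded, the sample matching rate is just the share of hotels marked as correctly located. A minimal sketch, assuming each check is stored as a small record; the `Validation` class, its field names, and the coordinates are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Validation:
    hotel_id: int
    longitude: float
    latitude: float
    same_building: bool  # manual verdict from the map check


def matching_rate(validations):
    """Share of sampled hotels whose pin fell in the right building."""
    if not validations:
        raise ValueError("no validated hotels")
    return sum(v.same_building for v in validations) / len(validations)


# Invented records for illustration only.
checks = [
    Validation(1, 114.17, 22.30, True),
    Validation(2, 114.16, 22.28, True),
    Validation(3, 114.18, 22.31, False),
    Validation(4, 114.15, 22.29, True),
]
rate = matching_rate(checks)
print(f"{rate:.0%}")
```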

Report

I ran the validation procedure for each supplier and calculated its sample matching rate. For each supplier, I applied a t-test for the sample proportion to check whether it qualified (matching rate $\ge 95\%$). Finally, I ranked the qualified suppliers by their sample matching rate.
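One way to run such a check is a one-sided test on the sample proportion. The sketch below swaps in a normal-approximation z-test rather than the t-test named above, and tests H0: p = 0.95 against H1: p < 0.95. The hypothesis direction, the 5% significance level, and the counts are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(correct, n, p0=0.95):
    """One-sided z-test of H0: p = p0 against H1: p < p0.

    Returns the sample proportion, the z statistic, and the p-value.
    Uses the normal approximation, which is reasonable for large n.
    """
    p_hat = correct / n
    standard_error = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    p_value = NormalDist().cdf(z)  # lower-tail probability
    return p_hat, z, p_value


# Invented counts: 180 of 200 sampled hotels correctly located.
p_hat, z, p_value = proportion_z_test(correct=180, n=200)
print(f"rate={p_hat:.3f}, z={z:.2f}, p={p_value:.4f}")
```

A small p-value (below 0.05, say) would be evidence that the supplier's true matching rate is under 95%, so the supplier would not qualify.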
The final report to my manager, written in Markdown, consisted of two parts:

  1. Detailed testing procedure
  2. Quality ranking list
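For the ranking list, a Markdown table can be generated directly from the ranked results. This is only a sketch of one possible layout; the column titles, the `ranking_table` helper, and the supplier names and rates are hypothetical.

```python
def ranking_table(ranking):
    """Render (supplier, rate) pairs as a Markdown table, in the given order."""
    lines = [
        "| Rank | Supplier | Sample matching rate |",
        "| --- | --- | --- |",
    ]
    for position, (name, rate) in enumerate(ranking, start=1):
        lines.append(f"| {position} | {name} | {rate:.1%} |")
    return "\n".join(lines)


# Placeholder ranking for illustration only.
table = ranking_table([("supplier_c", 0.985), ("supplier_a", 0.97)])
print(table)
```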

Project 2: Web-scraping

Background

The second project was about web scraping. Most Chinese internet companies treat customers who buy on their phones as the more important group, because a very large number of people in China are comfortable using a phone but not a computer. Many internet companies therefore make optimizing their mobile apps the top priority, and so did mine. At that stage, neither the ranking algorithm nor the labeling system for our apps was mature. For inspiration, my manager wanted to see how other popular travel apps rank and label their hotels.

Content

So I needed to scrape information from several popular travel apps: the ranking of hotels in some popular tourist cities and the tags attached to those hotels.
For instance, one city my company was interested in was Hong Kong. If you enter Hong Kong in the search bar of a travel app, you get a list like this:

More generally, like this:

Each app returns a different default hotel list (with no filter applied): different hotels appear in first place, second place, and so on. So one of my tasks was to collect the default hotel list from each app.
Each hotel in a list comes with its own tags, such as the hotel name, star rating, user rating, address, and price:

And different apps tag their hotels in different ways:

So the other part I needed to scrape was the tags.

Output

The scraped information was converted into tidy dataframes, as shown below:
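As a rough illustration of the tidy layout (one row per app, city, hotel, and tag), here is a minimal Python sketch that flattens nested scraped records into rows. The app names, hotel names, tag names, and the nested input shape are invented placeholders, and the original work may well have used different tooling.

```python
def to_tidy_rows(scraped):
    """Flatten {app: {city: [hotel, ...]}} into one row per hotel tag.

    The position of a hotel in its list is kept as its default rank.
    """
    rows = []
    for app, cities in scraped.items():
        for city, hotels in cities.items():
            for rank, hotel in enumerate(hotels, start=1):
                for tag, value in hotel["tags"].items():
                    rows.append({
                        "app": app,
                        "city": city,
                        "rank": rank,
                        "hotel": hotel["name"],
                        "tag": tag,
                        "value": value,
                    })
    return rows


# Invented input for illustration only.
scraped = {
    "app_x": {
        "Hong Kong": [
            {"name": "Hotel One", "tags": {"star": 5, "price": 1200}},
            {"name": "Hotel Two", "tags": {"star": 4, "price": 800}},
        ]
    }
}
rows = to_tidy_rows(scraped)
for row in rows:
    print(row)
```

Rows in this shape can be written out with `csv.DictWriter` or loaded into a dataframe library for filtering and comparison across apps.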


Reference

Zhijian Liu