- limit to 1000 webpages per website (Don) [URGENT]
- run Scrapy on Scrapinghub to collect two kinds of data (Alvin); see the spider sketch after this list
  - hyperlinks between websites (implemented)
  - text per website (Don)
- Remove irrelevant websites (Jenkins), either by
  - training a Naive Bayes classifier to filter on page words (Don; see the classifier sketch after this list), OR
  - finding the source of institution descriptions (very difficult)
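A minimal sketch of what the collector spider could look like, assuming a plain Scrapy project; the spider name, seed URL, and selectors are placeholders, and the 1000-page cap from the first item is enforced with a simple per-domain counter (allowed_domains / OffsiteMiddleware would additionally keep each crawl on-site):

```python
import scrapy
from collections import defaultdict
from urllib.parse import urlparse


class WebDataCollectorSpider(scrapy.Spider):
    """Collects inter-site hyperlinks and page text, capped per website."""
    name = "web-data-collector"
    start_urls = ["https://example.edu"]  # placeholder seed list

    PAGE_LIMIT = 1000  # max pages crawled per website

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.pages_crawled = defaultdict(int)  # domain -> page count

    def parse(self, response):
        domain = urlparse(response.url).netloc
        self.pages_crawled[domain] += 1

        # Kind 1: hyperlinks (made absolute, so cross-site links are visible).
        links = [response.urljoin(h)
                 for h in response.css("a::attr(href)").getall()]
        # Kind 2: the visible text of the page.
        text = " ".join(t.strip() for t in response.css("body ::text").getall())
        yield {"url": response.url, "links": links, "text": text}

        # Stop expanding a site once it hits the per-site cap.
        if self.pages_crawled[domain] < self.PAGE_LIMIT:
            yield from response.follow_all(links, callback=self.parse)
```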
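If the Naive Bayes route wins, a sketch with scikit-learn (an assumed library choice, not settled in the notes); the training texts and labels are placeholders for a small hand-labeled sample:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder hand-labeled sample: 1 = data science institution, 0 = irrelevant.
train_texts = [
    "data science institute machine learning research fellows",
    "discount shoes free shipping buy now",
]
train_labels = [1, 0]

model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

# Filter the scraped sites down to the relevant ones.
scraped = {"https://example.edu": "our lab studies statistical methods ..."}
relevant = {url: text for url, text in scraped.items()
            if model.predict([text])[0] == 1}
```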
For hyperlinks
- visualization: D3.js network diagram (see the export sketch below)
  - each node represents a data science institution
  - node color encodes the type of institution
  - node size encodes the significance of the institution
  - each link represents a collaboration between two institutions
  - link width encodes the strength of the tie
  - the absence of a link between two nodes indicates no collaboration
- network analysis (e.g., centrality as a measure of significance; see the NetworkX sketch below)
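A sketch of shaping the scraped link data into the nodes/links JSON that D3 network layouts conventionally consume; the edge list, institution types, and field names (id/group, source/target/value) are placeholder assumptions:

```python
import json

# Placeholder collaboration edges: (site_a, site_b, link_count).
edges = [("inst-a.edu", "inst-b.org", 12), ("inst-b.org", "inst-c.gov", 3)]
# Placeholder institution types for color-coding the nodes.
inst_type = {"inst-a.edu": "academic",
             "inst-b.org": "nonprofit",
             "inst-c.gov": "government"}

nodes = [{"id": site, "group": kind} for site, kind in inst_type.items()]
links = [{"source": a, "target": b, "value": w}  # value -> link width
         for a, b, w in edges]

with open("graph.json", "w") as f:
    json.dump({"nodes": nodes, "links": links}, f, indent=2)
```

Node size (significance) could then be filled in from the centrality scores computed in the next sketch.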
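For the network analysis, a minimal NetworkX sketch over the same placeholder edge list; degree centrality stands in for "significance" here, and any other measure could be swapped in:

```python
import networkx as nx

# Same placeholder edges: (site_a, site_b, tie_strength).
edges = [("inst-a.edu", "inst-b.org", 12), ("inst-b.org", "inst-c.gov", 3)]

G = nx.Graph()
G.add_weighted_edges_from(edges)

# Degree centrality as a stand-in for institutional significance.
significance = nx.degree_centrality(G)
# Betweenness highlights institutions that bridge otherwise separate groups.
brokers = nx.betweenness_centrality(G, weight="weight")

for site in G.nodes:
    print(f"{site}: significance={significance[site]:.2f}, "
          f"betweenness={brokers[site]:.2f}")
```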
For texts
- classification into topic buckets: build an LDA model in Apache Spark (Louie and Don); see the PySpark sketch after this list
- visualization: word cloud for each bucket
- performance: run the Spark job on an AWS EC2 instance (Alvin)
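A combined sketch of the LDA bucketing and the per-bucket word clouds, using PySpark's ML API plus the wordcloud package (an assumed library choice); the two input rows, vocabSize, and k=5 are placeholders to be tuned:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA
from wordcloud import WordCloud  # pip install wordcloud

spark = SparkSession.builder.appName("dsi-ecosystem-mapping").getOrCreate()

# Placeholder input: one row of scraped text per website.
df = spark.createDataFrame(
    [("inst-a.edu", "machine learning research and data science education"),
     ("inst-b.org", "open data policy and civic technology programs")],
    ["site", "text"])

tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)
cleaned = StopWordsRemover(inputCol="words",
                           outputCol="filtered").transform(tokens)
cv_model = CountVectorizer(inputCol="filtered", outputCol="features",
                           vocabSize=5000).fit(cleaned)
vectors = cv_model.transform(cleaned)

# k buckets; k is a placeholder and should be tuned.
lda_model = LDA(k=5, maxIter=20, featuresCol="features").fit(vectors)

# One word cloud per bucket, with words sized by their LDA term weights.
vocab = cv_model.vocabulary
for row in lda_model.describeTopics(maxTermsPerTopic=25).collect():
    freqs = {vocab[i]: w for i, w in zip(row.termIndices, row.termWeights)}
    WordCloud(width=800, height=400).generate_from_frequencies(freqs) \
        .to_file(f"topic_{row.topic}.png")
```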
- Change the project name from "labs" to "dsi-ecosystem-mapping" and the spider file from "dlab.py" to "web-data-collector.py" without changing the settings