- limit to 1000 webpages per website (Don) [URGENT]
- run Scrapy on Scrapinghub to collect two kinds of data (Alvin); see the spider sketch after this list
  - hyperlinks between websites (implemented)
  - text per website (Don)
- Remove irrelevant websites (Jenkins), either by
  - training a Naive Bayes classifier to filter on page words (Don; see the classifier sketch after this list), OR
  - finding the source of institution descriptions (very difficult)
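A minimal sketch of what the collector spider could look like, assuming a plain Scrapy project; the spider name, seed URL, and selectors are placeholders, and the 1000-page cap from the first item is enforced with a simple per-domain counter (allowed_domains / OffsiteMiddleware would additionally keep each crawl on-site):

```python
import scrapy
from collections import defaultdict
from urllib.parse import urlparse


class WebDataCollectorSpider(scrapy.Spider):
    """Collects inter-site hyperlinks and page text, capped per website."""
    name = "web-data-collector"
    start_urls = ["https://example.edu"]  # placeholder seed list

    PAGE_LIMIT = 1000  # max pages crawled per website

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.pages_crawled = defaultdict(int)  # domain -> page count

    def parse(self, response):
        domain = urlparse(response.url).netloc
        self.pages_crawled[domain] += 1

        # Kind 1: hyperlinks (made absolute, so cross-site links are visible).
        links = [response.urljoin(h)
                 for h in response.css("a::attr(href)").getall()]
        # Kind 2: the visible text of the page.
        text = " ".join(t.strip() for t in response.css("body ::text").getall())
        yield {"url": response.url, "links": links, "text": text}

        # Stop expanding a site once it hits the per-site cap.
        if self.pages_crawled[domain] < self.PAGE_LIMIT:
            yield from response.follow_all(links, callback=self.parse)
```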
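If the Naive Bayes route wins, a sketch with scikit-learn (an assumed library choice, not settled in the notes); the training texts and labels are placeholders for a small hand-labeled sample:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder hand-labeled sample: 1 = data science institution, 0 = irrelevant.
train_texts = [
    "data science institute machine learning research fellows",
    "discount shoes free shipping buy now",
]
train_labels = [1, 0]

model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

# Filter the scraped sites down to the relevant ones.
scraped = {"https://example.edu": "our lab studies statistical methods ..."}
relevant = {url: text for url, text in scraped.items()
            if model.predict([text])[0] == 1}
```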
For hyperlinks
- visualization: D3.js network diagram (see the export sketch below)
  - each node represents a data science institution
  - node color encodes the type of institution
  - node size encodes the significance of the institution
  - each link represents a collaboration between two institutions
  - link width encodes the strength of the tie
  - the absence of a link between two nodes indicates no collaboration
- network analysis (e.g., centrality as a measure of significance; see the NetworkX sketch below)
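A sketch of shaping the scraped link data into the nodes/links JSON that D3 network layouts conventionally consume; the edge list, institution types, and field names (id/group, source/target/value) are placeholder assumptions:

```python
import json

# Placeholder collaboration edges: (site_a, site_b, link_count).
edges = [("inst-a.edu", "inst-b.org", 12), ("inst-b.org", "inst-c.gov", 3)]
# Placeholder institution types for color-coding the nodes.
inst_type = {"inst-a.edu": "academic",
             "inst-b.org": "nonprofit",
             "inst-c.gov": "government"}

nodes = [{"id": site, "group": kind} for site, kind in inst_type.items()]
links = [{"source": a, "target": b, "value": w}  # value -> link width
         for a, b, w in edges]

with open("graph.json", "w") as f:
    json.dump({"nodes": nodes, "links": links}, f, indent=2)
```

Node size (significance) could then be filled in from the centrality scores computed in the next sketch.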
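For the network analysis, a minimal NetworkX sketch over the same placeholder edge list; degree centrality stands in for "significance" here, and any other measure could be swapped in:

```python
import networkx as nx

# Same placeholder edges: (site_a, site_b, tie_strength).
edges = [("inst-a.edu", "inst-b.org", 12), ("inst-b.org", "inst-c.gov", 3)]

G = nx.Graph()
G.add_weighted_edges_from(edges)

# Degree centrality as a stand-in for institutional significance.
significance = nx.degree_centrality(G)
# Betweenness highlights institutions that bridge otherwise separate groups.
brokers = nx.betweenness_centrality(G, weight="weight")

for site in G.nodes:
    print(f"{site}: significance={significance[site]:.2f}, "
          f"betweenness={brokers[site]:.2f}")
```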
For texts
- classification into topic buckets: build an LDA model in Apache Spark (Louie and Don); see the PySpark sketch after this list
- visualization: word cloud for each bucket
- performance: run the Spark job on an AWS EC2 instance (Alvin)
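A combined sketch of the LDA bucketing and the per-bucket word clouds, using PySpark's ML API plus the wordcloud package (an assumed library choice); the two input rows, vocabSize, and k=5 are placeholders to be tuned:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA
from wordcloud import WordCloud  # pip install wordcloud

spark = SparkSession.builder.appName("dsi-ecosystem-mapping").getOrCreate()

# Placeholder input: one row of scraped text per website.
df = spark.createDataFrame(
    [("inst-a.edu", "machine learning research and data science education"),
     ("inst-b.org", "open data policy and civic technology programs")],
    ["site", "text"])

tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)
cleaned = StopWordsRemover(inputCol="words",
                           outputCol="filtered").transform(tokens)
cv_model = CountVectorizer(inputCol="filtered", outputCol="features",
                           vocabSize=5000).fit(cleaned)
vectors = cv_model.transform(cleaned)

# k buckets; k is a placeholder and should be tuned.
lda_model = LDA(k=5, maxIter=20, featuresCol="features").fit(vectors)

# One word cloud per bucket, with words sized by their LDA term weights.
vocab = cv_model.vocabulary
for row in lda_model.describeTopics(maxTermsPerTopic=25).collect():
    freqs = {vocab[i]: w for i, w in zip(row.termIndices, row.termWeights)}
    WordCloud(width=800, height=400).generate_from_frequencies(freqs) \
        .to_file(f"topic_{row.topic}.png")
```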
- Change the project name from "labs" to "dsi-ecosystem-mapping" and the spider file from "dlab.py" to "web-data-collector.py" without changing the settings