joekgamer Posted June 3, 2011
Search engines crawl the web looking for links to pages, and then 'rank' pages based on the links pointing to them. With this data, wouldn't it be possible to build a directed graph that is essentially a 'map' of the internet? I might be able to write a bot to gather the links, depending on how complex that is (I haven't attempted anything similar, so I don't know how difficult it would be), but does anyone know of an open-source bot, or even somewhere I could simply download the data? And are there any recommendations for graphing programs?
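For a sense of scale, here is a minimal sketch of the kind of bot that would gather those links, using only the Python standard library. The seed URL, page limit, and timeout are placeholders, and a real crawler would also need politeness delays, robots.txt handling, and better error recovery.

```python
# Minimal sketch of a link-mapping bot (Python standard library only).
# The seed URL and max_pages limit are arbitrary placeholders.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from collections import deque

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=50):
    """Breadth-first crawl; returns {page: [pages it links to]}."""
    graph = {}                      # directed edges: source -> targets
    queue = deque([seed])
    while queue and len(graph) < max_pages:
        url = queue.popleft()
        if url in graph:
            continue
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            graph[url] = []         # unreachable page, no outgoing edges
            continue
        parser = LinkParser()
        parser.feed(html)
        targets = [urljoin(url, href) for href in parser.links
                   if urlparse(urljoin(url, href)).scheme in ("http", "https")]
        graph[url] = targets
        queue.extend(targets)
    return graph

if __name__ == "__main__":
    g = crawl("http://example.com")   # placeholder seed
    print(len(g), "pages mapped")
```

The dict this returns is already an adjacency list, which is the natural input for whatever graphing program you end up choosing.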
CraigD Posted June 3, 2011
... but would anyone know of an open-source bot or even where I could simply download the data? And are there any recommendations for graphing programs?

I've only ever worked with web crawlers (AKA spiders, webbots, etc.) that I wrote myself on top of nothing but TCP/IP and used for limited-purpose data gathering, none of which were "industrial strength" enough to "map the topology" of all the HTML and more dynamic links of the WWW the way commercial ones like googlebot or slurp can. Writing your own from scratch is a learning experience and a lot of work! This Wikipedia section has a list of open-source crawlers. Though I've never tried it, I imagine graphing the vast number of connections into something that has much value as anything other than a sort of visual art would be challenging. Though the best-known maps of this kind are of the network itself - the actual routers and servers that host the internet - you might get some ideas and advice from sources like the Opte Project, where they've made some very cool-looking maps, e.g.:

I'm very interested in seeing what you come up with - good luck :thumbs_up
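As a starting point for the graphing side, here is one way a first-look picture could be drawn with networkx and matplotlib (both open source); this assumes the {page: [linked pages]} dict from a crawl like the sketch above, and for anything beyond a few thousand nodes a dedicated tool such as Gephi or Graphviz would be a better fit.

```python
# Rough sketch: draw a crawled link graph with networkx + matplotlib.
# link_map is assumed to be a {source_url: [target_urls]} dict.
import networkx as nx
import matplotlib.pyplot as plt

def draw_link_graph(link_map, out_file="web_map.png"):
    G = nx.DiGraph()
    for source, targets in link_map.items():
        for target in targets:
            G.add_edge(source, target)
    pos = nx.spring_layout(G)               # force-directed layout
    nx.draw(G, pos, node_size=20, arrows=False, with_labels=False)
    plt.savefig(out_file, dpi=200)
    plt.close()

# draw_link_graph(crawl("http://example.com"))   # hypothetical usage
```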
C1ay Posted June 3, 2011
... but would anyone know of an open-source bot or even where I could simply download the data? And are there any recommendations for graphing programs?

I used ht://Dig for several years. It's a mature crawler that does a fair job. Stick it on an old Linux box in the corner and let it surf. I'd suggest placing the acquired data in a database like PostgreSQL, where you could do statistics and plotting with R.
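A rough sketch of the database side, assuming psycopg2 and a local PostgreSQL instance; the table name and columns here are made up for illustration. ht://Dig keeps its own databases, so you would either export from those or record edges yourself while crawling, and then point R (or anything else) at this table.

```python
# Hypothetical edge store: load {source: [targets]} into PostgreSQL.
# The link_edges table and the DSN are illustrative assumptions.
import psycopg2

def store_edges(link_map, dsn="dbname=webmap user=postgres"):
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS link_edges (
            source  text NOT NULL,
            target  text NOT NULL
        )
    """)
    for source, targets in link_map.items():
        for target in targets:
            cur.execute(
                "INSERT INTO link_edges (source, target) VALUES (%s, %s)",
                (source, target),
            )
    conn.commit()
    cur.close()
    conn.close()
```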
alexander Posted June 7, 2011
And don't forget space... lots and lots of space :)

On the creepy crawlies: Google bots actually run on clusters of machines with stupendous amounts of memory. They gather the data into memory, where it is then further analyzed by other programs and algorithms, and eventually stored in Google's index. It has been said that Google's data centers have enough RAM to take a snapshot of the internet, which is very impressive, and no information ever gets deleted. Now, here are some scary thoughts (warned ya): any website you put up will be crawled within minutes, indexed, cached, and out of your direct control. So any website you put up can be found and read if the need arises, regardless of how long it has been gone (since like the '90s or something) or how briefly it was in operation (even less than an hour). So be thoughtful about what you put up, because you won't be able to take it back...
alexander Posted June 7, 2011
You can create a map of the internet - well, sort of; it depends on what this "map" is. Firstly, not all links are static, so not all routes lead to all the places, and if you're talking about AJAX, that goes out the window entirely - or you would have to write a very complicated crawler to make it work. But why reinvent the wheel? There are always places to find maps done by other people. For example, here is a networks map, i.e. large hosts and their interconnects (something that is a lot easier to map precisely than what you are asking about, of course): peer map. And then there is always this map here.