Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
199 views
in Technique[技术] by (71.8m points)

java disc based hashmap

I'm working on a web crawler (please don't suggest an existing one, its not an option). I have it working the way it is expected to. My only issue is that currently I'm using a sort of server/client model where by the server does the crawling and processes the data, it then put it in a central location.

This location is an object create from a class i wrote. Internally the class maintains a hashmap defined as HashMap<String, HashMap<String, String>>

I store data in the map making the url the key (i keep these unique) and the hasmap value stores the corresponding data fields for that url such as title,value etc

I occasionally serialize the internal objects used but the spider is multi threaded and as soon as i have say 5 threads crawling the memory requirements go up exponentially.

To so far the performance has been excellent with the hashmap, crawling 15K urls in 2.r minutes with about 30 seconds CPU time so i really don't need to be pointed in the direction of an existing spider like most forum users have suggested.

Can anyone suggest a a fast disc based solution that will probably support concurrent reading & writing? The data structure doesnt have to be the same, just needs to be able to store related meta tag values together etc.

thanks in advance

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

I suggest using EhCache for this, even though what you're building isn't really a cache. EhCache allows you to configure the cache instance so that it overflows to disc storage, while keeping the most recent items in memory. It can also be configured to be disc-persistent, i.e. data is flushed to disc on shutdown, and read back into memory at startup. On top of all that, it's key-value based, so it already fits your model. It supports concurrent access, and since the disk storage is managed as a separate thread, you shouldn't need to worry about disk access concurrency.

Alternatively, you could consider a proper embedded database such as Hypersonic (or numerous others of a similar style), but that's probably going to be more work.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

57.0k users

...