
Oren Eini

Wizard at Hibernating Rhinos (@ayende)

Hadera, IL

Joined May 2010

About

Oren Eini (@ayende) is the CEO of Hibernating Rhinos, an Israeli hi-tech company that develops RavenDB (ravendb.net), a pioneering NoSQL document database that is fully transactional across the entire database. He and his team also develop productivity tools for OLTP applications, such as NHibernate Profiler (nhprof.com), Linq to SQL Profiler (l2sprof.com), Entity Framework Profiler (efprof.com), and more. Twitter: @HiberRhinos

Stats

Reputation: 9526
Pageviews: 4.2M
Articles: 41
Comments: 1

Articles

Data Management in Complex Systems
Learn the reasons to use application databases and explicit publishing of data in your architecture, and how to manage growth in your data scope and size.
September 26, 2022
· 8,617 Views · 7 Likes
Analyzing the GitHub Outage
Learn about what happened when GitHub was down for over 24 hours and how they handle performance disruptions.
Updated May 26, 2022
· 5,802 Views · 4 Likes
Optimizing Access Patterns for Extendible Hashing
In this article, let's take a look at optimizing access patterns for extendible hashing.
December 17, 2019
· 19,409 Views · 1 Like
Workflow Design: What You Shouldn’t Be Looking For
Getting rid of developers in favor of a perceived does-it-all tool makes sense. Until it doesn't.
Updated April 2, 2019
· 4,840 Views · 1 Like
Using TLS With Rust: Authentication
Learn how to implement the authentication portion of your network protocol in Rust.
February 12, 2019
· 7,446 Views · 3 Likes
Using TLS in Rust: Getting Async I/O With tokio (Part 2)
Check out this second installment on using Async I/O and tokio!
February 8, 2019
· 9,012 Views · 2 Likes
Data Modeling With Indexes: Introduction
An introduction to data modeling with indexes.
January 23, 2019
· 8,787 Views · 2 Likes
Using TLS With Rust (Part 2): Client Authentication
Learn more about client authentication with Rust.
January 22, 2019
· 8,149 Views · 2 Likes
Writing My Network Protocol in Rust
A dev shows us some Rust code he used to implement a network protocol. Follow along and knock the rust off your coding skills!
January 7, 2019
· 11,587 Views · 3 Likes
Using TLS With Rust (Part 1)
Learn more about how one dev was able to debug his own dependencies.
January 7, 2019
· 9,672 Views · 1 Like
Using OpenSSL With libuv
Learn more about using libuv to allow more than a single connection per thread — with the help of TLS.
Updated December 13, 2018
· 10,019 Views · 4 Likes
Graphs in RavenDB: Real World Use Cases
I’m going to use this post to discuss some of the options that the new graph features enable, exploring a simple model.
November 12, 2018
· 6,662 Views · 1 Like
Why IOPS Matters for the Database
As it turns out, if you use all your IOPS burst capacity, you end up having to get your I/O through a straw.
February 8, 2018
· 11,839 Views · 3 Likes
Inventory Management in MongoDB: A Design Philosophy I Find Baffling
I know this code was written to be readable and easy to explain rather than to withstand the vagaries of production, but it's still a very dangerous thing to do.
June 26, 2017
· 9,652 Views · 2 Likes
Storing Secrets in Linux
Learn what an industry thought leader has to say about the cumbersome nature of storing secrets in Linux, as opposed to easier methods of storing secrets in Windows.
May 8, 2017
· 19,575 Views · 3 Likes
How Does LZ4 Acceleration Work?
LZ4 acceleration searches in wider increments, which reduces the number of potential matches it finds, increasing speed at the cost of a lower compression ratio.
April 27, 2017
· 3,566 Views · 4 Likes
REST vs. TCP
With HTTP, each call is stateless and I can’t assume anything about the other side. With TCP, on the other hand, I can make a lot of assumptions about the conversation.
January 19, 2017
· 20,574 Views · 6 Likes
Business Features vs. Technical Features
Though it may be fine and thin, there is a line between business and technical features.
December 19, 2016
· 10,102 Views · 2 Likes
Installing RavenDB 4.0 on Your Raspberry Pi 3
Here's a quick guide to get the NoSQL database RavenDB up and running on your Raspberry Pi — a possible data solution for your IoT projects.
December 7, 2016
· 6,051 Views · 2 Likes
The Design of RavenDB 4.0: The Implications of the Blittable Format
Probably the hardest part of the design of RavenDB 4.0 is that we are thinking very hard about how to achieve our goals (and in many cases exceed them) not by writing code, but by arranging things so that the right thing happens on its own. Architecture and optimization by omission, so to speak.
May 4, 2016
· 2,536 Views · 1 Like
Trie-based Routing
The major reason routing is such an expensive operation in MVC is that it does quite a lot. By using tries, you can make it significantly cheaper.
February 23, 2016
· 3,706 Views · 1 Like
Comparing Developers
Recently I had to try to explain to a non-technical person how I rate the developers that I work with. In technical terms, it is easy to do int Compare(devA, devB, ctx), but it is very hard to do int Compare(devA, devB), and harder still var score = Evaluate(dev).

What do I mean by that? I mean that it is pretty hard (at least for me) to give an objective measure of a developer in the absence of anyone to compare them to, but it is very easy to compare two developers, and even so, only in a given context. An objective evaluation of a developer is pretty hard, because there isn't much that you can objectively measure. I'm sure that no reader of mine would suggest doing something like measuring lines of code, although I wish it were as easy as that.

How do you measure the effectiveness of a developer? Well, to start with, you need to figure out the area in which you are measuring them. Trying to evaluate yours truly on his HTML5 dev skills would be… a negative experience. But in their areas of expertise, measuring the effectiveness of two people is much easier. I know that if I give a particular task to Joe, he will get it done on his own. But if I give it to Mark, it will require some guidance, yet he will finish it much more quickly. And Scott is great at finding the root cause of a problem, but is prone to analysis paralysis unless prodded.

This came up when I tried to explain why a person spending two weeks on a particular problem was a reasonable thing, and that in many cases you need a… spark of inspiration for certain things to just happen. All the measurement techniques that I'm familiar with are subject to the observer effect, which means that you might get a pretty big and nasty surprise when people adapt their behavior to match the required observations.

The problem is that most of the time, development is about putting one foot after the other, getting things progressively better by making numerous minor changes that have a major effect. And then you have a need for brilliance: a customer with a production problem that requires someone to hold the entire system in their head all at once to figure out, a way to optimize a particular approach, etc. The nasty part is that there is very little way to actually summon those sparks of inspiration. But there is usually a correlation between certain people and the number of sparks of inspiration they get per time period. And one person's spark can lead another to the right path, and then you have an avalanche of good ideas. I'll talk about the results of this in another post.
June 30, 2015
· 1,222 Views · 0 Likes
What is New in RavenDB 3.0: Indexing Enhancements
We talked previously about the kind of improvements we have in RavenDB 3.0 for the indexing backend. In this post, I want to go over a few features that are much more visible.

Attachment indexing. This is a feature that I am not so hot about, mostly because we want to move all attachment usage to RavenFS. But in the meantime, you can reference the contents of an attachment during indexing. That lets you do things like store large text data in an attachment, but still make it available to the indexes. That said, there is no tracking of the attachment, so if it changes, the document that referred to it won't be re-indexed as well. But for the common case where both the attachment and the document are always changed together, that can be a pretty nice thing to have.

Optimized new index creation. In RavenDB 2.5, creating a new index would force us to go over all of the documents in the database, not just the documents in that collection. In many cases, that surprised users, because they expected there to be some sort of physical separation between the collections. In RavenDB 3.0, we changed things so that creating a new index on a small collection (by default, less than 131,072 items) will only touch the documents that belong to the collections covered by that index. This alone represents a pretty significant change in the way we are processing indexes. In practice, this means that creating a new index on a small collection will complete much more rapidly. For example, I reset an index on a production instance; it covers about 7,583 documents out of 19,191. RavenDB was able to index those in just 690 ms, out of the roughly 3 seconds the whole index reset took.

What about the cases where we have new indexes on large collections? In 2.5, we would do round-robin indexing between the new index and the existing ones. The problem was that 2.5 was biased toward the new index. It was busy indexing the new stuff, while the existing indexes (which you are actually using) took longer to run. Another problem was that in 2.5, creating a new index would effectively poison a lot of performance heuristics. Those were built on the assumption that all indexes run pretty much in tandem, and when one or more weren't doing so… well, that caused things to be more expensive. In 3.0, we have changed how this works. We have separate performance optimization pipelines for each group of indexes based on its rough indexing position. That lets us take advantage of batching many indexes together. We are also not going to try to interleave the indexes (running first the new index and then the existing ones). Instead, we run all of them in parallel, to reduce stalls and to increase the speed at which everything comes up to date. This uses our scheduling engine to ensure that we aren't overloading the machine with computation work (concurrent indexing) or memory (number of items to index at once). I'm very proud of what we have done here, and even though this is actually a backend feature, it is too important to get lost in the minutiae of all the other backend indexing changes we talked about in my previous post.

Explicit Cartesian/fanout indexing. A Cartesian index (we usually call them fanout indexes) is an index that outputs multiple index entries per document. Here is an example of such an index:

from postComment in docs.PostComments
from comment in postComment.Comments
where comment.IsSpam == false
select new
{
    CreatedAt = comment.CreatedAt,
    CommentId = comment.Id,
    PostCommentsId = postComment.__document_id,
    PostId = postComment.Post.Id,
    PostPublishAt = postComment.Post.PublishAt
}

For a large post with a lot of comments, we are going to get an entry per comment. That means that a single document can generate hundreds of index entries. Now, in this case, that is actually what I want, so that is fine. But there is a problem here. RavenDB has no way of knowing upfront how many index entries a document will generate. That makes it very hard to allocate the appropriate amount of memory reserves, and it is possible to get into situations where we simply run out of memory. In RavenDB 3.0, we have added explicit instructions for this. An index has a budget: by default, each document is allowed to output up to 15 entries. If it tries to output more than 15 entries, indexing of that document is aborted, and it won't be indexed by this index. You can override this option either globally or on an index-by-index basis to increase the number of index entries per document allowed for an index (and old indexes will have a limit of 16,384 items, to avoid breaking existing indexes). The reason this is done is so that either you didn't specify a value, in which case we are limited to the default 15 index entries per document, or you did specify what you believe is the maximum number of index entries output per document, in which case we can take advantage of that when doing capacity planning for memory during indexing.

Simpler auto indexes. This feature is closely related to the previous one. Let us say that we want to find all users that have an admin role and an unexpired credit card. We do that using the following query:

var q = from u in session.Query<User>()
        where u.Roles.Any(x => x.Name == "admin") && u.CreditCards.Any(x => x.Expired == false)
        select u;

In RavenDB 2.5, we would generate the following index to answer this query:

from doc in docs.Users
from docCreditCardsItem in ((IEnumerable<dynamic>)doc.CreditCards).DefaultIfEmpty()
from docRolesItem in ((IEnumerable<dynamic>)doc.Roles).DefaultIfEmpty()
select new
{
    CreditCards_Expired = docCreditCardsItem.Expired,
    Roles_Name = docRolesItem.Name
}

And in RavenDB 3.0 we generate this:

from doc in docs.Users
select new
{
    CreditCards_Expired = (
        from docCreditCardsItem in ((IEnumerable<dynamic>)doc.CreditCards).DefaultIfEmpty()
        select docCreditCardsItem.Expired).ToArray(),
    Roles_Name = (
        from docRolesItem in ((IEnumerable<dynamic>)doc.Roles).DefaultIfEmpty()
        select docRolesItem.Name).ToArray()
}

Note the difference between the two. The 2.5 version would generate multiple index entries per document, while RavenDB 3.0 generates just one. What is worse is that 2.5 would generate a Cartesian product, so the number of index entries output in 2.5 would be the number of roles for a user times the number of credit cards they have. In RavenDB 3.0, we have just one entry, and the overall cost is much reduced. It was a big change, but I think it was well worth it, considering the alternative.

In my next post, I'll talk about the other side of indexing: queries. Hang on, we still have a lot to go through.
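To make the Cartesian-product cost discussed above concrete, here is a small, self-contained LINQ-to-objects sketch (plain C#, no RavenDB involved; the User shape and values are mine) that contrasts the number of entries the 2.5-style projection produces with the single entry of the 3.0-style projection:

using System;
using System.Collections.Generic;
using System.Linq;

class User
{
    public List<string> Roles = new List<string>();
    public List<bool> CreditCardsExpired = new List<bool>();
}

class FanoutDemo
{
    static void Main()
    {
        var user = new User
        {
            Roles = new List<string> { "admin", "backup", "reports" },
            CreditCardsExpired = new List<bool> { false, false, true, false }
        };

        // 2.5-style auto index: one entry per (role, credit card) pair, a Cartesian product.
        var oldStyleEntries = (from role in user.Roles.DefaultIfEmpty()
                               from expired in user.CreditCardsExpired.DefaultIfEmpty()
                               select new { Roles_Name = role, CreditCards_Expired = expired }).ToList();

        // 3.0-style auto index: a single entry whose fields hold arrays of values.
        var newStyleEntry = new
        {
            Roles_Name = user.Roles.ToArray(),
            CreditCards_Expired = user.CreditCardsExpired.ToArray()
        };

        Console.WriteLine("2.5 style: {0} index entries", oldStyleEntries.Count); // 12 (3 roles x 4 cards)
        Console.WriteLine("3.0 style: 1 index entry ({0} roles, {1} cards)",
            newStyleEntry.Roles_Name.Length, newStyleEntry.CreditCards_Expired.Length);
    }
}

For a user with 3 roles and 4 credit cards, the old shape yields 12 index entries, while the new shape yields exactly one.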
September 26, 2014
· 3,704 Views · 0 Likes
Message Passing, Performance - Take 2
In my previous post, I did some rough “benchmarks” to see how message passing options behave. I got some great comments, and I thought I’d expand on that.

The baseline for this was a blocking queue, using which we managed to get: 145,271,000 msgs in 00:00:10.4597977, for 13,888,510 ops/sec. And the async BufferBlock, with which we got: 43,268,149 msgs in 00:00:10, for 4,326,815 ops/sec. Using the LMAX Disruptor, we got a disappointing: 29,791,996 msgs in 00:00:10.0003334, for 2,979,100 ops/sec.

However, it was pointed out that I can significantly improve this if I change the code to be:

var disruptor = new Disruptor.Dsl.Disruptor<Holder>(() => new Holder(),
    new SingleThreadedClaimStrategy(256), new YieldingWaitStrategy(), TaskScheduler.Default);

After which we get a very nice: 141,501,999 msgs in 00:00:10.0000051, for 14,150,193 ops/sec.

Another request I got was to test this with a concurrent queue, which is actually what it is meant for. The code is the same as for the blocking queue; we just changed Bus to ConcurrentQueue. Using that, we got: 170,726,000 msgs in 00:00:10.0000042, for 17,072,593 ops/sec.

And yes, this is pretty much just because I could. Any of those methods is quite significantly faster than anything close to what I actually need.
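For context, a minimal sketch of what such a blocking-queue baseline can look like (my reconstruction, not the original benchmark code; the bounded capacity and message type are arbitrary):

using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading.Tasks;

class BlockingQueueBenchmark
{
    static void Main()
    {
        var queue = new BlockingCollection<int>(boundedCapacity: 1024);
        long received = 0;
        var duration = TimeSpan.FromSeconds(10);

        // Consumer: drain the queue until the producer marks it complete.
        var consumer = Task.Run(() =>
        {
            foreach (var _ in queue.GetConsumingEnumerable())
                received++;
        });

        // Producer: push messages as fast as possible for roughly 10 seconds.
        var sw = Stopwatch.StartNew();
        while (sw.Elapsed < duration)
            queue.Add(42);
        queue.CompleteAdding();

        consumer.Wait();
        sw.Stop();

        Console.WriteLine($"{received:N0} msgs in {sw.Elapsed} for {received / sw.Elapsed.TotalSeconds:N0} ops/sec");
    }
}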
July 16, 2014
· 4,730 Views · 0 Likes
Distributed Counters Feature Design
This is another experiment with longer posts. Previously, I used the time series example as the bed on which to test some ideas regarding feature design, to explain how we work and in general to work out the rough patches along the way. I should probably note that these posts are purely fiction at this point. We have no plans to include a time series feature in RavenDB at this time. I am trying to work out some thoughts in the open and get your feedback.

At any rate, yesterday we had a request for Cassandra-style counters on the mailing list. And as long as I am doing a feature design series, I thought that I could talk about how I would go about implementing this. Again, consider this fiction; I have no plans of implementing this at this time.

The essence of what we want is to be able to… count stuff. Efficiently, in a distributed manner, with optional support for cross-data-center replication. Very roughly, the idea is to have “sub counters”, unique for every node in the system. Whenever you increment the value, we log this to our own sub counter and then replicate it out. Whenever you read it, we just sum all the data we have from all the sub counters. Let us outline the various parts of the solution in the same order as the one I used for time series.

Storage

A counter is just a named 64-bit signed integer. A counter name can be any string up to 128 printable characters. The external interface of the storage would look like this:

public struct CounterIncrement
{
    public string Name;
    public long Change;
}

public struct Counter
{
    public string Name;
    public string Source;
    public long Value;
}

public interface ICounterStorage
{
    void LocalIncrementBatch(CounterIncrement[] batch);

    Counter[] Read(string name);

    void ReplicatedUpdates(Counter[] updates);
}

As you can see, this gives us a very simple interface for the storage. We can either change the data locally (which modifies our own storage) or we can get an update from a replica about its changes. There really isn't much more to it, to be fair. LocalIncrementBatch() increments local values, and Read() will return all the values for a counter.

There is a little bit of trickery involved in how exactly one would store the counter values. For now, I think we'll store each counter as a two-step value: a tree of multi-tree values that carries each value from each source. That means that a counter will take roughly 4KB or so. This is easy to work with and nicely fits the model Voron uses internally. Note that we'll outline additional requirements for storage (searching for a counter by prefix, iterating over counters, addresses of other servers, stats, etc.) below. I'm not showing them here because they aren't the major issue yet.

Over the Wire

Skipping out on any optimizations that might be required, we will expose the following endpoints:

GET /counters/read?id=users/1/visits&users/1/posts will return a JSON response with all the relevant values (already summed up):
{ "users/1/visits": 43, "users/1/posts": 3 }

GET /counters/read?id=users/1/visits&users/1/posts&raw=true will return a JSON response with all the relevant values, per source:
{ "users/1/visits": { "rvn1": 21, "rvn2": 22 }, "users/1/posts": { "rvn1": 2, "rvn3": 1 } }

POST /counters/increment allows incrementing counters. The request is a JSON array of counter names and the change for each.

For a real system, you'll probably need a lot more stuff: metrics, stats, etc. But this is the high-level design, so this would be enough. Note that we are skipping the high-performance stream-based writes we outlined for time series. We probably won't need them, so that doesn't matter, but they are an option if we do.

System Behavior

This is where it is really not interesting; there is very little behavior here, actually. We only have to read the data from the storage, sum it up, and send it to the user. Hardly what I'd call business logic.

Client API

The client API will probably look something like this:

counters.Increment("users/1/posts");
counters.Increment("users/1/visits", 4);

using (var batch = counters.Batch())
{
    batch.Increment("users/1/posts");
    batch.Increment("users/1/visits", 5);
    batch.Submit();
}

Note that we're offering both batch and single APIs. We'll likely also want to offer a fire-and-forget style, which will be able to offer even better performance (because it could do batching across more than a single thread), but that is out of scope for now. For simplicity's sake, the client is going to be just a container for all of the endpoints it knows about. The container would be responsible for updating the client-visible topology, selecting the best server to use at any given point, etc.

User Interface

There isn't much to it. Just show the counter values in a list. Allow searching by prefix, allow diving into a particular counter and reading its raw values, but that is about it. Oh, and allow deleting a counter.

Deleting Data

Honestly, I really hate deletes. They are very expensive to handle properly the moment you have more than a single node. In this case, there is an inherent race condition between a delete going out and another node getting an increment. And then there is the issue of what happens if you had a node down when you did the delete, etc. This just sucks. Deletions are handled normally (with the race condition caveat, obviously), and I'll discuss how we replicate them in a bit.

High Availability / Scale Out

By definition, we actually don't want storage-level replication here, whether log shipping or consensus based. We actually do want to have different values, because we are going to be modifying things independently on many servers. That means that we need to do replication at the database level. And that leads to some interesting questions. Again, the hard part here is the deletes. Actually, the really hard part is what we are going to do with the new server problem.

The new server problem dictates how we are going to bring a new server into the cluster. If we could fix the size of the cluster, that would make things a lot easier. However, we are actually interested in being able to dynamically grow the cluster size. Therefore, there are only two real ways to do it:

  • Add a new empty node to the cluster, and have it be filled from all the other servers.
  • Add a new node by backing up an existing node, and restoring it as a new node.

RavenDB, for example, follows the first option. But it means that it needs to track a lot more information. The second option is actually a lot simpler, because we don't need to care about keeping around old data. However, this means that the process of bringing up a new server would now be:

  • Update all nodes in the cluster with the new node's address (the node isn't up yet; replication to it will fail and be queued).
  • Back up an existing node and restore it at the new node.
  • Start the new node.

The order of steps is quite important, and it would be easy to get it wrong. Also, on large systems, backup & restore can take a long time. Operationally speaking, I would much rather just be able to bring a new node into the cluster in "silent" mode. That is, it would get information from all the other nodes, and I can "flip the switch" and make it visible to clients at any point in time. That is how you do it with RavenDB, and it is an incredibly powerful system when used properly.

That means that, for all intents and purposes, we don't do real deletes. What we'll actually do is replace the counter value with a delete marker. This turns deletes into a much simpler "just another write". It has the sad implication of not freeing disk space on deletes, but deletes tend to be rare, and it is usually fine to add a "purge" admin option that can be run on an as-needed basis. But that brings us to an interesting issue: how do we actually handle replication?

The Topology Map

To simplify things, we are going to go with one-way replication from one node to another. That allows complex topologies like master-master, cluster-cluster, replication chains, etc. But in the end, this is all about a single node replicating to another. The first question to ask is: are we going to replicate just our local changes, or are we going to have to replicate external changes as well? The problem with replicating external changes is that you may have the following topology: server A got a value and sent it to server B. Server B then forwarded it to server C. However, at that point, we also have the value from server A replicated directly to server C. Which value is it supposed to pick? And what about a scenario where you have a more complex topology? In general, because in this type of system we can have any node accept writes (and we actually desire this to be the case), we don't want this behavior. We want to replicate only local data, not all the data.

Of course, that leads to an annoying question: what happens if we have a 3-node cluster and one node fails catastrophically? We can bring a new node in, and the other two nodes will be able to fill in their values via replication, but what about the node that is down? The data isn't gone; it is still right there in the other two nodes, but we need a way to pull it out. Therefore, I think the best option would be to say that nodes only replicate their local state, except in the case of a new node. A new node will be told the address of an existing node in the cluster, at which point it will:

  • Register itself in all the nodes in the cluster (discoverable from the existing node). This assumes a standard two-way replication link between all servers; if this isn't the case, the operators have the responsibility to set up the actual replication semantics on their own.
  • Start getting updates from all the nodes in the cluster. It keeps them in a log for now, not doing anything with them yet.
  • Ask that node for a complete update of all of its current state.
  • When it has the complete state of the existing node, replay all of the remembered log entries that it didn't have a chance to apply yet.
  • Announce that it is in a valid state to start accepting client connections.

Note that this process is likely to be very sensitive to high data volumes. That is why you'll usually want to select a backup node to read from, and that decision is an ops decision. You'll also want to be able to report extensively on the current status of the node, since this can take a while, and ops will be watching this very closely.

Server Name

A node requires a unique name. We can use GUIDs, but those aren't readable; we could use machine name + port, but those can change. Ideally, we can require the user to set us up with a unique name. That is important for readability and for being able to see all the values we have in all the nodes. It is important that names are never repeated, so we'll probably have a GUID there anyway, just to be on the safe side.

Actual Replication Semantics

Since we have the new server problem down to an automated process, we can choose the drastically simpler model of just having an internal queue per replication destination. Whenever we make a change, we also make a note of that in the queue for that destination, then we start an async replication process to that server, sending all of our updates there. It is always safe to overwrite data using replication, because we are overwriting our own data, never anyone else's.

And… that is about it, actually. There are probably a lot of details that I am missing / would discover if we were to actually implement this, but I think this gives a pretty good idea of what this feature is about.
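To make the sub-counter idea described above concrete, here is a small in-memory sketch (mine, not from the post; a dictionary stands in for the Voron storage, and the API only loosely mirrors ICounterStorage):

using System;
using System.Collections.Concurrent;
using System.Linq;

// Each node increments only its own source entry; replication overwrites the
// sender's slot, so a read is always the sum of all known sub counters.
class CounterNode
{
    private readonly string _source; // this node's unique name
    private readonly ConcurrentDictionary<string, ConcurrentDictionary<string, long>> _counters
        = new ConcurrentDictionary<string, ConcurrentDictionary<string, long>>();

    public CounterNode(string source) { _source = source; }

    public void LocalIncrement(string name, long change = 1)
    {
        var sources = _counters.GetOrAdd(name, _ => new ConcurrentDictionary<string, long>());
        sources.AddOrUpdate(_source, change, (_, current) => current + change);
    }

    // An update arriving via replication overwrites the remote node's sub counter only.
    public void ReplicatedUpdate(string name, string source, long value)
    {
        var sources = _counters.GetOrAdd(name, _ => new ConcurrentDictionary<string, long>());
        sources[source] = value;
    }

    public long Read(string name) =>
        _counters.TryGetValue(name, out var sources) ? sources.Values.Sum() : 0;
}

class Program
{
    static void Main()
    {
        var rvn1 = new CounterNode("rvn1");
        rvn1.LocalIncrement("users/1/visits", 21);
        rvn1.ReplicatedUpdate("users/1/visits", "rvn2", 22); // update arriving from another node
        Console.WriteLine(rvn1.Read("users/1/visits"));      // 43
    }
}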
March 25, 2014
· 11,227 Views · 1 Like
Time Series Feature Design: The Consensus has dRafted a Decision
So, after reaching the conclusion that replication is going to be hard, I went back to the office, discussed those challenges, and was in general pretty annoyed by it. Then Michael made a really interesting suggestion: why not put it on Raft? And once he explained what he meant, I really couldn't hold my excitement. We now have a major feature for 4.0. But before I get excited about that (we'll only be able to actually start working on it in a few months, anyway), let us talk about what the actual suggestion was.

Raft is a consensus algorithm. It allows a distributed set of computers to arrive at a mutually agreed-upon sequence of log records. Hm… I wonder where else we can find sequential log records. Yes, I am looking at you, Voron.Journal. The basic idea is that we can take the concept of log shipping, but instead of having a single master/slave relationship, we change things so we can put Raft in the middle. When committing a transaction, we'll hold off committing it until we have a Raft consensus that it should be committed.

The advantage here is that we won't be constrained any longer by the master/slave issue. If a server is down, we can still process requests (we may need to elect a new cluster leader, but that is about it). That means that, from an architectural standpoint, we'll have the ability to process write requests on any quorum (N/2+1). That is a pretty standard requirement for distributed databases, so that is perfectly fine. That is a pretty awesome thing to have, to be honest, and more importantly, this is happening at the low-level storage layer. That means that we can apply this behavior not just to a single database solution, but to many database solutions. I'm pretty excited about this.
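As a rough illustration of the N/2+1 rule mentioned above (a sketch of the idea only, not RavenDB or Raft code), a commit would only be acknowledged once a majority of nodes have confirmed the corresponding log entry:

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class QuorumCommit
{
    // Resolves to true once a majority (N/2 + 1) of nodes have acknowledged the
    // log entry; each task represents one node persisting the entry.
    public static async Task<bool> WaitForQuorumAsync(IEnumerable<Task<bool>> nodeAcks)
    {
        var pending = nodeAcks.ToList();
        int needed = pending.Count / 2 + 1;
        int acks = 0;

        while (pending.Count > 0 && acks < needed)
        {
            var finished = await Task.WhenAny(pending);
            pending.Remove(finished);
            if (await finished)
                acks++;
        }
        return acks >= needed;
    }
}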
March 19, 2014
· 1,739 Views · 0 Likes
Voron & Time Series Data: Getting Real Data Outputs
So far, we have just put the data in and pulled it out, and we have had a pretty good track record doing so. However, what do we do with the data now that we have it? As you can expect, we need to read it out, usually by specific date ranges. The interesting thing is that we usually are not interested in just a single channel; we care about multiple channels. And for fun, those channels might be synchronized or not. An example of the first might be the current speed and the current engine temperature in a car: they generally share the exact same timestamps. An example of out-of-sync channels is when you have a sensor on a rooftop measuring rainfall and another sensor in the sewer measuring water flow rates. (Again, thanks to Dan for helping me with the domain.)

This is interesting, because it presents quite a few problems:

  • We need to merge different streams into a unified view.
  • We need to handle both matching and non-matching sequences.
  • We need to handle erroneous data. What happens when we have two readings for the same time for the same sensor? Yes, that shouldn't happen, but it does.

I solved this with the following API:

public class RangeEntry
{
    public DateTime Timestamp;
    public double?[] Values;
}

IEnumerable<RangeEntry> results = dts.ScanRanges(DateTime.MinValue, DateTime.MaxValue, new[]
{
    "6febe146-e893-4f64-89f8-527f2dbaae9b",
    "707dcb42-c551-4f1a-9203-e4b0852516cf",
    "74d5bee8-9a7b-4d4e-bd85-5f92dfc22edb",
    "7ae29feb-6178-4930-bc38-a90adf99cfd3",
});

This API gives me the results in time order, with the values in the same positions as the ids requested, and nulls where there isn't a value matching that time in that particular sensor channel. The actual implementation relies on this method:

IEnumerable<Entry> ScanRange(DateTime start, DateTime end, string id)

All this does is provide all the entries in a particular date range, for a particular channel. Let us see how we implement multi-channel scanning on top of this:

private class PendingEnumerator
{
    public IEnumerator<Entry> Enumerator;
    public int Index;
}

private class PendingEnumerators
{
    private readonly SortedDictionary<DateTime, List<PendingEnumerator>> _values =
        new SortedDictionary<DateTime, List<PendingEnumerator>>();

    public void Enqueue(PendingEnumerator entry)
    {
        List<PendingEnumerator> list;
        var dateTime = entry.Enumerator.Current.Timestamp;
        if (_values.TryGetValue(dateTime, out list) == false)
        {
            _values.Add(dateTime, list = new List<PendingEnumerator>());
        }
        list.Add(entry);
    }

    public bool IsEmpty
    {
        get { return _values.Count == 0; }
    }

    public List<PendingEnumerator> Dequeue()
    {
        if (_values.Count == 0)
            return new List<PendingEnumerator>();

        var kvp = _values.First();
        _values.Remove(kvp.Key);
        return kvp.Value;
    }
}

public IEnumerable<RangeEntry> ScanRanges(DateTime start, DateTime end, string[] ids)
{
    if (ids == null || ids.Length == 0)
        yield break;

    var pending = new PendingEnumerators();
    for (int i = 0; i < ids.Length; i++)
    {
        var enumerator = ScanRange(start, end, ids[i]).GetEnumerator();
        if (enumerator.MoveNext() == false)
            continue;
        pending.Enqueue(new PendingEnumerator { Enumerator = enumerator, Index = i });
    }

    var result = new RangeEntry { Values = new double?[ids.Length] };
    while (pending.IsEmpty == false)
    {
        Array.Clear(result.Values, 0, result.Values.Length);
        var entries = pending.Dequeue();
        if (entries.Count == 0)
            break;

        foreach (var entry in entries)
        {
            var current = entry.Enumerator.Current;
            result.Timestamp = current.Timestamp;
            result.Values[entry.Index] = current.Value;
            if (entry.Enumerator.MoveNext())
                pending.Enqueue(entry);
        }
        yield return result;
    }
}

We are getting a single entry from each channel into the pending enumerators. Then we collate all the entries that share the same time into a single entry. We use the Index property to track the expected index of the entry in the output, and we handle duplicate times in the same channel by outputting multiple entries. Testing this on my 1.1 million record data set, we can get 185 thousand records back in 0.15 seconds.
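A hypothetical consumer of that API (my example, not from the post; it assumes the dts instance, the usual usings, and the ScanRanges signature shown above) could dump the merged view as CSV:

// Hypothetical usage of ScanRanges: print a unified, time-ordered CSV view of two channels.
var channelIds = new[]
{
    "6febe146-e893-4f64-89f8-527f2dbaae9b",
    "707dcb42-c551-4f1a-9203-e4b0852516cf",
};

Console.WriteLine("timestamp," + string.Join(",", channelIds));
foreach (var entry in dts.ScanRanges(DateTime.MinValue, DateTime.MaxValue, channelIds))
{
    // A null slot means that channel had no reading at that timestamp.
    var cells = entry.Values.Select(v => v.HasValue ? v.Value.ToString("0.###") : "");
    Console.WriteLine(entry.Timestamp.ToString("o") + "," + string.Join(",", cells));
}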
February 25, 2014
· 4,878 Views · 0 Likes
Voron and the FreeDB Dataset
I got tired of doing arbitrary performance testing, so I decided to take the FreeDB dataset and start working with that. FreeDB is a data set used to look up CD information based on a nearly unique disk id. This is a good dataset because it contains a lot of data (over three million albums, and over 40 million songs), and it is production data. That means that it is dirty. This makes it perfect for running all sorts of interesting scenarios.

The purpose of this post (and maybe the next few) is to show off a few things. First, we want to see how Voron behaves with a realistic data set. Second, we want to show off the way Voron works, its API, etc.

To start with, I ran my FreeDB parser, pointing it at /dev/null. The idea is to measure what the cost of just going through the data is. We are using freedb-complete-20130901.tar.bz2 from Sep 2013. After 1 minute, we went through 342,224 albums, and after 6 minutes we were at 2,066,871 albums. Reading the whole 3,328,488 albums took a bit over ten minutes. So just the cost of parsing and reading the FreeDB dataset is pretty expensive. The end result is a list of album objects.

Now, let us see how we want to actually use this. We want to be able to:

  • Look up an album by its disk ids.
  • Look up all the albums by an artist*.
  • Look up albums by album title*.

* This gets interesting, because we need to deal with questions such as: "Given Pearl Jam, if I search for pearl, do I get them? Do I get them for jam?" For now, we are going to go with case insensitive; we won't be doing full text search, but we will allow prefix searches.

We are using the following abstraction for the destination:

public abstract class Destination
{
    public abstract void Accept(Disk d);
    public abstract void Done();
}

Basically, we read data as fast as we can and shove it to the destination until we are done. Here is the Voron implementation:

public class VoronDestination : Destination
{
    private readonly StorageEnvironment _storageEnvironment;
    private WriteBatch _currentBatch;
    private readonly JsonSerializer _serializer = new JsonSerializer();
    private int counter = 1;

    public VoronDestination()
    {
        _storageEnvironment = new StorageEnvironment(StorageEnvironmentOptions.ForPath("FreeDB"));
        using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.ReadWrite))
        {
            _storageEnvironment.CreateTree(tx, "albums");
            _storageEnvironment.CreateTree(tx, "ix_artists");
            _storageEnvironment.CreateTree(tx, "ix_titles");
            tx.Commit();
        }
        _currentBatch = new WriteBatch();
    }

    public override void Accept(Disk d)
    {
        var ms = new MemoryStream();
        _serializer.Serialize(new JsonTextWriter(new StreamWriter(ms)), d);
        ms.Position = 0;

        var key = new Slice(EndianBitConverter.Big.GetBytes(counter++));
        _currentBatch.Add(key, ms, "albums");

        if (d.Artist != null)
            _currentBatch.MultiAdd(d.Artist.ToLower(), key, "ix_artists");
        if (d.Title != null)
            _currentBatch.MultiAdd(d.Title.ToLower(), key, "ix_titles");

        if (counter % 1000 == 0)
        {
            _storageEnvironment.Writer.Write(_currentBatch);
            _currentBatch = new WriteBatch();
        }
    }

    public override void Done()
    {
        _storageEnvironment.Writer.Write(_currentBatch);
    }
}

Let us go over this in detail, shall we? In the constructor, we create a new storage environment. In this case, we want to just import the data, so we can create the storage inline. Then we create the relevant trees. You can think about Voron trees in a manner very similar to the way you think about tables: they are a way to separate data into different parts of the storage. Note that this all still resides in a single file, so there isn't a physical separation. We created an albums tree, which will contain the actual data, and ix_artists and ix_titles trees, which are indexes into the albums tree; you can see them being used just a little lower.

In the Accept method, you can see that we use a WriteBatch, a native Voron notion that allows us to batch multiple operations into a single transaction. In this case, for every album, we are making three writes. First, we write all of the data, as a JSON string, into a stream and put it in the albums tree, using a simple incrementing integer as the actual album key. Then we add the artist and title entries (lower cased, so we don't have to worry about case sensitivity in searches) into the relevant indexes.

At 60 seconds, we had written 267,998 values to Voron. In fact, I explicitly designed it so we can see the relevant metrics. At 495 seconds, we had read 1,995,385 entries from the FreeDB file, parsed 1,995,346 of them, and written 1,610,998 to Voron. As you can imagine, each step runs in a dedicated thread, so we can see how they behave on an individual basis. The good thing about this is that I can physically see the various costs, which is actually pretty cool.

Looking at the Voron directory at 60 seconds: we have two journal files active (they haven't been applied to the data file yet), and the db.voron file is at 512 MB. The compression buffer is at 32 MB (this is usually twice as big as the biggest transaction, uncompressed). The scratch buffer is used to hold in-flight transaction information (until we send it to the data file), and it is sitting at 256 MB.

At 15 minutes, we have the following numbers: 3,035,452 entries read from the file, 3,035,426 parsed, and 2,331,998 written to Voron. Note that we are reading the file and writing to Voron on the same disk, so that might impact the read performance. At that time, we can see on disk that we increase the size of most of our files by a factor of 2, so some of the space in the db.voron file is probably not used, and that we needed more scratch space to handle the in-flight information.

The entire process took 22 minutes, start to finish, although I have to note that this hasn't been optimized at all, and I know we are doing a lot of stupid stuff throughout. You might have noticed something else: we actually "crashed" the Voron db closed. This was done to see what would happen when we open a relatively large db after an unordered shutdown. We'll actually get to play with the data in my next post; so far this has been pretty much just to see how things behave.

And… I just realized something: I forgot to actually add an index on disk id, which means that I have to import the data again. But before that, I also wrote the following:

public class JsonFileDestination : Destination
{
    private readonly GZipStream _stream;
    private readonly StreamWriter _writer;
    private readonly JsonSerializer _serializer = new JsonSerializer();

    public JsonFileDestination()
    {
        _stream = new GZipStream(
            new FileStream("freedb.json.gzip", FileMode.CreateNew, FileAccess.ReadWrite),
            CompressionLevel.Optimal);
        _writer = new StreamWriter(_stream);
    }

    public override void Accept(Disk d)
    {
        _serializer.Serialize(new JsonTextWriter(_writer), d);
        _writer.WriteLine();
    }

    public override void Done()
    {
        _writer.Flush();
        _stream.Dispose();
    }
}

This completed in ten minutes, for 3,328,488 entries, or a rate of about 5,538 per second. The result is an 845 MB gzip file. I had twofold reasons to want to do this. First, this gave me something to compare ourselves to; more to the point, I can re-use this gzip file for my next tests, without having to go through the expensive parsing of the FreeDB file. I did just that and ended up with the following:

public class VoronEntriesDestination : EntryDestination
{
    private readonly StorageEnvironment _storageEnvironment;
    private WriteBatch _currentBatch;
    private int counter = 1;

    public VoronEntriesDestination()
    {
        _storageEnvironment = new StorageEnvironment(StorageEnvironmentOptions.ForPath("FreeDB"));
        using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.ReadWrite))
        {
            _storageEnvironment.CreateTree(tx, "albums");
            _storageEnvironment.CreateTree(tx, "ix_diskids");
            _storageEnvironment.CreateTree(tx, "ix_artists");
            _storageEnvironment.CreateTree(tx, "ix_titles");
            tx.Commit();
        }
        _currentBatch = new WriteBatch();
    }

    public override int Accept(string d)
    {
        var disk = JObject.Parse(d);

        var ms = new MemoryStream();
        var writer = new StreamWriter(ms);
        writer.Write(d);
        writer.Flush();
        ms.Position = 0;

        var key = new Slice(EndianBitConverter.Big.GetBytes(counter++));
        _currentBatch.Add(key, ms, "albums");
        int count = 1;

        foreach (var diskId in disk.Value<JArray>("DiskIds"))
        {
            count++;
            _currentBatch.MultiAdd(diskId.Value<string>(), key, "ix_diskids");
        }

        var artist = disk.Value<string>("Artist");
        if (artist != null)
        {
            count++;
            _currentBatch.MultiAdd(artist.ToLower(), key, "ix_artists");
        }
        var title = disk.Value<string>("Title");
        if (title != null)
        {
            count++;
            _currentBatch.MultiAdd(title.ToLower(), key, "ix_titles");
        }

        if (counter % 100 == 0)
        {
            _storageEnvironment.Writer.Write(_currentBatch);
            _currentBatch = new WriteBatch();
        }
        return count;
    }

    public override void Done()
    {
        _storageEnvironment.Writer.Write(_currentBatch);
        _storageEnvironment.Dispose();
    }
}

Now we are actually properly disposing of things, and I also decreased the size of the batch to see how it would respond. Note that it is now being fed directly from the gzip file, at a greatly reduced cost. I also added tracking not only for how many albums we write, but also for how many entries. By entries I mean how many Voron entries (which include the values we add to the indexes). I did find a bug where we would just double the file size without due consideration to its current size, so now we are doing smaller file size increases.

Word of warning: I didn't realize until after I was done with all the benchmarks that I had actually run all of them in the Debug configuration, which basically means the numbers are useless as a performance metric. That is especially true because we have a lot of verifier code that runs in debug mode. So please don't take those numbers as actual performance metrics; they aren't valid.

Time          # of albums    # of entries
4 minutes     773,398        3,091,146
6 minutes     1,126,998      4,504,550
8 minutes     1,532,858      6,126,413
18 minutes    2,781,698      11,122,799
24 minutes    3,328,488      13,301,496

Midway through the run, you can see on the file system that we now increase the file in smaller increments, and that we are using more scratch space, probably because we are under very heavy write load. After the run: scratch & compression buffers are only used when the database is running, and they are deleted on close. The database is 7 GB in size, which is quite respectable. Now, on to working with it, but I'll save that for my next post; this one is long enough already.
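The post doesn't show the reader that feeds VoronEntriesDestination from the gzip file; a minimal sketch of that side (my reconstruction, assuming one JSON document per line, exactly as JsonFileDestination writes them) could be:

using System.IO;
using System.IO.Compression;

static class GzipJsonFeeder
{
    // Reads freedb.json.gzip line by line and pushes each JSON document into the
    // destination, mirroring what JsonFileDestination wrote out.
    public static void Feed(string path, EntryDestination destination)
    {
        using (var file = File.OpenRead(path))
        using (var gzip = new GZipStream(file, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                destination.Accept(line);
            destination.Done();
        }
    }
}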
February 20, 2014
· 3,344 Views · 0 Likes
Voron & Time Series: Working with Real Data
Dan Liebster has been kind enough to send me a real-world time series database. The data has been sanitized to remove identifying issues, but it is actually real-world data, so we can learn a lot more from it.

The first thing that I did was take the code in this post and try it out for size. I wrote the following:

int i = 0;
using (var parser = new TextFieldParser(@"c:\users\ayende\downloads\timeseries.csv"))
{
    parser.HasFieldsEnclosedInQuotes = true;
    parser.Delimiters = new[] { "," };
    parser.ReadLine(); // ignore headers
    var startNew = Stopwatch.StartNew();
    while (parser.EndOfData == false)
    {
        var fields = parser.ReadFields();
        Debug.Assert(fields != null);

        dts.Add(fields[1],
            DateTime.ParseExact(fields[2], "o", CultureInfo.InvariantCulture),
            double.Parse(fields[3]));

        i++;
        if (i == 25 * 1000)
        {
            break;
        }
        if (i % 1000 == 0)
            Console.Write("\r{0,15:#,#} ", i);
    }
    Console.WriteLine();
    Console.WriteLine(startNew.Elapsed);
}

Note that we are using a separate transaction per line, which means that we are really doing a lot of extra work. But this simulates incoming events arriving one at a time very well. We were able to process 25,000 events in 8.3 seconds, at a rate of just over 3 events per millisecond.

Now, note that we have in here the notion of "channels". From my investigation, it seems clear that some form of separation is very common in time series data. We are usually talking about sensors or some such, and we want to track data across different sensors over time. And there is little if any call for working over multiple sensors / channels at the same time. Because of that, I made a relatively minor change in Voron that allows it to have an infinite number of separate trees, which means we can model a channel as a tree in Voron.

I also changed things so that instead of doing a single transaction per line, we do a transaction per 1,000 lines. That dropped the time to insert 25,000 lines to 0.8 seconds, a full order of magnitude faster. That done, I inserted the full data set, which is just over 1,096,384 records. That took 36 seconds.

In the data set I have, there are 35 channels. I just tried, and reading all the entries in a channel with 35,411 events takes 0.01 seconds. That allows doing things like computing averages over time, comparing data, etc. You can see the code implementing this in the following link.
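As an example of the kind of query this enables (my sketch, assuming the ScanRange(start, end, id) method from the earlier posts in this series and a made-up channel id):

// Hypothetical: compute the average value of one channel over a date range,
// using the ScanRange API sketched in the earlier time series posts.
var start = new DateTime(2014, 1, 1);
var end = new DateTime(2014, 2, 1);

double sum = 0;
int count = 0;
foreach (var entry in dts.ScanRange(start, end, "rooftop-rainfall"))
{
    sum += entry.Value;
    count++;
}

Console.WriteLine(count == 0 ? "no data" : $"avg: {sum / count}");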
February 7, 2014
· 3,530 Views · 0 Likes
Big Data Search, Part 6: Sorting Randomness
As it turns out, doing work on big data sets is quite hard. To start with, you need to get the data, and it is… well, big. So that takes a while. Instead, I decided to test my theory on the following scenario: given 4 GB of random numbers, let us find how many times we have the number 1. Because I wanted to ensure a consistent answer, I wrote:

public static IEnumerable<uint> RandomNumbers()
{
    const long count = 1024 * 1024 * 1024L * 1;
    var random = new MyRand();
    for (long i = 0; i < count; i++)
    {
        if (i % 1024 == 0)
        {
            yield return 1;
            continue;
        }
        var result = random.NextUInt();
        while (result == 1)
        {
            result = random.NextUInt();
        }
        yield return result;
    }
}

/// <summary>
/// Based on Marsaglia, George. (2003). Xorshift RNGs.
/// http://www.jstatsoft.org/v08/i14/paper
/// </summary>
public class MyRand
{
    const uint Y = 842502087, Z = 3579807591, W = 273326509;
    uint _x, _y, _z, _w;

    public MyRand()
    {
        _y = Y;
        _z = Z;
        _w = W;
        _x = 1337;
    }

    public uint NextUInt()
    {
        uint t = _x ^ (_x << 11);
        _x = _y; _y = _z; _z = _w;
        return _w = (_w ^ (_w >> 19)) ^ (t ^ (t >> 8));
    }
}

I am using a custom Rand function because it is significantly faster than System.Random. This generates 4 GB of random numbers and also ensures that we get exactly 1,048,576 instances of 1. Generating this in an empty loop takes about 30 seconds on my machine.

For fun, I ran the external sort routine in 32-bit mode, with a buffer of 256 MB. It is currently processing things, but I expect it to take a while. Because the buffer is 256 MB in size, we flush it every 128 MB (while we still have half the buffer free to do more work). The interesting thing is that even though we generate random numbers, sorting and then compressing the values resulted in about a 60% compression rate. The problem is that, for this particular case, I am not sure if that is a good thing. Because the values are random, we need to select a pretty high degree of compression just to get a good compression rate, and because of that, a significant amount of time is spent just compressing the data. I am pretty sure that it would be better for real-world scenarios, but that is something that we'll probably need to test. Not compressing the data in the random test is a huge help.

Next, external sort is pretty dependent on the performance of… sort, of course. And sort isn't that fast. In this scenario, we are sorting arrays of about 26 million items, and that takes time. Implementing a parallel sort cut this down to less than a minute per batch of 26 million. That let us complete the entire process, but then it halts at the merge. The reason for that is that we push all the values into a heap, and there are 1 billion of them. Now, the heap never exceeds 40 items, but those are still 1 billion * O(log 40), or about 5.4 billion comparisons that we have to do, and we do this sequentially, which takes time. I tried thinking about ways to parallelize this, but I am not sure how that can be done. We have 40 sorted files, and we want to merge all of them. Obviously we can sort each set of 10 files in parallel, then sort the resulting 4, but the cost we have now is the actual sorting cost, not I/O. I am not sure how to approach this. For what it is worth, you can find the code for this here.
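For reference, the merge phase described above boils down to a k-way merge over the sorted runs. Here is a minimal sketch (mine, not the post's code) that uses a simple linear scan for the smallest head value, which is cheap for k = 40; a binary heap would give the O(log k) per-item behavior mentioned in the post:

using System;
using System.Collections.Generic;
using System.Linq;

static class KWayMerge
{
    // Merges k already-sorted sequences into one sorted sequence.
    public static IEnumerable<uint> Merge(IEnumerable<uint>[] sortedRuns)
    {
        // One cursor per non-empty run, positioned on its first element.
        var cursors = sortedRuns.Select(run => run.GetEnumerator())
                                .Where(e => e.MoveNext())
                                .ToList();
        while (cursors.Count > 0)
        {
            // Find the cursor currently holding the smallest value.
            var smallest = cursors[0];
            foreach (var cursor in cursors)
                if (cursor.Current < smallest.Current)
                    smallest = cursor;

            yield return smallest.Current;

            if (smallest.MoveNext() == false)
                cursors.Remove(smallest);
        }
    }
}

// Example: counting how many times 1 appears in the merged output.
// long ones = KWayMerge.Merge(runs).LongCount(v => v == 1);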
February 5, 2014
· 8,571 Views · 0 Likes

Trend Reports

Trend Report

Database Systems

Every modern application and organization collects data. With that comes a constant demand for database systems to expand, scale, and take on more responsibilities. Database architectures have become more complex, and as a result, there are more implementation choices. An effective database management system allows for quick access to database queries so that an organization can efficiently make informed decisions. So how does one effectively scale a database system without sacrificing its quality? Our Database Systems Trend Report offers answers to this question by providing industry insights into database management selection and evaluation criteria. It also explores database management patterns for microservices, relational database migration strategies, time series compression algorithms and their applications, advice on the best data governance practices, and more. The goal of this report is to set up organizations for scaling success.


Comments

Data Management in Complex Systems

Sep 28, 2022 · Marcy Tillman

I don't really like that approach, because where the data is sitting is a critical aspect.

If you are putting the data behind an API call, that means that a lot of functionality is lost.

You can't do your own filtering or aggregation, can't join between your own data and published one, etc. It also ties your own availability to the API endpoint. And eventually, you create a true mesh, where a single node failing brings down the whole system.

Independent and isolated pieces are far healthier in my eyes.
