Machine Learning Kantei

Hoshi · March 28, 2017

Hi all,

I'm testing the waters to see whether enough data is available to try a slightly exotic project. I do a fair bit of work in data science and I'd be interested in training a classifier to make approximate nihonto attributions.

For this sort of project, one needs a lot of data and it needs to be labelled. Something realistic would be to have a database of Oshigata accompanied by periods (Koto, Shinto, etc.). It's unclear at this stage how many data points would be needed, but close to five hundred representative oshigata per period would be a good start to achieve accuracy better than an obvious guess.

Another feasible project would be Gokaden attributions.

Smith school attributions would require high-resolution images with proper lighting, and additional information that can't easily be captured by camera in "standard" images (e.g. presence/absence of Utsuri, mune thickness, etc.). And lots and lots of data.

I'd be willing to undertake this little project if there is sufficient interest. Of course, it would require someone who has, or is willing to digitize, a vast quantity of Oshigata for this purpose.

One solution I have been considering would be to use AoiArt as a database, and continuously download their labelled data with a web crawler (with their permission). Perhaps over time we'd have a large enough dataset, but given the turnover it would take a very long time.

In principle, if we had all the existing data on swords with high quality pics and accompanying information about elements that are not present in the picture, we could have an interesting new Shinsa method. After all, what judges do is a form of very sophisticated classifying. But, in all likelihood, we will never get such an immensely valuable dataset. Nihonto is unlikely to move into the age of big data anytime soon.

One of the most interesting aspects of training such a classifier is that we know a lot about the proper order of judgement leading to attribution. This can be built into the system.

I'd be interested in pursuing this as a side hobby if there is sufficient interest.

Let me know what you think.

vajo · March 28, 2017

A very nice idea. Is there also a possibility to do OCR from a nakago picture? That would be great.

Hoshi · March 28, 2017

It's possible, of course. But it is probably better handled by a native Japanese speaker, in terms of setting up the pipeline, and so on. I'm sure there is a lot of OCR work being done on recognizing Japanese/Chinese characters in calligraphy.

seattle1 · March 28, 2017

Hello:

With the primacy of sugata and jihada being foremost in kantei, a focus oriented towards yakiba, if that is your intention, leads to a lot of misdirection.

Arnold F.

Brian · March 28, 2017

To rephrase what Arnold is saying, a large part of kantei and attribution comes from things that cannot be seen in oshigata.
You could try to describe it, but not sure it would be very successful putting it into words. Difficult to qualify or quantify many of the aspects of art. That said...if we can assist in any way, would be glad to help.

John A Stuart · March 28, 2017

It would take all the fun out of it and it is that which does not fit the normal data file that makes it a riddle sometimes. John

Hoshi · March 28, 2017

Thank you for your input.

I think I have a far easier solution to this little project. AoiJapan.net has a collection of 1600+ swords. I wouldn't use the raw images, but would mine the text provided into a comprehensive dataset. It wouldn't be the exercise in image classification I had in mind, but it would certainly provide interesting insights and a better ratio of accuracy to effort invested.

Since when has AoiJapan.net been operating, btw? I recall discovering this treasure trove only recently. It's very unusual for sellers to be so transparent on things such as price realized, as this is the sort of "private" information used to provide an edge in trading. It's valuable.

Carlo Giuseppe Tacchini · March 28, 2017

On the Italian Nihonto board there is a member who, over many years of attempts, coded a program in which you enter the inputs (length, width, hada, sori, etc.) and it gives period, school and smith. Many times it scores atari, or gives a very reasonable reply when wrong. I'm no longer on that board for personal reasons, but perhaps the inventor will read these lines and decide to help or collaborate.

Anyway a member of that board outperformed the program every time.

SwordGuyJoe · March 28, 2017

Sounds like a fun little project.

Carlo Giuseppe Tacchini · March 28, 2017

SwordGuyJoe wrote: "Sounds like a fun little project."

It took several years of collaboration between some members of the board using the same sources suggested here. Not perfect, but the last time I visited the board (years ago...) they were fine-tuning it.

Hoshi · March 28, 2017

Thank you Carlo. That's very valuable information. No need to reinvent the wheel when you can learn from others.

Could you send me a link to the program, or the post discussing it?

Carlo Giuseppe Tacchini · March 28, 2017

Hoshi wrote: "Thank you Carlo. That's very valuable information. No need to reinvent the wheel when you can learn from others. Could you send me a link to the program, or the post discussing it?"

The contributors were many, but only the inventor had (has?) the whole thing. Inputs were entered by the inventor and replies given by him, and him alone. He never shared the program (at least while I was visiting the board), possibly because his aim was something commercial. It is for this reason that I made the call "if he reads these lines...". Anyway, wait a moment and I'll give you the link to the board.

Carlo Giuseppe Tacchini · March 28, 2017

Here you go : http://www.intk-token.it/forum/index.php?act=idx

If I remember correctly the programmer's name was Paolo, but don't take it for granted.

EDIT: Paolo Placidi is your man: http://www.intk-token.it/forum/index.php?showuser=98 but it seems he has had a long period away from the board.

Jussi Ekholm · March 28, 2017

Chris, I think there is already tons of written data about the characteristics of swords and what you should expect from various schools and smiths. I admit to sometimes being guilty of "using easy search" on some of my great references when a great hint is given: I focus on that single bit to weed out the crop (and I think your program would work in a similar fashion). Because I have a great reference library and not too good an eye (at least not yet), for me text based kantei can often be even easier than kantei by picture or eye. That is because text based kantei hints are usually made by a very experienced person, and he/she can give just the right hints to make it fun.

You will need a lot more reference blades than, for example, 500 Kotō blades. It will be a good start, but there are so many fine details that vary even within the same school, sometimes even among the works of the same smith.

I can see a text based kantei program being fairly possible to make, and as Carlo wrote, it seems it has already been done and tested by the INTK member. It is a massive project, but if you use something like 5+ "nihonto bibles" as your database you'll get a huge amount of raw data which your program would then use. I am not a computer guy so I don't know how you set search parameters etc., but it will be a huge project of data gathering and then lots and lots of tuning.

Hoshi · March 28, 2017

The data gathering is the issue.

But for a starting run, AoiJapan.net has 1600 blades with text-based descriptors. Quality labels. When I get the time I will build a simple web crawler and see if I can extract the data from the website.
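To give a feel for the extraction step, here is a minimal sketch of turning one listing's text into a structured record. The "Field : value" layout and the sample text are assumptions made up for this example, not the site's actual format, and parse_listing is a hypothetical helper name. Downloading the pages is left out on purpose.

```python
import re

# Assumed layout: one "Field : value" pair per line. This is a guess made
# for illustration, not the real format of any dealer's listing.
FIELD_PATTERN = re.compile(r"^\s*([A-Za-z ]+?)\s*:\s*(.+?)\s*$")
NUMBER_WITH_UNIT = re.compile(r"^(\d+(?:\.\d+)?)\s*cm$")


def parse_listing(text):
    """Turn 'Field : value' lines into a dict; lines without a field are skipped."""
    record = {}
    for line in text.splitlines():
        match = FIELD_PATTERN.match(line)
        if not match:
            continue
        key = match.group(1).strip().lower().replace(" ", "_")
        value = match.group(2)
        number = NUMBER_WITH_UNIT.match(value)
        # Measurements in cm become floats; everything else stays as text.
        record[key] = float(number.group(1)) if number else value
    return record


# Invented sample text, used only to exercise the parser.
SAMPLE = """\
Blade length : 70.3 cm
Sori : 1.8 cm
Era : Early Edo period
Hamon : suguha
This line has no field and is skipped.
"""

record = parse_listing(SAMPLE)
print(record)
```

Real listings will be messier than this, so most of the effort would go into handling missing fields and inconsistent wording rather than into the parsing itself.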

It's not about giving a hit on a particular smith. I think that's way too advanced and would require a far, far larger database, which is outside the scope of this small project. The most time consuming part is gathering and cleaning the data. The rest is easy nowadays, with very powerful classifiers that run out of the box in Python or R.

How it works is quite simple.

You'd have a dataset with blade characteristics as predictors, or features (X's), and period/school/what have you as dependent variables, or labels (Y's). From there you train the classifier on a portion of the data, and predict the held-out portion. Depending on the classifier, you'd have a probability assigned to each category. That would be a very simple first run. If it's promising then it can be extended.
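As a rough illustration of that workflow, here is a minimal sketch in plain Python. Everything in the toy dataset is invented for the example (the feature names, the values, and how they relate to period are not real kantei rules), and a small hand-rolled categorical naive Bayes stands in for whatever off-the-shelf classifier would actually be used.

```python
import math
import random
from collections import Counter, defaultdict


def make_toy_data(n_per_class=40, seed=0):
    """Build invented (features, label) rows; the profiles are NOT real kantei rules."""
    rng = random.Random(seed)
    profiles = {
        "Koto": {"sori": ["deep", "deep", "medium"],
                 "hamon": ["choji", "suguha", "suguha"]},
        "Shinto": {"sori": ["shallow", "shallow", "medium"],
                   "hamon": ["gunome", "toran", "suguha"]},
    }
    rows = []
    for label, profile in profiles.items():
        for _ in range(n_per_class):
            features = {name: rng.choice(values) for name, values in profile.items()}
            rows.append((features, label))
    rng.shuffle(rows)
    return rows


def train(rows):
    """Count labels and per-label feature values (categorical naive Bayes)."""
    label_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (label, feature) -> counts of values
    vocab = defaultdict(set)             # feature -> values seen in training
    for features, label in rows:
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            vocab[name].add(value)
    return label_counts, value_counts, vocab


def predict_proba(model, features):
    """Return a probability for each label, using Laplace smoothing."""
    label_counts, value_counts, vocab = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)
        for name, value in features.items():
            seen = value_counts[(label, name)][value]
            score += math.log((seen + 1) / (count + len(vocab[name])))
        log_scores[label] = score
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}


def accuracy(model, rows):
    """Fraction of rows whose most probable label matches the true label."""
    hits = 0
    for features, label in rows:
        probs = predict_proba(model, features)
        if max(probs, key=probs.get) == label:
            hits += 1
    return hits / len(rows)


data = make_toy_data()
split = int(0.75 * len(data))            # train on 75%, hold out 25%
train_rows, test_rows = data[:split], data[split:]
model = train(train_rows)
held_out_accuracy = accuracy(model, test_rows)
example = predict_proba(model, {"sori": "deep", "hamon": "choji"})
print("held-out accuracy:", held_out_accuracy)
print("probabilities for one blade:", example)
```

The point of scoring only the held-out rows is that the model never saw them during training, so the number says something about new blades rather than about memorised ones.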

Carlo Giuseppe Tacchini · March 29, 2017

Just the publications by Markus and the posts by Darcy alone mean months of work. I'd like to see when you'll start with Fujishiro and the other textbooks. Not to mention the contradicting information between usually reliable sources.

Carlo Giuseppe Tacchini · March 29, 2017

Hoshi wrote: "It's not about giving a hit on a particular smith, I think that's way too advanced and would require a far, far larger database"

It isn't impossible. Paolo did it.

Jussi Ekholm · March 31, 2017

I think we here might be able to give some assistance on database if needed.

If you are thinking about starting small, maybe you could focus at first on a smaller time period and maybe a few schools? For example, Shinshintō and the larger schools of that period? I was just looking at Connoisseurs and it has 14 different major topics for that period that you could start building on, using it as your base guide. I think a narrower field would be a lot easier to try for your test run. The reason I recommend a much smaller field at first is that there have been so many different swordmaking schools during the history of Japan; it is a monumental task to try to handle them all at the beginning.

Hoshi · April 2, 2017

Next step is to get in contact with Paolo. We'll see where to take it from here.

Jussi, I agree. But Shinshinto work isn't exactly easy to distinguish due to the steel - I was aiming for period as a start. If I can't get in touch with Paolo and get up to date with his project, I'll use the AOIart.net site to build a small database and explore what can be done with it.

Thanks everyone for your help!

Darcy · April 7, 2017

There are fundamental problems with the idea. The first is that kantei is half art, that is, being able to recognize quality when you see it and assign it to a smith of the right standing.

This is more than half the game. Sugata, jihada, etc. all of this is secondary.

Sugata can lie for a few reasons. The first is that the kantei sugata rules are all based on archetypes. They describe the archetype of a time period, and the person testing you is then responsible for showing you a blade that has archetypal features. Now if you know the archetypes you can assign it to the right period and go from there.

The real world is a lot more sloppy than the kantei game. Most of the time you don't get an archetype or you get one with overlapping evidence that means you can go one way or another.

Aoe can go to Kanemitsu and Kanemitsu can go to Chogi. They are horizontal moves. But Rai Kuniyuki can also go to Rai Kunitsugu. So this is a two generation jump from the middle Kamakura period to the Nanbokucho.

In all of these cases the quality is in the right ballpark still even though the periods or schools change.

If sugata or jihada were 100% accurate then the experts even inside the NBTHK would not argue and reassign these things over time.

The second problem is that the data we have is an amalgamation of "facts" piled up over centuries of honoring the memory of sensei and not wanting to overturn the apple cart.

There is a Soshu Yukimitsu Tokubetsu Juyo tanto for sale on the internet right now. A client showed it to me and asked me what I thought and I said the attribution is wrong, just at a first glance while chatting, I said it's Sadamune.

Now why is a Sadamune carrying a Yukimitsu attribution at Tokuju? Well in the Edo period Yukimitsu was a larger bucket that a lot of Soshu work got dropped into, and he developed a reputation for a wide ranging style which resulted in even more stuff getting dropped into his bucket that shouldn't be there.

Into this bucket went many things that were Masamune, and Sadamune by current (and I think correct) viewpoints. The NBTHK will fix those when it sees them and is confident.

So I had a look at the setsumei on this after making the claim that it was Sadamune. I am not fluent in Japanese, but I saw the killer phrase 相州上工の作 (Soshu Joko no saku). This phrasing is used to avoid openly disagreeing with the judgment while still disagreeing with it. It says a high level Soshu smith made this. This is a way of disagreeing without strictly disagreeing (Yukimitsu is a high level Soshu smith): by not coming out and saying Yukimitsu made this, they are saying that there are other possibilities. Also, obviously they are thinking hard about those possibilities, because otherwise they could say Den Yukimitsu to indicate that there is just a little bit of plus or minus in it.

Now the kicker on this piece is that it has a Honami Kochu attribution on it. One of the two guys they are really not going to mess with. Even though he is wrong here, it means they need to remove it to reattribute it but there is too much respect for his work so it stays on which means they cannot do anything but confirm it.

Later on in this setsumei it goes on to mention Sadamune twice.

So anyone looking at the blade with enough modern knowledge should knee jerk on it the same way I did. And if you read between the lines in the setsumei they are bending over backwards to not disagree with Kochu but tactfully say it's really Sadamune.

So now if someone can buy that sword, it represents a possibly great value, as you can pay a Yukimitsu price and get a Sadamune... depending on what the dealer is asking, of course. But as a collector, even though this paper says on its face value what it says, you should be comfortable with what it is *trying* to say and understand that it is actually a Sadamune paper, though we have to treat it as a Yukimitsu paper.

The opposite happens when it is a Masamune and gets the Soshu Joko no saku because it means you need to allow Sadamune (improbable), Yukimitsu (more likely), Shizu (most likely), Go (possible) or Norishige (possible). If they thought it was Masamune they would not say Soshu Joko but this is again going to be found after a Honami attribution of some sort they don't feel comfortable with overwriting. The bad news for the buyer of this Masamune is that they are not getting a great price on a Masamune as they feel they are, if they didn't do their homework, they are paying a bonus price on a Shizu most likely.

These points illustrate that attribution trends change over time, and some of those have been cleaned up, and others can't. Those same trends are in the NBTHK Juyo work and what goes one way in 1962 may be seen differently now.

I believe probably there have been many gimei blades destroyed because they were outliers, that if you were patient enough and collected them instead, you'd get enough to see it's a group on their own of a signature style. But if every time you encounter one, see none like it, and so destroy it, you have a self fulfilling prophecy because you're destroying evidence instead of cataloging it and deciding later when you have a bigger data set.

So because it requires an advanced collector to look at these things and determine what the attribution really means over its face value, you get the gist of the problem. Bad data. You start confirming old biases that should not be confirmed and you train your machine with bad data. GIGO follows.

There are just too many exceptions and not enough clean data, and many things can come down to argument with both sides having an acceptable answer. But one has to go on the front of the paper and if that feeds a machine, you don't get the big picture.

The big picture is as much reading between the lines as the lines themselves and that requires a kind of study that not many are doing because the real dataset is very large.

That leads up to another main issue that you can't train something with 500 swords and get anything reasonable out of it, because you couldn't train a human out of 500 swords and get something reasonable out of that. You can just cover some of the main archetypes and hope for guesswork to stitch it together. Part of kantei is interactive. Getting it wrong, and a hint about where to go. In a team effort, it involves getting insight and guesses from your friends. A machine can't do this so well. Certainly not on a dataset of 500 items to cover 1,000 years of sword work over all schools and traditions.

The last key problem is that kantei is not accurate once the quality drops below a certain level. This is echoed by scholars who when brave will say that the lower level the school is the more fungible the answers are and the less the particular answer given matters other than that it is at the right level.

I hate this part because this is where I list some people's favorite schools and put them into 2nd and 3rd and last-tiers so I just won't do it this time. But as you get more into the branches, more into School that moved from A to B and so on, and away from the museum class stuff, the pure fact of it is that kantei is not reliable. And that when you read an answer like Ganmaku you need to understand it doesn't mean Ganmaku straight out again. It means "wastebin taxon":

https://en.wikipedia.org/wiki/Wastebasket_taxon

Read up.

Condell actually broke me in on that as I sent in a blade that came back as Ganmaku. He said that it meant that they didn't really know but it was just some lesser work from the later Muromachi period and so they put it in this bucket. And what he was describing was the wastebin taxon.

Even the GOKADEN shows this thinking as first, it's 5 historical koto "traditions" plus shinto plus shinshinto. Those traditions are not so well defined though and schools like Ko-Hoki are shoehorned in, and some like Enju or Mihara are influenced by multiple traditions and have to go somewhere.

So what it is is a convenience, a tool to help you understand but you need to break out of the tool when you're ready. Every region had its own way of doing things, and there are far more than 5 traditions. Majiwarimono is the magical bonus tradition sometimes not named which is effectively just that. A wastebin tradition of stuff that doesn't fit the model.

All of this means that the model is no good for science.

But it is good for learning because it is trying to cover the upper level swords, things we can make progress on learning how to determine the differences. The stuff getting dumped elsewhere is part of saying they are not really important enough to worry about too much to begin with.

Now if there were not so many issues of fuzz that need human judgment to intercede you could do better. It helps in machine learning to give the machine good data and then correct it with humans.

Show Google a picture and it says "small girl" and you correct it to "doll". It's not really learning it's just making a big database of hashing images into facts. These get seeded and then corrected by people over time. But it is really not that good. Show a picture of a girl in a bikini and it doesn't know what to say. Leg? If legs are prominent. Model? Bikini? What is the real subject there? Google cannot know.

This is a human thing to look at a picture of a girl in a bikini and know if it is a model selling you a bikini, or a bikini selling you a model, or if it's just a girl taking a selfie.

With our swords, the fuzz is too great ... so many blades jump away from archetypes but should go in with the archetypes, experts can't agree, attributions get changed over time, and the data you see requires vast interpretation to come to better conclusions on what it is. This whole thing is quicksand and is not an ideal foundation on which to build machine learning.

You need to read every NBTHK Juyo and Tokuju judgment in order to understand that the world is a lot more complex than just looking at the sugata and the hamon and going bang, this is made by X during Y period.

In my opinion. I wouldn't try to do it. A true AI can do it but AI is an abused term now. Sells clicks. Nobody has even begun to write a true AI because it's not clear where to go.

I have written "AI programs". I wrote an Othello-playing program that could beat the hell out of programs running on 10x larger systems. I wrote genetic algorithms to train robot ants to follow a trail of bread crumbs, to test Java and see if I wanted to do anything interesting in it (answer: strong no). Some AI, like Google's or the recently famous Go-playing program, is mostly a hardware solution: throw a metric ton of old-fashioned brute-force calculation at the problem. It's not thinking or doing anything very close to what a human is doing. But if you get 1000 GPUs and hook them up to run machine code you hand-tuned to defeat one single game, you have built a contraption that solves Go mathematically.

Math exists outside of intelligence. It's a fact of the world. So I generally frown on the desire for everyone to use it like it is a magic solution to everything.

Prolog is a language that was meant to be an inference engine: you program it with facts and it makes inferences from them. This works well for making diagnoses on sick patients, for instance. It is another form of AI programming, and the language was structured to make this easy for the programmer.

But for this to work, you need reliable descriptions of symptoms and diseases. The system can then plug in the inferences and come up with good proposals about what the patient is sick with.
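For readers who have not seen this style of programming, here is a minimal sketch of the idea, written in Python rather than Prolog. It uses simple forward chaining, which is one way to derive conclusions from facts and rules and is not how Prolog itself evaluates queries. The symptom names and rules are toy values invented for the example, and forward_chain is a hypothetical helper.

```python
def forward_chain(facts, rules):
    """Apply rules repeatedly until no new conclusion can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires when all of its premises are already known.
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known


# Toy rules, invented for illustration: (set of premises, conclusion).
RULES = [
    (frozenset({"fever", "cough"}), "respiratory_infection"),
    (frozenset({"respiratory_infection", "chest_pain"}), "suspect_pneumonia"),
    (frozenset({"sneezing", "itchy_eyes"}), "suspect_allergy"),
]

derived = forward_chain({"fever", "cough", "chest_pain"}, RULES)
print(sorted(derived))
```

The second rule only fires because the first one has already added its conclusion, which is the chaining part. It also shows the dependency described above: drop or mislabel one input fact and the conclusions silently change.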

That is like 1985 tech and it does work.

In our problem though we just don't have that solid background of well known and universally agreed on facts. We deal with a lot of theory and ultimately this engine is going to come back at best with vague answers that people can get at a glance anyway. If there were solid facts (authors contradict each other all the time) you could do it.

Edit: also I will say that recent papers from the NBTHK that say things like "Goto" and don't give a generation or branch are the work of brave souls trying to change things by going no further than their level of confidence and not going into speculation. I got "Momoyama Goto" on one, which means it is going to be one of the 4th, 5th, or 6th generation. Ko-Goto we know already. This is going to become more common as it is more conservative and so more defensible. It is less crowd pleasing ("argh, I already knew it was Goto!") but at least you are now confirmed as knowing all there is to reliably know about it. They didn't say more because more is speculating.
